Process monitoring to find out what caused critical load on a server

**CMK version:** cee 2.2 p14
**OS version:** ubuntu 16.04-22.04 (all lts editions)

I tried the approach from “Per process CPU monitoring - #12 by rawiriblundell” and got it working, but I can’t find any history data anywhere.
Say, for example, that every night around 22:00 the server warns of critical load. We would want to know which process was causing the issue at that time. Any recommendations on how to get this data in Checkmk?

Hi Lilian,

Go to your service detail view and then to “Events of service”:

image

(view_name=svcevents)

I’m in
“Monitor > Overview > All hosts > server x > Services of Host > Service > Events of service server x, Top_5_CPU_No.1”
(the one in your picture), but it’s empty?

Did your service already change from OK to something else? If not, then you won’t find anything there.

PS: You can fake a check result so as to verify that you will see anything in this view.
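If you prefer to do this from the command line rather than through the GUI, here is a rough sketch of how such a faked result could be built. It assumes the site's core accepts the Nagios-style external command `PROCESS_SERVICE_CHECK_RESULT` through Livestatus, and that the Livestatus socket sits at `tmp/run/live` inside the site directory. Both are assumptions to verify on your own site. The function names and the host `serverx` are made up for illustration.

```python
import socket
import time


def fake_check_command(host, service, state, output, now=None):
    """Build a Livestatus COMMAND line that submits a passive check result.

    state: 0=OK, 1=WARN, 2=CRIT, 3=UNKNOWN.
    """
    ts = int(time.time() if now is None else now)
    return (
        f"COMMAND [{ts}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{state};{output}\n"
    )


def send_to_livestatus(socket_path, command):
    """Send one command line to a Livestatus UNIX socket.

    socket_path is an assumption, e.g. "<site home>/tmp/run/live".
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("utf-8"))


if __name__ == "__main__":
    # Only prints the command; sending it is left to the reader.
    print(fake_check_command("serverx", "Top_5_CPU_No.1", 2, "faked CRIT"), end="")
```

The next regular check will overwrite the faked state again, but the state change itself should then show up under “Events of service”.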

The problem is that this service never goes critical. There’s a separate service that monitors overall load (which goes critical when something is wrong), and this one just shows the top 5 processes at any given moment. But as some other people were thinking, it’s fairly useless data if I can’t look back in the history to see what the service showed at a specific moment.

Everyone kept mentioning the service history, but if it doesn’t show which processes were in the top 5 at what moment, then I guess I can remove it and will have to figure out another way.

Hello Lilian. I also had my doubts about the “TOP5 Local Script”, as it does not produce a useful service event history (tested on 2.1.0 cee). So I did some research and may have found an out-of-the-box way to analyse the cause of critical load on individual servers. I created an enforced service, “State and count of processes”, with the parameters below to get insight into the currently running processes. The alerting thresholds can be adjusted as needed.

The service output now contains the agent’s long output:

Next, I enabled keeping the long output in the monitoring history, using the rule “Write long output of services to monitoring history” for the ps service.

By extending the view “History → Service Events” with
image

I can see the complete agent long output at the moment of a state change in the service summary.

I would suggest doing this only on the critical hosts and for a limited timeframe, as it can affect overall cmk performance.
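The same principle (history entries are only written on a state change, with the long output attached) could probably be applied to a local check as well. This is just a sketch, not something I have tested. It assumes the usual local check line format (`<state> "<service name>" <metrics> <text>`) and that a literal `\n` in the text starts the detail lines. The service name `Load_with_top_procs`, the thresholds and the sample process list are invented for illustration.

```python
def local_check_line(load1, warn, crit, top_procs, service="Load_with_top_procs"):
    """Build one local check line whose state follows the 1-minute load.

    top_procs is a list of (cpu_percent, process_name) tuples. Because the
    state changes with the load, the process list ends up attached to the
    state change in the service history.
    """
    if load1 >= crit:
        state = 2  # CRIT
    elif load1 >= warn:
        state = 1  # WARN
    else:
        state = 0  # OK
    summary = f"load1 {load1:.2f} (warn/crit at {warn}/{crit})"
    # Detail lines are separated by a literal backslash followed by "n".
    details = "\\n".join(f"{cpu:.1f}% {name}" for cpu, name in top_procs)
    return f'{state} "{service}" load1={load1:.2f};{warn};{crit} {summary}\\n{details}'


if __name__ == "__main__":
    # Hypothetical sample data; a real script would read the load and the
    # process list from the system (for example os.getloadavg() and ps).
    sample = [(80.0, "backup.sh"), (10.5, "mysqld")]
    print(local_check_line(9.5, 4, 8, sample))
```

Unlike the Top 5 script, this service would go WARN or CRIT together with the load, so there would be an event to look up afterwards.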

Thanks, I’ll try this!