Performance Problems

CMK version: 2.1.0p38.cee
Operating system version: CentOS 7

Our Checkmk is slow. Any change takes a long time. Sometimes the server also crashes and is displayed as dead.
When I run omd status, I see that some services restart randomly.

Where should I look to debug performance in Checkmk?

Hi there, and welcome to the forum.

Two observations :eyes: to begin with:

  1. Your OS (CentOS 7) has been EOL since June 30, 2024.
  2. Your Checkmk version is out of active maintenance (see https://docs.checkmk.com/latest/en/cmk_versions.html) and is even getting close to the end date of passive maintenance!

Given the above, I would recommend starting with updates/upgrades, as newer versions solve a lot of issues.

Beyond that, you have not given any details about the sizing of the server Checkmk is running on, which makes it very hard for us on the forum to even make a wild guess.

Please share some more details :slight_smile:

  • Glowsome

Hi, we are running about 300 instances with 2.1.0 on CentOS 7 without any issues. I would recommend starting with the log files in ~/var/log of the site user. Which ones matter depends on the service that is crashing, but I recommend starting with web.log and cmc.log. In the global settings you can even raise the log level to debug for certain services.
Besides that, you need a minimum of 4 cores, 8 GB of RAM, and free disk space.

regards

Michael
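As a rough starting point for the log review suggested above, the following sketch counts suspicious lines per log file. The function name `count_log_problems` and the keyword list are assumptions chosen for illustration; they are not a Checkmk-defined format, so treat the counts as a hint about where to read first.

```shell
# Count lines that look like problems in each given log file.
# Assumption: the keyword list is a heuristic, not an official Checkmk log format.
count_log_problems() {
    for log in "$@"; do
        # Skip files that do not exist or are not readable.
        [ -r "$log" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
        n=$(grep -c -i -E 'error|critical|crash|timeout' "$log" || true)
        printf '%s %s\n' "$n" "$log"
    done
}
```

Run as the site user, for example `count_log_problems ~/var/log/web.log ~/var/log/cmc.log | sort -rn`, to see which log is noisiest.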

Thanks

We have 4 CMK servers with a total of 634 hosts.
The server that is slow has:
346 hosts
20 CPUs
32 GB RAM

I checked the logs, but I am not sure what is causing the performance problems.

1

In ~/var/log/cmc.log, we get multiple errors like this:

[rrdcached] [rrdcached at "/omd/sites/$site/tmp/run/rrdcached.sock"] [log] -1 No such file: /data/omd/sites/$site/var/pnp4nagios/perfdata/$hostname/Filesystem__dev_shm_fs_size.rrd

I found this documentation: https://checkmk.atlassian.net/wiki/spaces/KB/pages/9471122/How-to+debug+graphs+not+being+created, but the command from it does not work; it says mysite not found. Is mysite a Checkmk command?
for i in $( find ~ /var/check_mk/rrd -name *.info); do mysite -e ${i%info}rrd || echo $i; done

2

In ~/var/log/alerts.log I see that the global handler seems to be timing out. This is not helpful for me; can you tell what it means?

2024-09-16 08:44:34,835 [20] [cmk.base.alert_handling] ----------------------------------------------------------------------
2024-09-16 08:44:34,835 [20] [cmk.base.alert_handling] Starting alert handler helper.
2024-09-16 08:44:34,835 [20] [cmk.base.alert_handling] Global handler timeout: 60 sec (TERM), 120 sec (KILL)
2024-09-16 08:44:34,877 [20] [cmk.base.events] Starting in keepalive mode with PID 14682

Is rrdcached running?
What does rrdcached.log say?

When I run omd status, rrdcached is running.

And rrdcached.log is empty!

In the GUI, I can see that it sometimes stops and then starts again:

Service alert 2024-09-13 11:02:44 SERVICE ALERT HARD (OK) running
Service alert 2024-09-13 11:01:36 SERVICE ALERT SOFT (CRITICAL) partially running, stopped services: rrdcached, cmc
Service alert 2024-09-13 08:51:17 SERVICE ALERT SOFT (CRITICAL) partially running, stopped services: rrdcached, cmc
Service alert 2024-09-13 08:02:57 SERVICE ALERT HARD (OK) running
Service alert 2024-09-13 08:01:45 SERVICE ALERT SOFT (CRITICAL) partially running, stopped services: rrdcached, cmc
Service alert 2024-09-13 07:36:12 SERVICE ALERT HARD (OK) running
Service alert 2024-09-13 07:35:03 SERVICE ALERT SOFT (CRITICAL) partially running, stopped services: rrdcached, apache, dcd, redis, xinetd, crontab
Service alert 2024-09-13 07:31:39 SERVICE ALERT HARD (OK) running
Service alert 2024-09-13 07:30:27 SERVICE ALERT SOFT (CRITICAL) partially running, stopped services: cmc

Maybe you can try the following.

Find the rrdcached process with ps. Kill the process, copy the command line, and add -g -V LOG_DEBUG.

That way rrdcached runs in the foreground with debug logging.

On my system it shows something like this:

starting up
checking for journal files
replaying from journal: /opt/omd/sites/test/var/rrdcached/rrd.journal.1726560620.946546
Replayed 12126 entries (64296 failures)
replaying from journal: /opt/omd/sites/test/var/rrdcached/rrd.journal.1726567820.947308
Replayed 13653 entries (2873 failures)
replaying from journal: /opt/omd/sites/test/var/rrdcached/rrd.journal.1726569365.484252
Replayed 1300 entries (0 failures)
started new journal /opt/omd/sites/test/var/rrdcached/rrd.journal.1726569515.466606
journal processing complete
listening for connections
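The procedure described above (look up the running command line, then rerun it with the two extra flags) could be sketched roughly like this. The function names are hypothetical, and the sketch assumes a procps-style `ps` and that the site user has the same name as the site.

```shell
# Print the command line of a running rrdcached owned by the given user
# (prints nothing if no such process is found).
# Assumption: procps-style ps; the site user name equals the site name.
rrdcached_cmdline() {
    ps -u "$1" -o args= 2>/dev/null | grep '[r]rdcached' | head -n 1
}

# Append the foreground (-g) and debug log level (-V LOG_DEBUG) flags
# mentioned in the post to an existing command line.
debug_cmdline() {
    printf '%s -g -V LOG_DEBUG\n' "$1"
}
```

For example, `debug_cmdline "$(rrdcached_cmdline myserver)"` prints the command to run by hand as the site user, after the running daemon has been stopped.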


Hello mike,

Yes, I get a lot of errors.

starting up
checking for journal files
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726565091.678467
Replayed 25358 entries (1328731 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726572291.678848
Replayed 307534 entries (975047 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726579417.913437
Replayed 515367 entries (308101 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726584088.666084
Replayed 0 entries (0 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726584214.798012
Replayed 22249 entries (0 failures)
started new journal /data/omd/sites/$site/var/rrdcached/rrd.journal.1726584404.961194
journal processing complete
listening for connections

In the file /data/omd/sites/$site/var/rrdcached/rrd.journal.1726565091.678467 there are a lot of errors like this:
update /data/omd/sites/$site/var/check_mk/rrd/$hostname/NTP.rrd 1726565295:-0.000899:2:2:0:0

and this

update /data/omd/sites/$site/var/pnp4nagios/perfdata/$hostname/Notify_number_of_notifications_per_contact_w_20_c_50_genomics_servicedesk_num.rrd 1726565295:22
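To put a number on the replay output above, the "Replayed N entries (M failures)" lines can be summed up. This is a minimal sketch with a hypothetical function name, assuming the foreground output has been saved to a file or is piped in unchanged.

```shell
# Sum replayed entries and failures from rrdcached startup output on stdin.
# Prints "<total entries> <total failures>".
sum_replay() {
    awk '/^Replayed [0-9]+ entries/ {
             entries += $2
             gsub(/[^0-9]/, "", $4)   # "(1328731" -> "1328731"
             failures += $4
         }
         END { printf "%d %d\n", entries, failures }'
}
```

A failure total far above the number of replayed entries, as in the output above, suggests that most journal updates point at RRD files that do not exist.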

The $site looks wrong to me. In my case it's the real full path.
Can you post the command line from the ps output, please?

Check the following lines in the file ~/etc/init.d/rrdcached:

# Please do not touch the code below
CACHE_DIR="/omd/sites/test/tmp/rrdcached"
JOURNAL_DIR="/omd/sites/test/var/rrdcached"
SOCKET="/omd/sites/test/tmp/run/rrdcached.sock"
PIDFILE="/omd/sites/test/tmp/rrdcached.pid"
LOGFILE="/omd/sites/test/var/log/rrdcached.log"
USER="test"
GROUP="test"

My site name is ‘test’; please replace it with your site name.

This file looks exactly like yours.

These are the lines from ~/etc/init.d/rrdcached:


# Please do not touch the code below
CACHE_DIR="/omd/sites/myserver/tmp/rrdcached"
JOURNAL_DIR="/omd/sites/myserver/var/rrdcached"
SOCKET="/omd/sites/myserver/tmp/run/rrdcached.sock"
PIDFILE="/omd/sites/myserver/tmp/rrdcached.pid"
LOGFILE="/omd/sites/myserver/var/log/rrdcached.log"
USER="myserver"
GROUP="myserver"
OPTS="-t $WRITE_THREADS -w $TIMEOUT -z $RANDOM_DELAY -f $FLUSH_TIMEOUT -s $GROUP -m 660 -l unix:$SOCKET -p $PIDFILE -j $JOURNAL_DIR -o $LOGFILE"
DAEMON="/omd/sites/myserver/bin/rrdcached"

“/omd/sites/myserver/tmp/rrdcached” is empty,

and the command is

/omd/sites/myserver/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 -s myserver -m 660 -l unix:/omd/sites/myserver/tmp/run/rrdcached.sock -p /omd/sites/myserver/tmp/rrdcached.pid -j /omd/sites/myserver/var/rrdcached -o /omd/sites/myserver/var/log/rrdcached.log

I replaced the site name with myserver.

This should solve your RRD issue at least. “mysite” is a placeholder for the name of your site.
In general, when looking at the official user guide or the knowledge base, strings starting with “my” always indicate a placeholder.


@robin.gierse I’ve seen you edited that KB article. Could you please also replace the wrong command mysite with test:

@Timon is not the first one who stepped into this trap:


Done, thanks for noticing! That totally slipped through my proofreading. :see_no_evil:
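For readers who hit the same trap: with the shell builtin `test` in place of `mysite`, the check from the KB article can be sketched as a small function. The function name `find_missing_rrds` is made up for this example, and the quoting is slightly stricter than in the original one-liner.

```shell
# List .info files below a directory that have no matching .rrd file next to them.
# Uses the shell builtin "test -e", which is what "mysite -e" was meant to be.
find_missing_rrds() {
    find "$1" -name '*.info' | while IFS= read -r info; do
        # ${info%info}rrd turns ".../name.info" into ".../name.rrd".
        test -e "${info%info}rrd" || printf '%s\n' "$info"
    done
}
```

As the site user, `find_missing_rrds ~/var/check_mk/rrd` prints every .info file whose RRD is missing; no output means nothing is missing.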


Thanks, I was already wondering :smiley:

The RRDs are fixed now. Thanks!
