Performance Problems

Timon · September 16, 2024, 2:38pm

CMK version: 2.1.0p38.cee
Operating system version: CentOS7

Our Checkmk is slow: any change takes a long time. Sometimes the server also crashes and is shown as dead.
When I run omd status, I see that some services restart at random.

Where should I look to debug performance in Checkmk?

Glowsome · September 16, 2024, 3:24pm

Hi there, and welcome to the forum.

Two observations to begin with:

Your OS (CentOS 7) has been EOL since June 30, 2024.
Your Checkmk version is out of active maintenance (see https://docs.checkmk.com/latest/en/cmk_versions.html ), and is even getting close to the end of passive maintenance!

Given the above, I would recommend starting with updates/upgrades, as newer versions solve a lot of issues.

Besides that, you have not given any details about the sizing of the server CMK is running on, which makes it very hard for us on the forum to even make a wild guess.

Please share some more details.

Glowsome

mike1098 · September 17, 2024, 6:24am

Hi, we are running about 300 instances with 2.1.0 on CentOS 7 without any issues. I recommend starting with the log files in ~/var/log of the site user. Which log matters depends on the service that is crashing, but web.log and cmc.log are good starting points. In the global settings you can even raise the log level to debug for certain services.
Besides that, you need a minimum of 4 cores, 8 GByte of RAM, and free disk space.
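
As a starting point for that log review, here is a minimal sketch that counts suspicious lines per log file. The function name count_log_errors and the search pattern (error, crit, traceback) are assumptions for illustration, not something Checkmk defines; adjust the pattern to what your logs actually contain.

```shell
# Count lines that look like errors in every *.log file of a directory
# and print "<count> <file>" sorted by count, highest first.
# Usage (as the site user): count_log_errors ~/var/log
count_log_errors() {
    log_dir="$1"
    for f in "$log_dir"/*.log; do
        [ -f "$f" ] || continue
        # grep -c prints the number of matching lines, -i ignores case;
        # "|| true" keeps the loop going when a file has no matches.
        n=$(grep -ciE 'error|crit|traceback' "$f" || true)
        printf '%s %s\n' "$n" "$(basename "$f")"
    done | sort -rn
}
```

A high count only tells you which file to read first; the lines themselves still need to be looked at.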

regards

Michael

Timon · September 17, 2024, 6:48am

Thanks

We have 4 CMK servers with a total of 634 hosts.
The server that is slow has:
346 hosts
20 CPUs
32 GB RAM

I checked the logs, but I am not sure what is causing the performance problems.

1

In ~/var/log/cmc.log, we get multiple errors like this:

[rrdcached] [rrdcached at "/omd/sites/$site/tmp/run/rrdcached.sock"] [log] -1 No such file: /data/omd/sites/$site/var/pnp4nagios/perfdata/$hostname/Filesystem__dev_shm_fs_size.rrd

I found this documentation: https://checkmk.atlassian.net/wiki/spaces/KB/pages/9471122/How-to+debug+graphs+not+being+created, but the command from it does not work; it says mysite is not found. Is mysite a Checkmk command?
for i in $( find ~ /var/check_mk/rrd -name *.info); do mysite -e ${i%info}rrd || echo $i; done

2

In ~/var/log/alerts.log I see that the global handler is timing out. This is not helpful to me; can you tell what it means?

2024-09-16 08:44:34,835 [20] [cmk.base.alert_handling] ----------------------------------------------------------------------
2024-09-16 08:44:34,835 [20] [cmk.base.alert_handling] Starting alert handler helper.
2024-09-16 08:44:34,835 [20] [cmk.base.alert_handling] Global handler timeout: 60 sec (TERM), 120 sec (KILL)
2024-09-16 08:44:34,877 [20] [cmk.base.events] Starting in keepalive mode with PID 14682

mike1098 · September 17, 2024, 9:06am

Is rrdcached running?
What does rrdcached.log say?

Timon · September 17, 2024, 9:28am

When I run omd status, rrdcached is running.

And rrdcached.log is empty!

In the GUI, I can see that it sometimes stops and then starts again:

Service alert	2024-09-13 11:02:44	SERVICE ALERT	HARD (OK)	running
Service alert	2024-09-13 11:01:36	SERVICE ALERT	SOFT (CRITICAL)	partially running, stopped services: rrdcached, cmc
Service alert	2024-09-13 08:51:17	SERVICE ALERT	SOFT (CRITICAL)	partially running, stopped services: rrdcached, cmc
Service alert	2024-09-13 08:02:57	SERVICE ALERT	HARD (OK)	running
Service alert	2024-09-13 08:01:45	SERVICE ALERT	SOFT (CRITICAL)	partially running, stopped services: rrdcached, cmc
Service alert	2024-09-13 07:36:12	SERVICE ALERT	HARD (OK)	running
Service alert	2024-09-13 07:35:03	SERVICE ALERT	SOFT (CRITICAL)	partially running, stopped services: rrdcached, apache, dcd, redis, xinetd, crontab
Service alert	2024-09-13 07:31:39	SERVICE ALERT	HARD (OK)	running
Service alert	2024-09-13 07:30:27	SERVICE ALERT	SOFT (CRITICAL)	partially running, stopped services: cmc
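
To see which services are affected most often, the alert lines above can be tallied. A rough sketch, assuming the lines were copied from the GUI into a text file (alerts.txt is a hypothetical name); count_stopped_services is just an illustrative helper:

```shell
# Read alert lines on stdin and print "<service> <count>" for every
# service listed after "stopped services:", most frequent first.
# Usage: count_stopped_services < alerts.txt
count_stopped_services() {
    # Keep only the text after "stopped services:" (other lines are dropped),
    # put each service on its own line, trim spaces, then count duplicates.
    sed -n 's/.*stopped services: *//p' |
        tr ',' '\n' |
        sed 's/^ *//; s/ *$//' |
        grep -v '^$' |
        sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}
```

For the nine lines above, rrdcached and cmc each show up four times, and the other services once each.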

mike1098 · September 17, 2024, 10:42am

Maybe you can try the following.

Find the rrdcached process with ps. Copy its command line, kill the process, and start it again with -g -V LOG_DEBUG added.

That way rrdcached runs in the foreground with debug logging.

On my system it shows something like this:

starting up
checking for journal files
replaying from journal: /opt/omd/sites/test/var/rrdcached/rrd.journal.1726560620.946546
Replayed 12126 entries (64296 failures)
replaying from journal: /opt/omd/sites/test/var/rrdcached/rrd.journal.1726567820.947308
Replayed 13653 entries (2873 failures)
replaying from journal: /opt/omd/sites/test/var/rrdcached/rrd.journal.1726569365.484252
Replayed 1300 entries (0 failures)
started new journal /opt/omd/sites/test/var/rrdcached/rrd.journal.1726569515.466606
journal processing complete
listening for connections
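
The restart described at the top of this post could look roughly like the sketch below. The helper names are made up for illustration; -g and -V LOG_DEBUG are the options mentioned above, and whether -V is available depends on your rrdcached version.

```shell
# Show PID and full command line of rrdcached processes owned by the
# current user. The [r] trick keeps grep from matching itself.
show_rrdcached_cmdline() {
    ps -u "$(id -un)" -o pid=,args= | grep '[r]rdcached'
}

# Append the foreground (-g) and debug log level (-V LOG_DEBUG) options
# to a command line copied from the ps output, and print the result.
debug_cmdline() {
    printf '%s -g -V LOG_DEBUG\n' "$1"
}

# Example (run as the site user, after the running rrdcached was stopped):
#   debug_cmdline "/omd/sites/test/bin/rrdcached -j /omd/sites/test/var/rrdcached"
```

debug_cmdline only prints the new command line, so you can review it before running it by hand.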

Timon · September 17, 2024, 3:14pm

Hello mike,

Yes, I get a lot of errors.

starting up
checking for journal files
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726565091.678467
Replayed 25358 entries (1328731 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726572291.678848
Replayed 307534 entries (975047 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726579417.913437
Replayed 515367 entries (308101 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726584088.666084
Replayed 0 entries (0 failures)
replaying from journal: /data/omd/sites/$site/var/rrdcached/rrd.journal.1726584214.798012
Replayed 22249 entries (0 failures)
started new journal /data/omd/sites/$site/var/rrdcached/rrd.journal.1726584404.961194
journal processing complete
listening for connections
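
To put numbers on this, the replay lines can be summed. A small sketch, assuming the foreground output was saved to a file (replay.log is a hypothetical name); summarize_replay is an illustrative helper, not part of rrdcached:

```shell
# Sum the counters of all "Replayed N entries (M failures)" lines on stdin.
# Usage: summarize_replay < replay.log
summarize_replay() {
    awk '/^Replayed [0-9]+ entries/ {
             ok += $2
             gsub(/[^0-9]/, "", $4)   # "(1328731" becomes "1328731"
             failed += $4
         }
         END { printf "replayed=%d failed=%d\n", ok, failed }'
}
```

For the output above this gives replayed=870508 failed=2611879, so roughly three failed updates for every replayed one.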

In the file /data/omd/sites/$site/var/rrdcached/rrd.journal.1726565091.678467 there are a lot of errors like this:
update /data/omd/sites/$site/var/check_mk/rrd/$hostname/NTP.rrd 1726565295:-0.000899:2:2:0:0

and this

update /data/omd/sites/$site/var/pnp4nagios/perfdata/$hostname/Notify_number_of_notifications_per_contact_w_20_c_50_genomics_servicedesk_num.rrd 1726565295:22
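
Since cmc.log complains about missing files, one way to narrow this down is to check which RRD paths named in a journal no longer exist on disk. A minimal sketch; missing_rrds_in_journal is a hypothetical helper, and it assumes the path is the second whitespace-separated field of each update line, as in the two examples above:

```shell
# Print every RRD path referenced by "update" lines in a journal file
# that does not exist on disk. Each path is checked only once.
# Usage: missing_rrds_in_journal /path/to/rrd.journal.file
missing_rrds_in_journal() {
    journal="$1"
    awk '$1 == "update" { print $2 }' "$journal" | sort -u |
        while IFS= read -r rrd; do
            [ -e "$rrd" ] || printf '%s\n' "$rrd"
        done
}
```

If most of the listed paths sit under one directory, that directory is a good place to look next.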

mike1098 · September 17, 2024, 7:19pm

The $site looks wrong to me. In my case it's the real full path.
Can you post the command line from the ps output, please?

mike1098 · September 17, 2024, 7:30pm

Check the following lines in the file ~/etc/init.d/rrdcached:

# Please do not touch the code below
CACHE_DIR="/omd/sites/test/tmp/rrdcached"
JOURNAL_DIR="/omd/sites/test/var/rrdcached"
SOCKET="/omd/sites/test/tmp/run/rrdcached.sock"
PIDFILE="/omd/sites/test/tmp/rrdcached.pid"
LOGFILE="/omd/sites/test/var/log/rrdcached.log"
USER="test"
GROUP="test"

My site name is ‘test’; please replace it with your site name.

Timon · September 18, 2024, 6:16am

This file looks exactly like yours.

These are the lines in my ~/etc/init.d/rrdcached:


# Please do not touch the code below
CACHE_DIR="/omd/sites/myserver/tmp/rrdcached"
JOURNAL_DIR="/omd/sites/myserver/var/rrdcached"
SOCKET="/omd/sites/myserver/tmp/run/rrdcached.sock"
PIDFILE="/omd/sites/myserver/tmp/rrdcached.pid"
LOGFILE="/omd/sites/myserver/var/log/rrdcached.log"
USER="myserver"
GROUP="myserver"
OPTS="-t $WRITE_THREADS -w $TIMEOUT -z $RANDOM_DELAY -f $FLUSH_TIMEOUT -s $GROUP -m 660 -l unix:$SOCKET -p $PIDFILE -j $JOURNAL_DIR -o $LOGFILE"
DAEMON="/omd/sites/myserver/bin/rrdcached"

“/omd/sites/myserver/tmp/rrdcached” is empty.

And the command line is:

/omd/sites/myserver/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 -s myserver -m 660 -l unix:/omd/sites/myserver/tmp/run/rrdcached.sock -p /omd/sites/myserver/tmp/rrdcached.pid -j /omd/sites/myserver/var/rrdcached -o /omd/sites/myserver/var/log/rrdcached.log

I replaced the site name with mysite.

robin.gierse · October 2, 2024, 8:49am

This should solve your RRD issue at least. “mysite” is a placeholder for the name of your site.
In general, when looking at the official user guide or the knowledge base, strings starting with “my” always indicate a placeholder.

Dirk · October 2, 2024, 10:04am

@robin.gierse I’ve seen you edited that KB article. Could you please also replace the wrong command mysite with test:

@Timon is not the first one who stepped into this trap:

robin.gierse · October 2, 2024, 10:22am

Done, thanks for noticing! That totally slipped through my proofreading.
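
For reference, the check with test in place of mysite can be sketched as a small function. The name find_info_without_rrd is made up for illustration; the logic follows the loop quoted earlier in the thread: for every .info file, report it if no .rrd file with the same base name exists.

```shell
# List all *.info files below a directory that have no matching *.rrd file.
# Usage (as the site user): find_info_without_rrd ~/var/check_mk/rrd
find_info_without_rrd() {
    rrd_root="$1"
    find "$rrd_root" -name '*.info' | while IFS= read -r info; do
        # ${info%info} strips the trailing "info", so "x.info" becomes "x."
        # and the file tested here is "x.rrd".
        test -e "${info%info}rrd" || printf '%s\n' "$info"
    done
}
```

Quoting '*.info' keeps the shell from expanding the pattern before find sees it, which the unquoted original would do if a matching file existed in the current directory.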

Timon · October 2, 2024, 10:33am

Thanks, I was already wondering.

The RRDs are fixed now, thanks.