CMK Raw 2.3.0p6 site stopped/partially running

CMK version: 2.3.0p6
OS version: Debian 11
Error message:

I updated our CMK Raw server to 2.3.0p6 last week (I had already been facing this problem on 2.2.0).
During the night of 27 to 28 June our site stopped completely. Now, for the second time (the first time was this morning at 09:18), the site is only partially running, with the rrdcached service stopped.

systemctl status doesn't give much more information:

root@checkmk:~# systemctl status check-mk-raw-2.3.0p6.service 
* check-mk-raw-2.3.0p6.service - LSB: OMD sites
     Loaded: loaded (/etc/init.d/check-mk-raw-2.3.0p6; generated)
     Active: active (exited) since Fri 2024-06-28 09:18:24 CEST; 3 days ago
       Docs: man:systemd-sysv-generator(8)
      Tasks: 0 (limit: 309029)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/check-mk-raw-2.3.0p6.service

Jun 28 09:18:24 checkmk systemd[1]: Starting LSB: OMD sites...
Jun 28 09:18:24 checkmk check-mk-raw-2.3.0p6[112]: OMD autostart disabled, skipping ...
Jun 28 09:18:24 checkmk systemd[1]: Started LSB: OMD sites.

Is there any log that might show why it is stopping?

Hello @peterge

Check the logs in the /opt/omd/sites/{site_name}/var/log/ directory. Start with the rrdcached log to identify the issue.
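
As a rough sketch of that check: the snippet below lists the log files that are non-empty and were modified during the last 24 hours, which narrows down where to look first. The site name mysite is only a placeholder, the function name is made up for illustration, and -mmin assumes GNU find as shipped with Debian.

```shell
#!/bin/sh
# Sketch only: SITE is a placeholder, replace it with your own site name.
SITE="${SITE:-mysite}"
LOG_DIR="/opt/omd/sites/$SITE/var/log"

# List non-empty regular files under directory $1 that were modified
# within the last $2 minutes (-mmin is a GNU/BSD find extension).
recent_nonempty_logs() {
    find "$1" -type f -size +0 -mmin "-$2" | sort
}

# Only look at the real log directory when called with --run.
if [ "${1:-}" = "--run" ]; then
    recent_nonempty_logs "$LOG_DIR" 1440
fi
```

Rotated .gz files are not empty even when the original log was, so they can show up in the list as well.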

Regards,
DD

root@checkmk:~# ls -l /opt/omd/sites/misoft/var/log/rrdcached.log*
-rw-r----- 1 misoft misoft  0 Jul  1 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log
-rw-r----- 1 misoft misoft  0 Jun 30 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.1
-rw-r----- 1 misoft misoft 20 Jun 23 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.2.gz
-rw-r----- 1 misoft misoft 20 Jun 22 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.3.gz
-rw-r----- 1 misoft misoft 20 Jun 21 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.4.gz
-rw-r----- 1 misoft misoft 20 Jun 20 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.5.gz
-rw-r----- 1 misoft misoft 20 Jun 19 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.6.gz
-rw-r----- 1 misoft misoft 20 Jun 17 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.7.gz
root@checkmk:~# cat /opt/omd/sites/misoft/var/log/rrdcached.log 
root@checkmk:~# cat /opt/omd/sites/misoft/var/log/rrdcached.log.1 
root@checkmk:~#

Those are empty…

Is the rrdcached service unable to start at all, or does it start at first and then stop afterwards?

It starts when I run omd start or omd restart, but it stops randomly…
The whole site stopped during the night of 27 to 28 June, and it stopped twice this morning. I did not do anything, I just ran omd start again.

You need to check all the logs to find out the reason.

I have faced the same issue in previous versions and never found the reason. Until the root cause is identified, we have written a script that checks “omd status” every 5 minutes and runs “omd start” if any service is reported as “STOPPED”.

You can apply the same kind of logic to avoid a long outage.
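
A minimal sketch of such a watchdog could look like the following. This is not the actual script mentioned above. It assumes that it runs as root, that mysite is replaced by the real site name, and that omd status prints one "name: state" line per service without colour codes when its output is piped. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: SITE is a placeholder, replace it with your own site name.
SITE="${SITE:-mysite}"

# Succeed (exit 0) if the status text on stdin reports a stopped service.
# Assumes lines such as "rrdcached:      stopped".
has_stopped_service() {
    grep -Eq ':[[:space:]].*stopped'
}

# Check the site and start it again if any service is stopped.
check_and_restart() {
    if omd status "$SITE" | has_stopped_service; then
        echo "$(date '+%F %T') stopped service in site $SITE, running omd start"
        omd start "$SITE"
    fi
}

# Only touch the site when called with --run (e.g. from cron).
if [ "${1:-}" = "--run" ]; then
    check_and_restart
fi
```

Called every 5 minutes from root's crontab with --run and its output appended to a file, this also leaves a record of when the site was found stopped, which can later be compared with the system log.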

Regards,
Dinesh

Yeah, I was thinking about implementing a small command like that as a cron job. I will check whether it occurs again over the next few days…

Did you rename the site or something similar? I had the same problem after renaming a site.

Nope, I did not rename anything.

My current solution is:

root@checkmk:~# crontab -l 
[...]
*/5 * * * * /bin/systemctl is-failed omd && echo "reboot through cron @$(date)" >> /root/cron.log && /sbin/reboot

root@checkmk:~# cat .bashrc 
[...]
echo ""
echo "root@checkmk:~# cat /root/cron.log"
[ -e /root/cron.log ] && cat /root/cron.log || echo "<log file does not exist, so no reboot did happen by cron>" && echo ""

Hello!
I have a similar issue, so I think it fits here. If not, I’ll open a separate thread:

CMK version: Cloud Edition (Free) 2.3.0p12
CMK core: CMC
Hosts: 12
Services: 117
OS version: Ubuntu 22.04 running as a VM on Proxmox

On Sunday morning and again today the site suddenly stopped working. I noticed it when I tried to access the web interface.

I listed the logs in /opt/omd/sites/{site_name}/var/log/. These log files are available:

  • agent-registration.log
  • alerts.log
  • cmc.log
  • dcd.log
  • diskspace.log
  • licensing.log
  • liveproxyd.log
  • mkeventd.log
  • mknotifyd.log
  • notify.log
  • redis-server.log
  • rrdcached.log
  • security.log
  • update.log
  • web.log

This is the part of cmc.log around the time the site stopped, at 05:59:

2024-08-19 05:41:16 [5] [core 2947] Executing external command: LOG;SERVICE NOTIFICATION: cmkadmin;Proxmox-2;Temperature Zone 5;OK;mail;Temperature: 68.0 °C
2024-08-19 05:41:18 [5] [core 2947] Executing external command: LOG;SERVICE NOTIFICATION RESULT: cmkadmin;Proxmox-2;Temperature Zone 5;OK;mail;Spooled mail to local mail transmission agent;Spooled mail to local mail transmission agent
2024-08-19 05:59:22 [5] [icmpreceiver 3002] terminated
2024-08-19 05:59:22 [5] [icmpsender 3001] exited normally
2024-08-19 05:59:24 [5] [notification helper 3000] exited normally
2024-08-19 05:59:27 [4] [alert helper 2973] forcibly terminating child process
2024-08-19 05:59:27 [3] [alert helper 2973] killed by signal 9
2024-08-19 05:59:27 [5] [main] [rrdcached at "/omd/sites/mrbyte_lan/tmp/run/rrdcached.sock"] stopping...
2024-08-19 05:59:27 [5] [rrdcached] [rrdcached at "/omd/sites/mrbyte_lan/tmp/run/rrdcached.sock"] closing connection
2024-08-19 05:59:27 [5] [rrdcached] [rrdcached at "/omd/sites/mrbyte_lan/tmp/run/rrdcached.sock"] stopped
2024-08-19 05:59:30 [4] [main] [RRD helper 2971] forcibly terminating child process
2024-08-19 05:59:30 [3] [main] [RRD helper 2971] killed by signal 9
2024-08-19 05:59:30 [5] [main] [carbon connection pool] stopping...
2024-08-19 05:59:30 [5] [carbon] [carbon connection pool] stopped
2024-08-19 05:59:30 [5] [main] [influxdb connection pool] stopping...
2024-08-19 05:59:30 [5] [influxdb] [influxdb connection pool] stopped
2024-08-19 05:59:30 [5] [core 2947] [main] stopping config cleaner...
2024-08-19 05:59:30 [5] [core 2947] [config cleaner] stopped
2024-08-19 05:59:30 [5] [core 2947] [main] stopping state saver...
2024-08-19 05:59:30 [5] [core 2947] [state saver] stopped
2024-08-19 05:59:30 [5] [generic pool] [helper 2977] exited normally
2024-08-19 05:59:30 [5] [generic pool] [helper 2978] exited normally
2024-08-19 05:59:31 [5] [generic pool] [helper 2979] exited normally
2024-08-19 05:59:31 [5] [generic pool] [helper 2980] exited normally
2024-08-19 05:59:31 [5] [generic pool] [helper 2981] exited normally
2024-08-19 05:59:34 [4] [checker pool] [helper 2982] forcibly terminating child process
2024-08-19 05:59:34 [3] [checker pool] [helper 2982] killed by signal 9
2024-08-19 05:59:37 [4] [checker pool] [helper 2983] forcibly terminating child process
2024-08-19 05:59:37 [3] [checker pool] [helper 2983] killed by signal 9
2024-08-19 05:59:40 [4] [checker pool] [helper 2984] forcibly terminating child process
2024-08-19 05:59:40 [3] [checker pool] [helper 2984] killed by signal 9
2024-08-19 05:59:43 [4] [checker pool] [helper 2985] forcibly terminating child process
2024-08-19 05:59:43 [3] [checker pool] [helper 2985] killed by signal 9
2024-08-19 05:59:45 [5] [real-time pool] [helper 2986] exited normally
2024-08-19 05:59:45 [5] [fetcher pool] [service "Checkmk;Check_MK"] [helper 2987] aborting running check
2024-08-19 05:59:45 [5] [fetcher pool] [helper 2987] terminated
2024-08-19 05:59:45 [5] [fetcher pool] [helper 2988] exited normally
2024-08-19 05:59:46 [5] [fetcher pool] [helper 2989] exited normally
2024-08-19 05:59:47 [5] [fetcher pool] [helper 2990] exited normally
2024-08-19 05:59:48 [5] [fetcher pool] [helper 2991] exited normally
2024-08-19 05:59:49 [5] [fetcher pool] [helper 2992] exited normally
2024-08-19 05:59:50 [5] [fetcher pool] [helper 2993] exited normally
2024-08-19 05:59:50 [5] [fetcher pool] [helper 2994] exited normally
2024-08-19 05:59:51 [5] [fetcher pool] [helper 2995] exited normally
2024-08-19 05:59:52 [5] [fetcher pool] [helper 2996] exited normally
2024-08-19 05:59:52 [5] [fetcher pool] [helper 2997] exited normally
2024-08-19 05:59:53 [5] [fetcher pool] [helper 2998] exited normally
2024-08-19 05:59:54 [5] [fetcher pool] [helper 2999] exited normally
2024-08-19 05:59:54 [5] [main] [livestatus manager] terminating 20 client threads...
2024-08-19 05:59:54 [5] [main] [client 0] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 0] terminated
2024-08-19 05:59:54 [5] [main] [client 1] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 1] terminated
2024-08-19 05:59:54 [5] [main] [client 2] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 2] terminated
2024-08-19 05:59:54 [5] [main] [client 3] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 3] terminated
2024-08-19 05:59:54 [5] [main] [client 4] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 4] terminated
2024-08-19 05:59:54 [5] [main] [client 5] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 5] terminated
2024-08-19 05:59:54 [5] [main] [client 6] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 6] terminated
2024-08-19 05:59:54 [5] [main] [client 7] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 7] terminated
2024-08-19 05:59:54 [5] [main] [client 8] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 8] terminated
2024-08-19 05:59:54 [5] [main] [client 9] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 9] terminated
2024-08-19 05:59:54 [5] [main] [client 10] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 10] terminated
2024-08-19 05:59:54 [5] [main] [client 11] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 11] terminated
2024-08-19 05:59:54 [5] [main] [client 12] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 12] terminated
2024-08-19 05:59:54 [5] [main] [client 13] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 13] terminated
2024-08-19 05:59:54 [5] [main] [client 14] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 14] terminated
2024-08-19 05:59:54 [5] [main] [client 15] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 15] terminated
2024-08-19 05:59:54 [5] [main] [client 16] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 16] terminated
2024-08-19 05:59:54 [5] [main] [client 17] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 17] terminated
2024-08-19 05:59:54 [5] [main] [client 18] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 18] terminated
2024-08-19 05:59:54 [5] [main] [client 19] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 19] terminated
2024-08-19 05:59:54 [5] [main] [livestatus manager] all threads have terminated
2024-08-19 05:59:54 [5] [main] [livestatus manager] removed socket /omd/sites/mrbyte_lan/tmp/run/live
2024-08-19 05:59:54 [5] [core 2947] deleting configuration 391 (0x25fbcd0) from 2024-08-18 12:19:29 with 12 hosts and 117 services
2024-08-19 05:59:54 [5] [core 2947] Shutdown of core successful.
2024-08-19 05:59:54 [5] [core 2947] This is the end. Good bye and thanks for running me.
2024-08-19 05:59:54 [5] [core 2947] flushing state history cache
2024-08-19 05:59:54 [5] [core 2947] Check_MK Micro Core exiting. Good bye.
2024-08-19 11:18:42 [5] [core 2975] daemon logging started

When I realized that the site was not working, I rebooted the VM at approximately 11:15 AM.

I see that the rrdcached log is empty, and that cmc.log shows icmpsender and icmpreceiver exiting and rrdcached stopping as well.

From here, I don’t know what’s next…

Kind regards!! :grinning:

The log does not look like a crash or anything like that. It looks like a normal “omd stop” command.
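
To see when such orderly shutdowns happened, one option is to pull the timestamps of the "Shutdown of core successful" line (as seen in the excerpt above) out of cmc.log. This is only a sketch: mysite is a placeholder for the site name and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: SITE is a placeholder, replace it with your own site name.
SITE="${SITE:-mysite}"
CMC_LOG="/opt/omd/sites/$SITE/var/log/cmc.log"

# Print date and time (the first two fields) of every orderly core
# shutdown recorded in the log file given as $1.
clean_shutdown_times() {
    grep -F 'Shutdown of core successful' "$1" | cut -d' ' -f1,2
}

# Only read the real cmc.log when called with --run.
if [ "${1:-}" = "--run" ]; then
    clean_shutdown_times "$CMC_LOG"
fi
```

If the timestamps cluster around the same time of day, that points to something scheduled rather than a random failure.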

Hello @andreas-doehler !! :grinning:

This is the weirdest thing: at 05:59 I was sleeping. So I don’t understand how an omd stop command could have been run without my interaction… :person_shrugging:

I use Checkmk to monitor my home servers, I am my own “sysadmin”…

Kind regards!!

First, I would have a look at the system log around that time.
But that is only speculation.

Where can I find the system log? I don’t see it in /opt/omd/sites/{site_name}/var/log/…

Directly under /var/log/… The exact file depends on your OS.

OK, I’m going to look there…

Kind regards!! :grinning:

Just one more question…

What should I search for? What kind of event or text could be relevant?

Thank you very much!!

It could be a cron job or anything else. I don’t know what is installed on this machine.

Great, that is a starting point!! :grinning:
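
One way to start, as a sketch: print the system journal for a few minutes around the stop at 05:59 and keep only the lines that mention likely triggers. The time window comes from the cmc.log excerpt above. The keyword list is just a guess at what might be relevant, not a known cause, and the function names are made up for illustration.

```shell
#!/bin/sh
# Time window around the stop seen in cmc.log (05:59 on 2024-08-19).
SINCE="2024-08-19 05:50:00"
UNTIL="2024-08-19 06:05:00"

# Keep only lines that mention possible triggers. The keywords are
# guesses, adjust them to whatever is installed on the machine.
filter_suspects() {
    grep -Ei 'omd|cron|apt|unattended|shutdown|reboot|oom|killed'
}

# Show the filtered journal entries for the chosen time window.
show_window() {
    journalctl --since "$SINCE" --until "$UNTIL" --no-pager | filter_suspects
}

# Only query the journal when called with --run.
if [ "${1:-}" = "--run" ]; then
    show_window
fi
```

If the machine also writes a plain-text log such as /var/log/syslog, the same filter can be applied to that file. If nothing matches, reading the unfiltered journal for the same window is the next step.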

This machine is a VM running Ubuntu Server only for Checkmk.

Kind regards!! :grinning: