peterge
(Peter Gerhards)
July 1, 2024, 9:03am
1
CMK version: 2.3.0p6
OS version: Debian 11
Error message:
I updated our CMK Raw server to 2.3.0p6 last week (after I was facing this problem on 2.2.0).
In the night from 27 to 28 June our site was stopped completely. Now, for the second time (the first time occurred this morning at 09:18), our site is only partially running, with the rrdcached service stopped.
systemctl status doesn't give much more information:
root@checkmk:~# systemctl status check-mk-raw-2.3.0p6.service
* check-mk-raw-2.3.0p6.service - LSB: OMD sites
Loaded: loaded (/etc/init.d/check-mk-raw-2.3.0p6; generated)
Active: active (exited) since Fri 2024-06-28 09:18:24 CEST; 3 days ago
Docs: man:systemd-sysv-generator(8)
Tasks: 0 (limit: 309029)
Memory: 0B
CPU: 0
CGroup: /system.slice/check-mk-raw-2.3.0p6.service
Jun 28 09:18:24 checkmk systemd[1]: Starting LSB: OMD sites...
Jun 28 09:18:24 checkmk check-mk-raw-2.3.0p6[112]: OMD autostart disabled, skipping ...
Jun 28 09:18:24 checkmk systemd[1]: Started LSB: OMD sites.
Is there any log that might show why it's stopping?
Hello @peterge
Check the logs in the /opt/omd/sites/{site_name}/var/log/ directory. Check the rrdcached log to identify the issue.
Regards,
DD
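As a rough sketch of that first step, the helper below lists the files in a site's log directory that are non-empty and were modified recently, which narrows down where to look. The function name `recent_nonempty_logs` and the 24-hour default are illustrative assumptions, not part of Checkmk.

```shell
# Hypothetical helper (not a Checkmk tool): print non-empty files under a
# log directory that were modified within the last N minutes.
recent_nonempty_logs() {
    log_dir="$1"
    minutes="${2:-1440}"   # assumed default: the last 24 hours
    # -size +0 skips empty files; -mmin -N keeps files changed less than
    # N minutes ago (supported by GNU find, as shipped with Debian/Ubuntu)
    find "$log_dir" -type f -size +0 -mmin "-$minutes" | sort
}
```

For example, `recent_nonempty_logs /opt/omd/sites/{site_name}/var/log 720` would show the logs written during the last twelve hours.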
peterge
(Peter Gerhards)
July 1, 2024, 12:09pm
3
root@checkmk:~# ls -l /opt/omd/sites/misoft/var/log/rrdcached.log*
-rw-r----- 1 misoft misoft 0 Jul 1 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log
-rw-r----- 1 misoft misoft 0 Jun 30 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.1
-rw-r----- 1 misoft misoft 20 Jun 23 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.2.gz
-rw-r----- 1 misoft misoft 20 Jun 22 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.3.gz
-rw-r----- 1 misoft misoft 20 Jun 21 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.4.gz
-rw-r----- 1 misoft misoft 20 Jun 20 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.5.gz
-rw-r----- 1 misoft misoft 20 Jun 19 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.6.gz
-rw-r----- 1 misoft misoft 20 Jun 17 00:00 /opt/omd/sites/misoft/var/log/rrdcached.log.7.gz
root@checkmk:~# cat /opt/omd/sites/misoft/var/log/rrdcached.log
root@checkmk:~# cat /opt/omd/sites/misoft/var/log/rrdcached.log.1
root@checkmk:~#
those are empty…
Is the rrdcached service unable to start, or does it start at first and stop afterwards?
peterge
(Peter Gerhards)
July 1, 2024, 12:44pm
5
It gets started when I run omd start or omd restart. But it stops randomly…
The whole site stopped in the night from 27 to 28 June, and it stopped twice this morning. I did not do anything, I just ran omd start again.
You need to check all logs to find out the reason.
I have also faced the same issue in previous versions and never found a reason. Until the root cause is identified, we have written a script that checks “omd status” every 5 minutes and, if any service is reported as “STOPPED”, executes “omd start”.
You can apply the same kind of logic to avoid a big outage.
Regards,
Dinesh
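The script itself is not shown in this thread, so the following is only a minimal sketch of the idea. It assumes that `omd status` prints plain `name: state` lines when piped (no colour codes); the function names `stopped_services` and `restart_if_stopped` are made up for illustration.

```shell
# Hypothetical watchdog sketch, not the script used by the poster.

# Read "omd status" output on stdin and print the names of services whose
# state is "stopped". The "Overall state" summary line is ignored.
stopped_services() {
    awk -F': *' '$1 != "Overall state" && $2 ~ /^stopped[ \t]*$/ { print $1 }'
}

# Start the site again if at least one of its services is reported as stopped.
restart_if_stopped() {
    site="$1"
    stopped=$(omd status "$site" 2>/dev/null | stopped_services | tr '\n' ' ')
    if [ -n "$stopped" ]; then
        echo "$(date): stopped services: ${stopped}; running omd start $site"
        omd start "$site"
    fi
}
```

Saved as a script that ends with a call such as `restart_if_stopped misoft` and run from root's crontab every five minutes, this would restart the site shortly after a service stops. It only limits the outage and does not explain why the service stopped.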
peterge
(Peter Gerhards)
July 1, 2024, 1:47pm
7
Yeah, I was thinking about implementing such a little command as a cron job. I will check whether it occurs again in the next few days…
boomlau
(Aljaž Eferl)
July 4, 2024, 6:20am
8
Did you rename the site or something? I had the same problem after renaming a site.
peterge
(Peter Gerhards)
July 15, 2024, 6:06am
9
nope i did not rename anything
peterge
(Peter Gerhards)
July 15, 2024, 6:08am
10
My current solution is:
root@checkmk:~# crontab -l
[...]
*/5 * * * * /bin/systemctl is-failed omd && echo "reboot through cron @$(date)" >> /root/cron.log && /sbin/reboot
root@checkmk:~# cat .bashrc
[...]
echo ""
echo "root@checkmk:~# cat /root/cron.log"
[ -e /root/cron.log ] && cat /root/cron.log || echo "<log file does not exist, so no reboot did happen by cron>" && echo ""
mrbyte
(Albert)
August 19, 2024, 11:50am
11
Hello!
I have a similar issue, so I think it fits here. If not, I’ll open a separate thread:
CMK version: Cloud Edition (Free) 2.3.0p12
CMK core: CMC
Hosts: 12
Services: 117
OS version: Ubuntu 22.04 running as a VM on Proxmox
On Sunday morning and again today the site suddenly stopped working; I noticed it when I tried to access the web interface.
I listed the logs in /opt/omd/sites/{site_name}/var/log/; these log files are available:
agent-registration.log
alerts.log
cmc.log
dcd.log
diskspace.log
licensing.log
liveproxyd.log
mkeventd.log
mknotifyd.log
notify.log
redis-server.log
rrdcached.log
security.log
update.log
web.log
This is the output of cat cmc.log around the time the site stopped, at 05:59:
2024-08-19 05:41:16 [5] [core 2947] Executing external command: LOG;SERVICE NOTIFICATION: cmkadmin;Proxmox-2;Temperature Zone 5;OK;mail;Temperature: 68.0 °C
2024-08-19 05:41:18 [5] [core 2947] Executing external command: LOG;SERVICE NOTIFICATION RESULT: cmkadmin;Proxmox-2;Temperature Zone 5;OK;mail;Spooled mail to local mail transmission agent;Spooled mail to local mail transmission agent
2024-08-19 05:59:22 [5] [icmpreceiver 3002] terminated
2024-08-19 05:59:22 [5] [icmpsender 3001] exited normally
2024-08-19 05:59:24 [5] [notification helper 3000] exited normally
2024-08-19 05:59:27 [4] [alert helper 2973] forcibly terminating child process
2024-08-19 05:59:27 [3] [alert helper 2973] killed by signal 9
2024-08-19 05:59:27 [5] [main] [rrdcached at "/omd/sites/mrbyte_lan/tmp/run/rrdcached.sock"] stopping...
2024-08-19 05:59:27 [5] [rrdcached] [rrdcached at "/omd/sites/mrbyte_lan/tmp/run/rrdcached.sock"] closing connection
2024-08-19 05:59:27 [5] [rrdcached] [rrdcached at "/omd/sites/mrbyte_lan/tmp/run/rrdcached.sock"] stopped
2024-08-19 05:59:30 [4] [main] [RRD helper 2971] forcibly terminating child process
2024-08-19 05:59:30 [3] [main] [RRD helper 2971] killed by signal 9
2024-08-19 05:59:30 [5] [main] [carbon connection pool] stopping...
2024-08-19 05:59:30 [5] [carbon] [carbon connection pool] stopped
2024-08-19 05:59:30 [5] [main] [influxdb connection pool] stopping...
2024-08-19 05:59:30 [5] [influxdb] [influxdb connection pool] stopped
2024-08-19 05:59:30 [5] [core 2947] [main] stopping config cleaner...
2024-08-19 05:59:30 [5] [core 2947] [config cleaner] stopped
2024-08-19 05:59:30 [5] [core 2947] [main] stopping state saver...
2024-08-19 05:59:30 [5] [core 2947] [state saver] stopped
2024-08-19 05:59:30 [5] [generic pool] [helper 2977] exited normally
2024-08-19 05:59:30 [5] [generic pool] [helper 2978] exited normally
2024-08-19 05:59:31 [5] [generic pool] [helper 2979] exited normally
2024-08-19 05:59:31 [5] [generic pool] [helper 2980] exited normally
2024-08-19 05:59:31 [5] [generic pool] [helper 2981] exited normally
2024-08-19 05:59:34 [4] [checker pool] [helper 2982] forcibly terminating child process
2024-08-19 05:59:34 [3] [checker pool] [helper 2982] killed by signal 9
2024-08-19 05:59:37 [4] [checker pool] [helper 2983] forcibly terminating child process
2024-08-19 05:59:37 [3] [checker pool] [helper 2983] killed by signal 9
2024-08-19 05:59:40 [4] [checker pool] [helper 2984] forcibly terminating child process
2024-08-19 05:59:40 [3] [checker pool] [helper 2984] killed by signal 9
2024-08-19 05:59:43 [4] [checker pool] [helper 2985] forcibly terminating child process
2024-08-19 05:59:43 [3] [checker pool] [helper 2985] killed by signal 9
2024-08-19 05:59:45 [5] [real-time pool] [helper 2986] exited normally
2024-08-19 05:59:45 [5] [fetcher pool] [service "Checkmk;Check_MK"] [helper 2987] aborting running check
2024-08-19 05:59:45 [5] [fetcher pool] [helper 2987] terminated
2024-08-19 05:59:45 [5] [fetcher pool] [helper 2988] exited normally
2024-08-19 05:59:46 [5] [fetcher pool] [helper 2989] exited normally
2024-08-19 05:59:47 [5] [fetcher pool] [helper 2990] exited normally
2024-08-19 05:59:48 [5] [fetcher pool] [helper 2991] exited normally
2024-08-19 05:59:49 [5] [fetcher pool] [helper 2992] exited normally
2024-08-19 05:59:50 [5] [fetcher pool] [helper 2993] exited normally
2024-08-19 05:59:50 [5] [fetcher pool] [helper 2994] exited normally
2024-08-19 05:59:51 [5] [fetcher pool] [helper 2995] exited normally
2024-08-19 05:59:52 [5] [fetcher pool] [helper 2996] exited normally
2024-08-19 05:59:52 [5] [fetcher pool] [helper 2997] exited normally
2024-08-19 05:59:53 [5] [fetcher pool] [helper 2998] exited normally
2024-08-19 05:59:54 [5] [fetcher pool] [helper 2999] exited normally
2024-08-19 05:59:54 [5] [main] [livestatus manager] terminating 20 client threads...
2024-08-19 05:59:54 [5] [main] [client 0] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 0] terminated
2024-08-19 05:59:54 [5] [main] [client 1] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 1] terminated
2024-08-19 05:59:54 [5] [main] [client 2] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 2] terminated
2024-08-19 05:59:54 [5] [main] [client 3] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 3] terminated
2024-08-19 05:59:54 [5] [main] [client 4] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 4] terminated
2024-08-19 05:59:54 [5] [main] [client 5] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 5] terminated
2024-08-19 05:59:54 [5] [main] [client 6] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 6] terminated
2024-08-19 05:59:54 [5] [main] [client 7] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 7] terminated
2024-08-19 05:59:54 [5] [main] [client 8] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 8] terminated
2024-08-19 05:59:54 [5] [main] [client 9] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 9] terminated
2024-08-19 05:59:54 [5] [main] [client 10] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 10] terminated
2024-08-19 05:59:54 [5] [main] [client 11] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 11] terminated
2024-08-19 05:59:54 [5] [main] [client 12] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 12] terminated
2024-08-19 05:59:54 [5] [main] [client 13] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 13] terminated
2024-08-19 05:59:54 [5] [main] [client 14] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 14] terminated
2024-08-19 05:59:54 [5] [main] [client 15] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 15] terminated
2024-08-19 05:59:54 [5] [main] [client 16] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 16] terminated
2024-08-19 05:59:54 [5] [main] [client 17] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 17] terminated
2024-08-19 05:59:54 [5] [main] [client 18] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 18] terminated
2024-08-19 05:59:54 [5] [main] [client 19] waiting for termination...
2024-08-19 05:59:54 [5] [main] [client 19] terminated
2024-08-19 05:59:54 [5] [main] [livestatus manager] all threads have terminated
2024-08-19 05:59:54 [5] [main] [livestatus manager] removed socket /omd/sites/mrbyte_lan/tmp/run/live
2024-08-19 05:59:54 [5] [core 2947] deleting configuration 391 (0x25fbcd0) from 2024-08-18 12:19:29 with 12 hosts and 117 services
2024-08-19 05:59:54 [5] [core 2947] Shutdown of core successful.
2024-08-19 05:59:54 [5] [core 2947] This is the end. Good bye and thanks for running me.
2024-08-19 05:59:54 [5] [core 2947] flushing state history cache
2024-08-19 05:59:54 [5] [core 2947] Check_MK Micro Core exiting. Good bye.
2024-08-19 11:18:42 [5] [core 2975] daemon logging started
When I realized that the site wasn’t working, I rebooted the VM at approximately 11:15 AM.
I see that the rrdcached log is empty, and that cmc.log contains references to icmpsender and icmpreceiver exiting and to rrdcached stopping.
From here, I don’t know what’s next…
Kind regards!!
The log does not look like a crash or anything. It is a normal “omd stop” command.
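One way to check this reading is to reduce the long shutdown sequence to the entries at warning level or worse. The sketch below assumes the `DATE TIME [LEVEL] message` layout seen in the excerpt and that lower numbers mean higher severity (syslog style); the function name `cmc_warnings` is hypothetical.

```shell
# Hypothetical filter: keep only cmc.log lines whose bracketed level is 0-4.
# Field 3 is the level, e.g. "[5]" for routine messages in the excerpt above.
cmc_warnings() {
    awk '$3 ~ /^\[[0-4]\]$/'
}
```

Run as `cmc_warnings < cmc.log`. Applied to the excerpt above, only the "forcibly terminating child process" and "killed by signal 9" lines for the alert helper, the RRD helper and the checker pool helpers remain. Everything else, including "Shutdown of core successful.", is logged at level 5, which fits an orderly stop rather than a crash.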
mrbyte
(Albert)
August 19, 2024, 12:13pm
13
Hello @andreas-doehler !!
This is the weirdest thing: at 05:59 I was sleeping, so I don’t understand how an omd stop command could run without my interaction…
I use Checkmk to monitor my home servers, I am my own “sysadmin”…
Kind regards!!
First I would have a look at the system log at this time.
But it is only speculation.
mrbyte
(Albert)
August 19, 2024, 12:17pm
15
Where can I find the system log? I don’t see it in /opt/omd/sites/{site_name}/var/log/…
Directly in /var/log/…; which file depends on your OS.
mrbyte
(Albert)
August 19, 2024, 12:18pm
17
OK, I’ll look there…
Kind regards!!
mrbyte
(Albert)
August 19, 2024, 12:21pm
18
Just one more question…
What should I search for? What kind of event or text could be relevant?
Thank you very much!!
It could be a cron job or anything else. I don’t know what is installed on this machine.
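A possible starting point, sketched under assumptions: print the system log entries from a few minutes around the stop at 05:59 and look for cron jobs or other scheduled activity in that window. The helper `syslog_window` is hypothetical and assumes the traditional `Mon DD HH:MM:SS` timestamp prefix. The file name `/var/log/syslog` is the usual one on Ubuntu but may differ on other systems.

```shell
# Hypothetical helper: read a traditional syslog file on stdin and print the
# lines of one day whose time lies between FROM and TO (both HH:MM:SS).
syslog_window() {
    month="$1"; day="$2"; from="$3"; to="$4"
    # Times in HH:MM:SS form compare correctly as plain strings.
    awk -v m="$month" -v d="$day" -v from="$from" -v to="$to" \
        '$1 == m && $2 == d && $3 >= from && $3 <= to'
}
```

For the stop discussed here this would be `syslog_window Aug 19 05:50:00 06:05:00 < /var/log/syslog`. On systems that log to the systemd journal, `journalctl --since "2024-08-19 05:50" --until "2024-08-19 06:05"` should give a comparable view.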
mrbyte
(Albert)
August 19, 2024, 12:25pm
20
Great, that is a starting point!!
This machine is a VM running Ubuntu Server only for Checkmk.
Kind regards!!