We installed 2.4.0 a month ago and until last week everything went great.
Since then the nagios service keeps stopping every 60 seconds.
We upgraded to 2.4.0p2 today, but the issue still remains.
CMK version: 2.4.0p2
OS version: ubuntu-server-24.04.2-LTS
Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
> OMD[DFE]:~/var/log$ cmk --debug -vvn checkmk
> value store: loading from disk
> Checkmk version 2.4.0p2
> Updating IPv4 DNS cache for checkmk: 127.0.1.1
> Trying to acquire lock on /omd/sites/DFE/var/check_mk/ipaddresses.cache
> Got lock on /omd/sites/DFE/var/check_mk/ipaddresses.cache
> Releasing lock on /omd/sites/DFE/var/check_mk/ipaddresses.cache
> Released lock on /omd/sites/DFE/var/check_mk/ipaddresses.cache
> + FETCHING DATA
> Source: SourceInfo(hostname='checkmk', ipaddress='127.0.1.1', ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
> [cpu_tracking] Start [7333f91a21b0]
> Read from cache: NoCache(path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
> 0 piggyback files for 'checkmk'.
> 0 piggyback files for '127.0.1.1'.
> Get piggybacked data
> [cpu_tracking] Stop [7333f91a21b0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
> [cpu_tracking] Start [7333f96e57f0]
> 0 piggyback files for 'checkmk'.
> + PARSE FETCHER RESULTS
> HostKey(hostname='checkmk', source_type=<SourceType.HOST: 1>) -> Add sections: []
> Received no piggyback data
> [cpu_tracking] Stop [7333f96e57f0 - Snapshot(process=posix.times_result(user=0.010000000000000009, system=0.009999999999999995, children_user=0.0, children_system=0.0, elapsed=0.009999999776482582))]
> [piggyback] Success (but no data found for this host), execution time 0.0 sec | execution_time=0.010 user_time=0.010 system_time=0.010 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.000
cmk --debug -O -vv looks okay.
Here are the last couple of lines:
> DC02 :Trying to acquire lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/host_checks/DC02.py
> Got lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/host_checks/DC02.py
> Releasing lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/host_checks/DC02.py
> Released lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/host_checks/DC02.py
> ==> /omd/sites/DFE/var/check_mk/core/helper_config/179/host_checks/DC02.
> OK
> Running '/omd/sites/DFE/bin/nagios -vp /omd/sites/DFE/tmp/nagios/nagios.cfg'
> Validating Nagios configuration...OK
> Trying to acquire lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/stored_passwords
> Got lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/stored_passwords
> Releasing lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/stored_passwords
> Released lock on /omd/sites/DFE/var/check_mk/core/helper_config/179/stored_passwords
> Reloading monitoring core...OK
> Releasing lock on /omd/sites/DFE/etc/check_mk/main.mk
> Released lock on /omd/sites/DFE/etc/check_mk/main.mk
> OMD[DFE]:~$
Did you find a solution? I have the same problem (Signalled to death by signal 15) after updating to 2.4. It started with p1 and is still there with p4.
Nagios keeps crashing, but I can't find any reason.
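A rough way to at least see when and how it stops (just a sketch, assuming the standard site layout; the exact wording of the shutdown message may differ between versions):

# look for signal / shutdown related entries in the core's own log
grep -iE 'caught sig|shutting down|signal' ~/var/nagios/nagios.log | tail -n 20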
Hi! I have the same problem with 2.4.0p4.cre.
A few minutes after starting up my site with Checkmk RAW, I can see that Nagios dies (as indicated by omd status).
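To catch the moment it dies, a simple polling loop from the site shell works well enough (just a sketch; nothing Checkmk-specific about it):

# poll the core status every 30 seconds and keep a timestamped trail
while true; do
    date
    omd status nagios
    sleep 30
done | tee -a ~/tmp/nagios_status_watch.log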
OMD[gpmv]:~$ ls -l ~/var/mkeventd/history -a
total 32
drwx------ 2 gpmv gpmv 4096 Jun 20 21:13 ./
drwxr-xr-x 4 gpmv gpmv 4096 Jun 20 21:15 ../
-rw------- 1 gpmv gpmv 24576 Jun 20 21:13 history.sqlite
I do not see anything relevant in nagios.log; it just stops adding more records.
notify.log (I'm using /usr/sbin/sendmail to send notifications) starts complaining about the socket being down:
2025-06-20 21:16:48,358 [20] [cmk.events.log_to_history] sending command LOG;SERVICE NOTIFICATION RESULT: mzalewski;borg2;Check_MK;OK;mail;Spooled mail to local mail transmission agent;Spooled mail to local mail transmission agent
2025-06-20 21:16:49,678 [40] [cmk.events.log_to_history] Cannot send livestatus command (Timeout: 2 sec)
Traceback (most recent call last):
File "/omd/sites/gpmv/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 618, in _create_new_socket_connection
site_socket.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/omd/sites/gpmv/lib/python3.12/site-packages/cmk/events/log_to_history.py", line 53, in _livestatus_cmd
connection.command(f"[{time.time():.0f}] {command}")
File "/omd/sites/gpmv/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 925, in command
self.send_command(f"COMMAND {command_str}")
File "/omd/sites/gpmv/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 929, in send_command
self.connect()
File "/omd/sites/gpmv/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 592, in connect
site_socket = self._create_new_socket_connection()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/omd/sites/gpmv/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 641, in _create_new_socket_connection
raise MKLivestatusSocketError(f"Cannot connect to '{self.socketurl}': {e}")
cmk.livestatus_client.MKLivestatusSocketError: Cannot connect to 'unix:/omd/sites/gpmv/tmp/run/live': [Errno 111] Connection refused
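The ConnectionRefusedError only means the core (and with it the livestatus socket) is already gone at that point. A quick way to confirm that from the site shell (assuming the usual site layout; lq is the small livestatus query helper shipped inside the site):

# is the livestatus socket still there, and does the core answer?
ls -l ~/tmp/run/live
printf 'GET status\nColumns: program_version program_start\n\n' | lq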
~/var/log/ui-job-scheduler/ui-job-scheduler.log does not seem to show anything bad.
~/var/log/automation-helper/automation-helper.log seems OK as well
cmk --debug -O -vv runs fine.
I do not see any relevant crash reports under Monitor >> Crash reports.
The system sometimes runs for 10 minutes or so and then Nagios just dies. 4+ GB RAM, 10 cores, about 1900 services.
Hey Jakkul, could you elaborate on how you managed to track it down to that one host?
I’m facing the exact same issue and have been trying to narrow it down as well.
Hey @aSilentSniper! I was looking at /omd/sites/MYSITE/var/nagios/nagios.log. It shows what gets registered for every host Checkmk is looking at, and it stops growing when Nagios dies. So I reviewed the last few hosts shown there and the last few hosts I had reconfigured. And bingo, I found the culprit.
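For anyone trying the same, a rough sketch of that narrowing-down from the site shell (the grep pattern just pulls the host part out of the ALERT lines, so adjust it to your log):

# last entries written before the core died
tail -n 40 ~/var/nagios/nagios.log
# rough list of the last hosts that show up in those entries
tail -n 200 ~/var/nagios/nagios.log | grep -oE '(HOST|SERVICE) ALERT: [^;]+' | sort -u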
The problem is still there with 2.4.0p5. Nagios just dies, and the logs are clean, like @jakkul described.
There is no obvious reason why Nagios crashes; it happens sometimes after 10 minutes, sometimes after a few hours.
I tried disabling the hosts that were last checked in nagios.log before it stopped. That didn't help, and neither did an agent update or a service rescan.
I turned on debug logging in Nagios.
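For anyone who wants to do the same, a sketch (the file that carries the debug options may differ per installation, so grep for it first):

# find where debug_level / debug_file are set in the site's Nagios config
grep -rn 'debug_level\|debug_file' ~/etc/nagios/
# raise debug_level there, then restart the core so it takes effect
omd restart nagios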
The last crash gives an error in debug.log:
[1751245183] SERVICE ALERT: g8microserver;Temperature Zone 1;WARNING;HARD;1;Temperature: 56.0 °C (warn/crit at 55.0 °C/65.0 °C)(!)
[1751245183] SERVICE NOTIFICATION: check-mk-notify;g8microserver;Temperature Zone 1;WARNING;check-mk-notify;Temperature: 56.0 °C (warn/crit at 55.0 °C/65.0 °C)(!)
Could this be related to the Nagios crash?
But notification mails work in general?!
The Nagios crash gets triggered by this notification (and only by this notification; others work fine).
I am able to reproduce the crash: Analyze recent notifications → replay this notification. Nagios crashes immediately!
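If someone wants to reproduce this outside the GUI: as far as I know the most recent raw notifications can also be replayed from the site shell (0 being the newest one; syntax from memory, so double-check it on your version):

# replay the most recent notification through the notification system
cmk --notify replay 0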
notify.log
2025-06-30 12:34:24,689 [20] [cmk.events.log_to_history] sending command LOG;SERVICE NOTIFICATION RESULT: cmkadmin;g8microserver;Temperature Zone 1;OK;mail;Spooled mail to local mail transmission agent;Spooled mail to local mail transmission agent
2025-06-30 12:34:26,010 [40] [cmk.events.log_to_history] Cannot send livestatus command (Timeout: 2 sec)
Traceback (most recent call last):
File "/omd/sites/monitoring/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 618, in _create_new_socket_connection
site_socket.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/omd/sites/monitoring/lib/python3.12/site-packages/cmk/events/log_to_history.py", line 53, in _livestatus_cmd
connection.command(f"[{time.time():.0f}] {command}")
File "/omd/sites/monitoring/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 925, in command
self.send_command(f"COMMAND {command_str}")
File "/omd/sites/monitoring/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 929, in send_command
/__init__.py", line 929, in send_command
self.connect()
File "/omd/sites/monitoring/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 592, in connect
site_socket = self._create_new_socket_connection()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/omd/sites/monitoring/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 641, in _create_new_socket_connection
raise MKLivestatusSocketError(f"Cannot connect to '{self.socketurl}': {e}")
cmk.livestatus_client.MKLivestatusSocketError: Cannot connect to 'unix:/omd/sites/monitoring/tmp/run/live': [Errno 111] Connection refused
2025-06-30 12:34:26,011 [20] [cmk.events.log_to_history] Command was: LOG;SERVICE NOTIFICATION RESULT: cmkadmin;g8microserver;Temperature Zone 1;OK;mail;Spooled mail to local mail transmission agent;Spooled mail to local mail transmission agent
The crash seems to be caused by faulty graph data. Nagios can be crashed just by visiting the service in WATO: no graphs are rendered; instead, Nagios crashes instantly.
Some other service graphs show the same behavior (but not all).
rrdcached.log
2025-06-30 11:23:36 [3] handle_request_update: Could not read RRD file.
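Before deleting anything, a small loop like this can show which RRD files under the affected host's perfdata directory are actually unreadable (rrdtool is available inside the site; the host name here is just the one from my case):

# flag every RRD under the host's perfdata dir that rrdtool cannot read
for f in ~/var/pnp4nagios/perfdata/g8microserver/*.rrd; do
    rrdtool info "$f" > /dev/null 2>&1 || echo "unreadable: $f"
done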
I was able to solve the problem by simply deleting the faulty RRDs of the affected hosts under ~/var/pnp4nagios/perfdata (I deleted the complete directory of the host).
The RRDs get recreated, et voilà: Nagios doesn't crash any more (and WATO shows me graphs again).
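In command form, roughly the same fix, except moving the data aside instead of deleting it right away (stopping the whole site first so nothing still has the files open; HOSTNAME is a placeholder):

# stop the site, move the affected host's perfdata aside, start again;
# the RRDs are recreated on the next check cycle
omd stop
mv ~/var/pnp4nagios/perfdata/HOSTNAME ~/var/pnp4nagios/perfdata/HOSTNAME.broken
omd start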