Running omd start starts the service, which runs for ~60 seconds and then stops again. There are no error messages in nagios.log or anywhere else we can find.
Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
cmk --debug -vvn hostname
value store: loading from disk
Checkmk version 2.4.0
Failed to lookup IPv4 address of hostname via DNS: [Errno -2] Name or service not known(!!)
This service was upgraded from 2.3.0 latest to 2.4.0 latest and has been failing since.
Setting the debug logging in etc/nagios/nagios.d/logging.cfg to a level of -1 with verbose output produces no errors in debug.log that would help troubleshoot this.
OMD[cmk]:~$ find ~/local/lib/python3/ -type d -name '*.*-info'
OMD[cmk]:~$ mkp list
Name Version Title Author Req. Version Until Version Files State
---- ------- ----- ------ ------------ ------------- ----- -----
OMD[cmk]:~$
No custom module or code.
I seem to have gotten nagios stable by disabling a bunch of rules and notifications, as well as the monitoring of rabbitmq.
Next are crash errors on service discovery of agents.
Error running automation call service-discovery-preview (exit code 2), error:
Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Looks to be an issue with the host data after migration. 4 of the 12 hosts were crashing with an agent crash log. Simply deleting each host and recreating it allowed host and service discovery to work again.
Before deleting the hosts I tested them with cmk -vv --debug -I hostname and got no errors. Afterwards, both the GUI and the console are working fine again.
The update was 2.3.0p31 → 2.4.0. Basically the versions are set to follow tags such as 2.3.0-latest, with watchtower restarting the container any time a new tag has been released. We moved to 2.4.0-latest yesterday as part of the update. At one point we tested cloud and then downgraded to raw.
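For what it's worth, that exact message is what Python's json module raises when it is handed a Python-literal style dict (single quotes) instead of JSON. Below is a minimal sketch for telling the two apart. The helper names and the sample string are hypothetical, and which file the automation call actually reads is not known from this thread.

```python
import ast
import json


def classify(text: str) -> str:
    """Return 'json', 'python-literal', or 'invalid' for a serialized value."""
    try:
        json.loads(text)
        return "json"
    except json.JSONDecodeError:
        pass
    try:
        # Accepts Python literals such as {'key': 'value'} without executing code.
        ast.literal_eval(text)
        return "python-literal"
    except (ValueError, SyntaxError):
        return "invalid"


def json_error(text: str) -> str:
    """Return the json parser's error message for text, or '' if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)
    return ""


# Hypothetical sample content, not taken from the affected site.
sample = "{'check_plugin_name': 'df'}"
print(classify(sample))
print(json_error(sample))
```

If stored host data turns out to be in the 'python-literal' form, that would be consistent with the discovery preview failing on it, but that is only a guess from the error text.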
Zero errors in syslog (container and host)
Nothing in ui-job-scheduler.log, just normal events.
No errors in nagios.log, just a bunch of INITIAL SERVICE STATE entries.
Rules I disabled were related to ignoring select filesystems on hosts.
The rabbitmq monitoring was the rule “Request data from a RabbitMQ instance” applied to a host.
I tried creating a fresh 2.3.0p31 CRE site with those 2 python packages and a rabbitmq rule; updating it to 2.4.0.cre works fine. Nagios Core still runs after the update.
Trying to acquire lock on /omd/sites/cmk/etc/check_mk/main.mk
Got lock on /omd/sites/cmk/etc/check_mk/main.mk
Generating configuration for core (type nagios)...
Trying to acquire lock on /omd/sites/cmk/var/check_mk/passwords_merged
Got lock on /omd/sites/cmk/var/check_mk/passwords_merged
Releasing lock on /omd/sites/cmk/var/check_mk/passwords_merged
Released lock on /omd/sites/cmk/var/check_mk/passwords_merged
Trying to acquire lock on /omd/sites/cmk/var/check_mk/core/helper_config/serial.mk
Got lock on /omd/sites/cmk/var/check_mk/core/helper_config/serial.mk
Releasing lock on /omd/sites/cmk/var/check_mk/core/helper_config/serial.mk
Released lock on /omd/sites/cmk/var/check_mk/core/helper_config/serial.mk
Trying to acquire lock on /omd/sites/cmk/var/check_mk/licensing/licensed_state
Got lock on /omd/sites/cmk/var/check_mk/licensing/licensed_state
Releasing lock on /omd/sites/cmk/var/check_mk/licensing/licensed_state
Released lock on /omd/sites/cmk/var/check_mk/licensing/licensed_state
0 piggyback files for 'api.host1'.
0 piggyback files for 'api.host2'.
Trying to acquire lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host1
Got lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host1
Releasing lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host1
Released lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host1
Trying to acquire lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host2
Got lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host2
Releasing lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host2
Released lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/notify/host_config/api.host2
(snipped out some hosts)
Trying to acquire lock on /omd/sites/cmk/etc/nagios/conf.d/check_mk_objects.cfg
Got lock on /omd/sites/cmk/etc/nagios/conf.d/check_mk_objects.cfg
Releasing lock on /omd/sites/cmk/etc/nagios/conf.d/check_mk_objects.cfg
Released lock on /omd/sites/cmk/etc/nagios/conf.d/check_mk_objects.cfg
Trying to acquire lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/inventory_plugins_index.json
Got lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/inventory_plugins_index.json
Releasing lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/inventory_plugins_index.json
Released lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/inventory_plugins_index.json
Precompiling host checks...Creating precompiled host check config...
Precompiling host checks...
(snipped some hosts all no errors)
api.host1:Trying to acquire lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host1.py
Got lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host1.py
Releasing lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host1.py
Released lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host1.py
==> /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host1.
api.host2:Trying to acquire lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host2.py
Got lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host2.py
Releasing lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host2.py
Released lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host2.py
==> /omd/sites/cmk/var/check_mk/core/helper_config/907/host_checks/api.host2
OK
Running '/omd/sites/cmk/bin/nagios -vp /omd/sites/cmk/tmp/nagios/nagios.cfg'
Validating Nagios configuration...OK
Trying to acquire lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/stored_passwords
Got lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/stored_passwords
Releasing lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/stored_passwords
Released lock on /omd/sites/cmk/var/check_mk/core/helper_config/907/stored_passwords
Releasing lock on /omd/sites/cmk/etc/check_mk/main.mk
Released lock on /omd/sites/cmk/etc/check_mk/main.mk
The downgrade would have been a long time ago. Just wanted to mention that.
Yes, it's still happening.
Yes, the nagios service keeps stopping. Note that yesterday it was stopping every few minutes in the morning, then it worked for hours with no issues, only to resume stopping again overnight.
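To pin down when the core actually goes quiet, the timestamps in nagios.log can be scanned for long silent periods. This is only a rough sketch: it relies on the standard "[epoch] message" line format of nagios.log, and the function name and the 300-second threshold are assumptions, not anything from Checkmk itself.

```python
import re
from datetime import datetime, timezone

# Nagios log lines start with a Unix timestamp in square brackets.
LINE_RE = re.compile(r"^\[(\d+)\] ")


def find_gaps(lines, min_gap_seconds=300):
    """Return (last_seen, next_seen, seconds) for each silent period above the threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = int(match.group(1))
        if previous is not None and stamp - previous > min_gap_seconds:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


# Made-up sample lines; for a real run pass open(path_to_nagios_log) instead.
sample = [
    "[1746643034] INITIAL SERVICE STATE: ahost1;Check_MK;OK;HARD;1;",
    "[1746643094] INITIAL SERVICE STATE: ahost1;Check_MK;OK;HARD;1;",
    "[1746646694] INITIAL SERVICE STATE: ahost1;Check_MK;OK;HARD;1;",
]
for start, end, seconds in find_gaps(sample):
    when = datetime.fromtimestamp(start, tz=timezone.utc).isoformat()
    print(f"no log entries for {seconds}s after {when}")
```

The start of each gap gives a time to line up against container restarts or host-level events.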
Sure, this is the most recent crash:
2025-05-07 15:37:21,514 [40] [cmk.web 37546] Unhandled exception (Crash ID: 4fbc966e-2b72-11f0-b1e9-06f90f2b9d11)
Traceback (most recent call last):
File "/omd/sites/cmk/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 327, in query_row
return result[0]
~~~~~~^^^
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/omd/sites/cmk/lib/python3/cmk/gui/pages.py", line 102, in handle_page
action_response = self.page()
^^^^^^^^^^^
File "/omd/sites/cmk/lib/python3/cmk/gui/views/page_ajax_reschedule.py", line 30, in page
return self._do_reschedule(api_request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/omd/sites/cmk/lib/python3/cmk/gui/views/page_ajax_reschedule.py", line 114, in _do_reschedule
row = self._wait_for(site, host, what, wait_spec, now, add_filter)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/omd/sites/cmk/lib/python3/cmk/gui/views/page_ajax_reschedule.py", line 43, in _wait_for
return sites.live().query_row(
^^^^^^^^^^^^^^^^^^^^^^^
File "/omd/sites/cmk/lib/python3.12/site-packages/cmk/livestatus_client/__init__.py", line 329, in query_row
raise MKLivestatusNotFoundError(
cmk.livestatus_client.MKLivestatusNotFoundError: No matching entries found for query: GET services
WaitObject: ahost1;Check_MK
WaitCondition: last_check >= 1746643034
WaitTimeout: 10000
WaitTrigger: check
Columns: last_check state plugin_output
Filter: host_name = ahost1
Filter: service_description = Check_MK
This is one of the hosts that had crash events in the dashboard for the Check_MK agent status. We deleted this and 3 other hosts and recreated them, which got rid of these errors.
That was just the solution to the discovery crashes on a host. The overall nagios service kept stopping and continued to. We started over from scratch and rebuilt the monitoring on 2.4.0, keeping the volume data in a separate local docker instance for reference.
We can close this thread, as the problem was fixed by rebuilding.
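The traceback boils down to the services query returning no rows for that host and service, so taking the first row fails. A way to check by hand whether the core knows the service at all is to send the same query without the Wait headers. This is a sketch under assumptions: the socket path is the usual OMD location for this site, and the function names are made up for illustration.

```python
import json
import socket


def build_query(host: str, service: str) -> str:
    """Build a Livestatus query for one service, mirroring the one in the traceback."""
    lines = [
        "GET services",
        "Columns: last_check state plugin_output",
        f"Filter: host_name = {host}",
        f"Filter: service_description = {service}",
        "OutputFormat: json",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def run_query(socket_path: str, query: str) -> list:
    """Send the query to the Livestatus UNIX socket and return the decoded rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return json.loads(b"".join(chunks).decode("utf-8") or "[]")


# Usage as the site user (path is an assumption for a site named "cmk"):
# rows = run_query("/omd/sites/cmk/tmp/run/live", build_query("ahost1", "Check_MK"))
# An empty list means the core has no such service.
```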
2.5.0 is still in pre-alpha stage and only suitable for testing and debugging new features. It is absolutely not suited for production use. If it is the same bug, it will eventually be fixed, but production versions (read: current stable 2.4.0 and old stable 2.3.0) have priority.