OMD site keeps crashing

**CMK version:** 2.1.0p16
**OS version:** Debian GNU/Linux 11 (bullseye)

**Error message:**

```
2023-02-02 08:57:41,566 [20] [cmk.mkeventd] -----------------------------------------------------------------
2023-02-02 08:57:41,567 [20] [cmk.mkeventd] mkeventd version 2.1.0p16 starting
2023-02-02 08:57:41,568 [20] [cmk.mkeventd.EventServer] Created FIFO '/omd/sites/****/tmp/run/mkeventd/events' for receiving events
2023-02-02 08:57:41,568 [20] [cmk.mkeventd.EventServer] Opened UNIX socket '/omd/sites/****/tmp/run/mkeventd/eventsocket' for receiving events
2023-02-02 08:57:41,569 [20] [cmk.mkeventd.EventStatus] Loaded event state from /omd/sites/****/var/mkeventd/status.
2023-02-02 08:57:41,569 [20] [cmk.mkeventd.EventServer] Compiled 0 active rules (ignoring 0 disabled rules)
2023-02-02 08:57:41,569 [20] [cmk.mkeventd.EventServer] Rule hash: 0 rules - 0 hashed, 0 unspecific
2023-02-02 08:57:41,574 [20] [cmk.mkeventd] Daemonized with PID 898632.
2023-02-02 08:57:41,576 [20] [cmk.mkeventd.StatusServer] Starting up
2023-02-02 08:57:41,576 [20] [cmk.mkeventd.EventServer] Starting up
2023-02-02 08:59:42,468 [20] [cmk.mkeventd] Signalled to death by signal 15
2023-02-02 08:59:42,513 [20] [cmk.mkeventd.StatusServer] Terminated
2023-02-02 08:59:42,701 [20] [cmk.mkeventd.EventServer] Terminated
2023-02-02 08:59:42,811 [20] [cmk.mkeventd.EventServer] Top 20 of facility/priority:
2023-02-02 08:59:42,811 [20] [cmk.mkeventd] Successfully shut down.
```

**Output of “cmk --debug -vvn hostname”:**

```
OMD[****]:~$ cmk --debug -vvn checkmk-new
Checkmk version 2.1.0p16
Try license usage history update.
Trying to acquire lock on /omd/sites/****/var/check_mk/license_usage/next_run
Got lock on /omd/sites/****/var/check_mk/license_usage/next_run
Trying to acquire lock on /omd/sites/****/var/check_mk/license_usage/history.json
Got lock on /omd/sites/****/var/check_mk/license_usage/history.json
Next run time has not been reached yet. Abort.
Releasing lock on /omd/sites/****/var/check_mk/license_usage/history.json
Released lock on /omd/sites/****/var/check_mk/license_usage/history.json
Releasing lock on /omd/sites/****/var/check_mk/license_usage/next_run
Released lock on /omd/sites/****/var/check_mk/license_usage/next_run
Updating IPv4 DNS cache for checkmk-new: 10.10.0.4
Trying to acquire lock on /omd/sites/****/var/check_mk/ipaddresses.cache
Got lock on /omd/sites/****/var/check_mk/ipaddresses.cache
Releasing lock on /omd/sites/****/var/check_mk/ipaddresses.cache
Released lock on /omd/sites/****/var/check_mk/ipaddresses.cache
+ FETCHING DATA
  Source: SourceType.HOST/FetcherType.PIGGYBACK
[cpu_tracking] Start [7fa1edeb85e0]
[PiggybackFetcher] Fetch with cache settings: NoCache(checkmk-new, base_path=/omd/sites/****/tmp/check_mk/data_source_cache/piggyback, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=True, use_outdated=False, simulation=False)
Not using cache (Cache usage disabled)
[PiggybackFetcher] Execute data source
No piggyback files for 'checkmk-new'. Skip processing.
No piggyback files for '10.10.0.4'. Skip processing.
Not using cache (Cache usage disabled)
[cpu_tracking] Stop [7fa1edeb85e0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
  Source: SourceType.HOST/FetcherType.PIGGYBACK
No persisted sections
  -> Add sections: []
Received no piggyback data
[cpu_tracking] Start [7fa1edeb8820]
value store: synchronizing
Trying to acquire lock on /omd/sites/****/tmp/check_mk/counters/checkmk-new
Got lock on /omd/sites/****/tmp/check_mk/counters/checkmk-new
value store: loading from disk
Releasing lock on /omd/sites/****/tmp/check_mk/counters/checkmk-new
Released lock on /omd/sites/****/tmp/check_mk/counters/checkmk-new
No piggyback files for 'checkmk-new'. Skip processing.
No piggyback files for '10.10.0.4'. Skip processing.
[cpu_tracking] Stop [7fa1edeb8820 - Snapshot(process=posix.times_result(user=0.09999999999999964, system=0.010000000000000009, children_user=0.0, children_system=0.0, elapsed=0.11000000312924385))]
execution time 0.1 sec | execution_time=0.110 user_time=0.100 system_time=0.010 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.000
```

livestatus.log:

```
2023-02-02 08:31:25 [main] socket thread has terminated
2023-02-02 08:54:20 [main] socket thread has terminated
2023-02-02 08:56:20 [main] socket thread has terminated
2023-02-02 08:59:40 [main] socket thread has terminated
```

Hello!
Our Checkmk site has a serious problem: 4-5 times a day some error occurs and OMD stops.
Unfortunately there is not much more in the logs. After “omd start site” it runs fine until the next failure.

Any ideas?

Thanks very much!
Fabian

First, one question: from your log I only see a terminated mkeventd process. Is the whole site down or only some services?
The output of “cmk --debug -vvn hostname” will not help with this kind of problem; it is only relevant for problems with checks or with access to the agent/SNMP.
What does your core log show at the time of the crash?

Hello Andreas!

omd status shows:

agent-receiver: stopped
mkeventd:       stopped
rrdcached:      stopped
npcd:           stopped
nagios:         running
apache:         running
redis:          stopped
crontab:        running
-----------------------
Overall state:  partially running
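
To spot which services are down without reading the table by eye, the status text above could be parsed with a few lines of Python. This is only a sketch: it assumes the `name: state` layout shown in this post, and `parse_omd_status` is a made-up helper name, not part of Checkmk or OMD.

```python
def parse_omd_status(text: str) -> dict:
    """Map each service name to its state, based on 'name: state' lines."""
    states = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        # Skip the dashed separator and any line without a state.
        if sep and state.strip():
            states[name.strip()] = state.strip()
    return states


# Sample copied from the post above (assumed to be the usual layout).
SAMPLE = """\
agent-receiver: stopped
mkeventd:       stopped
rrdcached:      stopped
npcd:           stopped
nagios:         running
apache:         running
redis:          stopped
crontab:        running
-----------------------
Overall state:  partially running
"""

states = parse_omd_status(SAMPLE)
stopped = sorted(name for name, state in states.items() if state == "stopped")
print("stopped:", ", ".join(stopped))
print("overall:", states["Overall state"])
```

Run regularly and logged with a timestamp, something like this would at least record when the site went into the “partially running” state.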

agent-receiver.log:

[2023-02-13 13:50:03 +0100] [502] [INFO] Handling signal: term
[2023-02-13 13:50:03 +0100] [511] [INFO] Shutting down
[2023-02-13 13:50:03 +0100] [511] [INFO] Waiting for application shutdown.
[2023-02-13 13:50:03 +0100] [511] [INFO] Application shutdown complete.
[2023-02-13 13:50:03 +0100] [511] [INFO] Finished server process [511]
[2023-02-13 13:50:03 +0100] [511] [INFO] Worker exiting (pid: 511)
[2023-02-13 13:50:04 +0100] [502] [INFO] Shutting down: Master
[2023-02-13 13:50:13 +0100] [569] [INFO] Starting gunicorn 20.1.0
[2023-02-13 13:50:13 +0100] [569] [INFO] Listening at: https://0.0.0.0:8000 (569)
[2023-02-13 13:50:13 +0100] [569] [INFO] Using worker: agent_receiver.worker.ClientCertWorker
[2023-02-13 13:50:13 +0100] [574] [INFO] Booting worker with pid: 574
[2023-02-13 13:50:14 +0100] [574] [INFO] Started server process [574]
[2023-02-13 13:50:14 +0100] [574] [INFO] Waiting for application startup.
[2023-02-13 13:50:14 +0100] [574] [INFO] Application startup complete.

mkeventd.log:

2023-02-13 13:50:03,475 [20] [cmk.mkeventd] Signalled to death by signal 15
2023-02-13 13:50:03,526 [20] [cmk.mkeventd.StatusServer] Terminated
2023-02-13 13:50:14,768 [20] [cmk.mkeventd] -----------------------------------------------------------------
2023-02-13 13:50:14,769 [20] [cmk.mkeventd] mkeventd version 2.1.0p16 starting
2023-02-13 13:50:14,776 [20] [cmk.mkeventd.EventServer] Created FIFO '/omd/sites/****/tmp/run/mkeventd/events' for receiving events
2023-02-13 13:50:14,776 [20] [cmk.mkeventd.EventServer] Opened UNIX socket '/omd/sites/****/tmp/run/mkeventd/eventsocket' for receiving events
2023-02-13 13:50:14,779 [20] [cmk.mkeventd.EventStatus] Loaded event state from /omd/sites/****/var/mkeventd/status.
2023-02-13 13:50:14,779 [20] [cmk.mkeventd.EventServer] Compiled 0 active rules (ignoring 0 disabled rules)
2023-02-13 13:50:14,779 [20] [cmk.mkeventd.EventServer] Rule hash: 0 rules - 0 hashed, 0 unspecific
2023-02-13 13:50:14,784 [20] [cmk.mkeventd] Daemonized with PID 576.
2023-02-13 13:50:14,787 [20] [cmk.mkeventd.StatusServer] Starting up
2023-02-13 13:50:14,794 [20] [cmk.mkeventd.EventServer] Starting up

redis-server.log:

607:signal-handler (1676292594) Received SIGTERM scheduling shutdown...
607:signal-handler (1676292594) Received SIGTERM scheduling shutdown...
607:M 13 Feb 2023 13:49:54.684 # User requested shutdown...
607:M 13 Feb 2023 13:49:54.684 * Removing the pid file.
607:M 13 Feb 2023 13:49:54.684 * Removing the unix socket file.
607:M 13 Feb 2023 13:49:54.684 # Redis is now ready to exit, bye bye...
666:C 13 Feb 2023 13:50:16.155 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
666:C 13 Feb 2023 13:50:16.156 # Redis version=6.2.6, bits=64, commit=5e064e8e, modified=1, pid=666, just started
666:C 13 Feb 2023 13:50:16.156 # Configuration loaded
666:M 13 Feb 2023 13:50:16.157 * monotonic clock: POSIX clock_gettime
666:M 13 Feb 2023 13:50:16.162 * Running mode=standalone, port=0.
666:M 13 Feb 2023 13:50:16.162 # Server initialized
666:M 13 Feb 2023 13:50:16.162 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
666:M 13 Feb 2023 13:50:16.163 * The server is now ready to accept connections at /omd/sites/****/tmp/run/redis

livestatus.log:

2023-02-13 13:49:56 [main] socket thread has terminated

cat /opt/omd/sites/----/var/pnp4nagios/log/npcd.log

[02-13-2023 13:49:57] NPCD: Caught Termination Signal - Astalavista... baby
[02-13-2023 13:50:14] NPCD: npcd Daemon (0.6.26) started with PID=595
[02-13-2023 13:50:14] NPCD: Please have a look at 'npcd -V' to get license information
[02-13-2023 13:50:14] NPCD: HINT: load_threshold is disabled - ('0.000000')
[02-13-2023 13:50:30] NPCD: ERROR: Executed command exits with return code '7'
[02-13-2023 13:50:30] NPCD: ERROR: Command line was '/omd/sites/****/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/****/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/****/var/pnp4nagios/spool//perfdata.1675535806'
[02-13-2023 13:50:30] NPCD: ERROR: Executed command exits with return code '7'
[02-13-2023 13:50:30] NPCD: ERROR: Command line was '/omd/sites/****/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/****/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/****/var/pnp4nagios/spool//perfdata.1675535836'

Those are the logs that look relevant. What do you mean by “core log”? I can’t find a file with that name.

Thanks

In your case the core log is nagios.log. For the Enterprise editions it would be cmc.log.
All the logs you posted look like a “normal” “omd restart”.
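
To see when these stops happen, the “Signalled to death” lines can be pulled out of mkeventd.log. The following is a rough sketch, assuming the line format from the logs posted above; `find_stops` and the sample list are illustrative names, not a Checkmk tool. Signal 15 is SIGTERM, which means something asked the process to stop; the log does not say who sent it.

```python
import re
import signal
from datetime import datetime

# Matches e.g. "2023-02-02 08:59:42,468 [20] [cmk.mkeventd] Signalled to death by signal 15"
STOP_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*Signalled to death by signal (\d+)"
)


def find_stops(lines):
    """Return (timestamp, signal name) for every shutdown-by-signal line."""
    stops = []
    for line in lines:
        match = STOP_LINE.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stops.append((when, signal.Signals(int(match.group(2))).name))
    return stops


# Lines copied from the logs in this thread.
SAMPLE = [
    "2023-02-02 08:59:42,468 [20] [cmk.mkeventd] Signalled to death by signal 15",
    "2023-02-02 08:59:42,513 [20] [cmk.mkeventd.StatusServer] Terminated",
    "2023-02-13 13:50:03,475 [20] [cmk.mkeventd] Signalled to death by signal 15",
]

for when, name in find_stops(SAMPLE):
    print(when.isoformat(sep=" "), name)
```

Comparing the resulting times with cron jobs, package updates or reboots on the host may show whether the stops follow a pattern.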

Hey

nagios.log is full of CT/VM information, SERVICE NOTIFICATION and SERVICE FLAPPING ALERT entries. Is that of interest to you?

Apart from that, only “livestatus” logs something regularly:

[1676298819] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)
[1676298879] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)
[1676298939] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)
[1676298999] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)
[1676299059] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)
[1676299119] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)
[1676299179] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)

Without a human-readable timestamp, I can’t tell from this when it failed :-/
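
The number in square brackets at the start of each nagios.log line is a Unix timestamp (seconds since 1970-01-01 UTC), so it can be converted. A minimal sketch in Python, assuming every line starts with a ten-digit `[epoch]` prefix as in the excerpt above; `humanize` is just an illustrative name:

```python
import re
from datetime import datetime, timezone

# A leading "[1676298819]"-style prefix: ten digits in square brackets.
EPOCH_PREFIX = re.compile(r"^\[(\d{10})\]")


def humanize(line: str) -> str:
    """Replace a leading [epoch] prefix with a readable UTC timestamp."""
    match = EPOCH_PREFIX.match(line)
    if match is None:
        return line  # leave lines without a timestamp untouched
    stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return stamp.strftime("[%Y-%m-%d %H:%M:%S UTC]") + line[match.end():]


sample = "[1676298819] livestatus: Timeperiod cache not updated, there are no timeperiods (yet)"
print(humanize(sample))
```

The output is in UTC on purpose so that it does not depend on the machine it runs on. The agent-receiver log above shows an offset of +0100, so one hour has to be added to compare with the other logs.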

Hey again.
After my last post the problem went away on its own (nothing was changed) and there were no more crashes. For the past two days, however, the site has been crashing every 4-5 hours again.

Can somebody help?