Today, all of a sudden, our OMD machine hung during an "activate changes" and OMD was completely dead.
It turned out the OOM killer had killed omd.service (pretty drastic).
I think the root cause was actually simple: the machine had no swap configured. I'm not sure about this, though.
The strange thing is that our memory monitoring never showed memory filling up. The machine has 8 GB of memory configured, and with only 150 hosts and 2500 services it really should not run out…
The image proves this.
I presume the OOM killer targeted omd.service because of these log lines:
2024-01-04T15:15:59.699111+00:00 cmk systemd[1]: omd.service: State 'final-sigterm' timed out. Killing.
2024-01-04T15:15:59.699302+00:00 cmk systemd[1]: omd.service: Killing process 1159 (liveproxyd[mast) with signal SIGKILL.
2024-01-04T15:15:59.699345+00:00 cmk systemd[1]: omd.service: Killing process 1198 (liveproxyd[publ) with signal SIGKILL.
2024-01-04T15:15:59.701540+00:00 cmk systemd[1]: omd.service: Failed with result 'oom-kill'.
2024-01-04T15:15:59.702226+00:00 cmk systemd[1]: omd.service: Consumed 9h 53min 43.568s CPU time.
Does anybody have any idea why we don’t see this memory consumption in our graphs?
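One possible explanation (only a guess on my side) is that the memory spike is too short to show up at the normal check interval. To test that, a small script could sample /proc/meminfo every second and log when available memory gets low. This is a minimal sketch, assuming a Linux kernel that reports MemAvailable (3.14 or newer); the function names and the 10% threshold are made up for illustration and are not part of OMD or Checkmk.

```python
import time


def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_percent(info):
    """Percentage of total RAM the kernel reports as available."""
    return 100.0 * info["MemAvailable"] / info["MemTotal"]


def watch(interval_s=1.0, warn_below_pct=10.0, path="/proc/meminfo"):
    """Loop forever; print a timestamped line while available memory is low."""
    while True:
        with open(path) as f:
            info = parse_meminfo(f.read())
        pct = available_percent(info)
        if pct < warn_below_pct:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            print(f"{stamp} MemAvailable at {pct:.1f}%", flush=True)
        time.sleep(interval_s)
```

Calling watch() in a terminal (or redirecting its output to a file) during the next "activate changes" would show whether memory briefly runs out between two regular check runs.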
