Omd.service: A process of this unit has been killed by the OOM killer

Today, all of a sudden, our OMD machine hung during an "activate changes" and the OMD site was completely dead.
It turned out the OOM killer had killed omd.service (pretty drastic).
I think the root cause is simple: the machine had no swap configured. But I'm not sure about this.

The strange thing is that our memory monitoring never showed memory filling up. The machine has 8 GB of memory configured, and with only 150 hosts and 2500 services it really should not run out…
The image proves this.

I presume the OOM killer targeted omd.service, because of this:
2024-01-04T15:15:59.699111+00:00 cmk systemd[1]: omd.service: State 'final-sigterm' timed out. Killing.
2024-01-04T15:15:59.699302+00:00 cmk systemd[1]: omd.service: Killing process 1159 (liveproxyd[mast) with signal SIGKILL.
2024-01-04T15:15:59.699345+00:00 cmk systemd[1]: omd.service: Killing process 1198 (liveproxyd[publ) with signal SIGKILL.
2024-01-04T15:15:59.701540+00:00 cmk systemd[1]: omd.service: Failed with result 'oom-kill'.
2024-01-04T15:15:59.702226+00:00 cmk systemd[1]: omd.service: Consumed 9h 53min 43.568s CPU time.
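
For anyone who wants to check their own logs for the same thing, here is a minimal sketch that scans syslog-style lines for the systemd message shown above. It only assumes the line layout from this excerpt (timestamp first, then host, then `systemd[PID]:`); the function name `units_failed_by_oom` is made up for illustration.

```python
import re

# Matches the systemd message seen in the excerpt above, e.g.
# "systemd[1]: omd.service: Failed with result 'oom-kill'."
OOM_RESULT = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+): Failed with result 'oom-kill'"
)


def units_failed_by_oom(lines):
    """Return (timestamp, unit) pairs for units systemd marked as 'oom-kill'.

    Assumes the timestamp is the first whitespace-separated field of each
    line, as in the excerpt above.
    """
    hits = []
    for line in lines:
        match = OOM_RESULT.search(line)
        if match:
            timestamp = line.split(" ", 1)[0]
            hits.append((timestamp, match.group("unit")))
    return hits
```

Fed with the five lines above, this should report only the `omd.service` entry at 15:15:59.701540. The SIGKILL lines are systemd cleaning up after the stop timeout and are deliberately not matched.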

Does anybody have any idea why we don’t see this memory consumption in our graphs?

  1. Swap is never a solution if you are short on memory. Once your server starts swapping, you get degraded performance. So I would add a bit more memory, rather than swap.
  2. The graphs might not show the peak, as the site died. 8 GB is not necessarily too little, but depending on the exact situation, it can be.
  3. OOM is a complex topic, but the easiest solution is almost always adding more memory.
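
To take the guesswork out of the swap question and to see the current headroom independently of the graphs, something like the following could be run on the host. It is only a sketch: it assumes the standard Linux `/proc/meminfo` fields `MemTotal`, `MemAvailable` and `SwapTotal`, and the helper names `parse_meminfo` and `memory_summary` are invented here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Sized fields are reported by the kernel in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(values):
    """Summarise available memory and whether any swap is configured."""
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return {
        "available_percent": round(100.0 * available / total, 1),
        "swap_configured": values.get("SwapTotal", 0) > 0,
    }
```

On a Linux machine it could be called as `memory_summary(parse_meminfo(open("/proc/meminfo").read()))`. Note that this is a single snapshot, so like the graphs it can miss a short peak that happens between two readings.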

Robin, I fully agree that adding memory is the solution, but I prefer a degraded service over a non-working one, and for that swap is the best option.

Maybe I should alter the OOM parameters so that it never kills the omd service? Maybe this is something a default cmk install should do?

Altering OOM parameters is a tricky thing, and personally I would avoid it.
If you know what you are doing, go ahead, but I cannot recommend it.
Because even if Checkmk stays online, the OOM could kill supporting processes and impact Checkmk anyway.
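
If you just want to look before touching anything, a read-only sketch like the one below lists which processes the kernel currently rates highest for an OOM kill. It assumes the usual Linux `/proc/<pid>/comm`, `/proc/<pid>/oom_score` and `/proc/<pid>/oom_score_adj` files; the function name `read_oom_scores` is hypothetical, and nothing here changes any setting.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return per-process OOM information, highest oom_score first.

    Read-only: this only inspects the values, it does not adjust them.
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as handle:
                name = handle.read().strip()
            with open(os.path.join(base, "oom_score")) as handle:
                score = int(handle.read().strip())
            with open(os.path.join(base, "oom_score_adj")) as handle:
                adjust = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # process exited or file unreadable; skip it
        results.append(
            {"pid": int(entry), "name": name,
             "oom_score": score, "oom_score_adj": adjust}
        )
    results.sort(key=lambda item: item["oom_score"], reverse=True)
    return results
```

The first few entries of `read_oom_scores()` are the likeliest victims under memory pressure, which also shows which supporting processes would be hit if the Checkmk ones were exempted.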
