I just want to share with you my postmortem:
Problem:
The CheckMK agent leaks handles, which poses a significant threat to system stability and availability on all Windows servers.
A handle leak is a type of software bug that occurs when a computer program asks for a handle to a resource but does not free the handle when it is no longer used. If this occurs frequently or repeatedly over an extended period of time, a large number of handles may be marked in-use and thus unavailable, causing performance problems or a crash.
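To make the definition concrete, here is a minimal sketch of the pattern in Python. It is not taken from the checkmk agent code; it uses plain file descriptors as a stand-in for Windows handles, and the function names are made up for illustration.

```python
import os
import tempfile


def read_leaky(path, times):
    """Open the file repeatedly and never close it: one descriptor leaks per pass."""
    leaked = []
    for _ in range(times):
        fd = os.open(path, os.O_RDONLY)
        os.read(fd, 16)
        leaked.append(fd)  # kept only so the caller can inspect the leak
    return leaked


def read_clean(path, times):
    """Same work, but every descriptor is released as soon as it is no longer needed."""
    used = []
    for _ in range(times):
        fd = os.open(path, os.O_RDONLY)
        try:
            os.read(fd, 16)
        finally:
            os.close(fd)  # this release is the step a leaking program skips
        used.append(fd)
    return used


def is_open(fd):
    """Return True if the descriptor is still held by this process."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False
```

After `read_leaky(path, 5)` the process holds five more descriptors than before, and they stay open until the process exits or closes them explicitly. `read_clean` ends with none held.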
Applies to:
All Windows Server OS (2012/2016/2019)
with the latest available checkmk agent version
Identified Root Cause:
It seems checkmk cannot keep up with Security event log entries when “logfile security” is enabled in the config file. The Security log is set to 256 MB on all servers, with the oldest entries overwritten when the log is full. The handle leak was most obvious on DCs and VDI servers, where a lot of events are written to the event log (e.g. ~300k events per 6 hours).
The handle count rose continuously until the servers crashed. The highest handle count observed before a server crashed was 1.1 million (and maybe 11 GB of RAM, I can’t recall precisely).
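If you want to check whether your own servers show the same pattern, a rough sketch like the one below can help. It assumes the third-party `psutil` package is installed (its `Process.num_handles()` call is Windows only). The function names, the `min_growth` threshold, and the "never drops" rule are my own illustrative choices, not anything from checkmk.

```python
def is_leaking(samples, min_growth=1000):
    """Heuristic: the handle count never drops and grows by at least min_growth."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth


def sample_handles(pid, count, interval_s):
    """Collect `count` handle-count readings for one process, `interval_s` apart."""
    import time
    import psutil  # third-party; num_handles() is only available on Windows

    proc = psutil.Process(pid)
    readings = []
    for _ in range(count):
        readings.append(proc.num_handles())
        time.sleep(interval_s)
    return readings
```

For example, `is_leaking(sample_handles(agent_pid, 6, 600))` would take six readings over roughly an hour, where `agent_pid` is the process ID of the agent on your server. A healthy process normally fluctuates around a stable value, so the heuristic should return `False` for it.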
Workaround:
Restart the agent periodically (daily)
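A minimal sketch of such a restart, assuming it runs with administrator rights and is scheduled once a day (for example through Task Scheduler). The service name is deliberately a parameter because it differs between agent versions; look it up on your server first. The helper names are hypothetical.

```python
import subprocess


def restart_commands(service_name):
    """Build the stop and start commands for a Windows service."""
    return [["net", "stop", service_name], ["net", "start", service_name]]


def restart_service(service_name, run=subprocess.run):
    """Stop and then start the service; `run` can be swapped out for testing."""
    stop_cmd, start_cmd = restart_commands(service_name)
    # Stopping a service that is already stopped returns a non-zero exit code,
    # so a failure here is ignored and the start is attempted anyway.
    run(stop_cmd, check=False)
    # A failed start should be loud: check=True raises CalledProcessError.
    run(start_cmd, check=True)
```

Keep in mind this only resets the handle count; the leak starts again as soon as the agent is running.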
Solution:
Security event log parsing was disabled by commenting out the line “logfile security” and restarting the agent. No handle leak has been observed at all in the weeks since. The highest handle count on the checkmk process since Security log parsing was disabled is 232.
My personal sidenote:
This solution is not a real solution. The buggy, leaking code is still there. The community might want to look for the real root cause and fix it.