Handle leak caused by CheckMK agent on Windows Servers

I just want to share with you my postmortem:

The CheckMK agent is leaking handles, which poses a significant threat to system stability and availability on all Windows servers.
A handle leak is a type of software bug that occurs when a computer program asks for a handle to a resource but does not free the handle when it is no longer used. If this occurs frequently or repeatedly over an extended period of time, a large number of handles may be marked in-use and thus unavailable, causing performance problems or a crash.

Applies to:
All Windows Server OS (2012/2016/2019)
with the latest available checkmk agent version

Identified Root Cause
It seems checkmk is not able to keep up with security eventlog entries when “logfile security” is enabled in the config file. The security log is limited to 256 MB on all servers, with the oldest entries overwritten as needed. The handle leak was most obvious on DCs and VDI servers, where a lot of events are written to the eventlog (e.g. ~300k events per 6 hours).
The handle count kept rising until the servers crashed. The highest handle count observed before a server crashed was 1.1 million (and maybe 11 GB of RAM, I can’t recall precisely).
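To put the quoted event volume into perspective, and to show one way a steadily rising handle count could be flagged before a crash, here is a minimal Python sketch. Only the figure of 300k events in 6 hours comes from the post above. The function names, the growth threshold and the idea of sampling the handle count at intervals are assumptions for illustration, not something the agent provides.

```python
from typing import Sequence


def events_per_second(events: int, hours: float) -> float:
    """Average event rate over a window given in hours."""
    return events / (hours * 3600)


def looks_like_leak(samples: Sequence[int], min_growth: int = 1000) -> bool:
    """Rough heuristic (an assumption, not an agent feature).

    Returns True when the sampled handle counts never drop and the
    total growth from first to last sample is at least min_growth.
    """
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth


if __name__ == "__main__":
    # 300k events in 6 hours, as quoted in the post.
    print(f"{events_per_second(300_000, 6):.1f} events/s")
```

A healthy process normally shows a handle count that fluctuates around a stable value, so a series that only ever grows is worth a closer look.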

Workaround: restart the agent periodically (daily).

Security eventlog parsing was disabled by commenting out the line “logfile security” and restarting the agent. No handle leak has been observed in the weeks since. The highest handle count on the checkmk process since security log parsing was disabled was 232.
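If the change has to be rolled out to many servers, the edit could be scripted. The following Python sketch comments out every active “logfile security” line in a config text. It assumes a line-based config where the entry starts with `logfile security` and `#` introduces a comment. The function name, the section name and the `warn` levels in the usage example are hypothetical, so check them against your actual config file. The agent still has to be restarted afterwards.

```python
import re

# Assumption: the entry looks like "logfile security = <level>" and
# "#" starts a comment. Lines that are already commented out do not match.
_SECURITY_LINE = re.compile(r"^(\s*)(logfile\s+security\b.*)$", re.IGNORECASE)


def disable_security_logfile(config_text: str) -> str:
    """Return config_text with active 'logfile security' lines commented out."""
    result = []
    for line in config_text.splitlines():
        match = _SECURITY_LINE.match(line)
        if match:
            # Keep the original indentation and put "# " in front of the entry.
            result.append(f"{match.group(1)}# {match.group(2)}")
        else:
            result.append(line)
    text = "\n".join(result)
    # Preserve a trailing newline if the input had one.
    if config_text.endswith("\n"):
        text += "\n"
    return text


if __name__ == "__main__":
    # Hypothetical config excerpt, for illustration only.
    sample = "[logwatch]\n    logfile system = warn\n    logfile security = warn\n"
    print(disable_security_logfile(sample))
```

Running the function twice gives the same result as running it once, so it is safe to apply repeatedly.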

My personal side note
This is not a real solution. The buggy, leaking code is still there. The community might want to look for the real root cause and fix it :)

Please note that checkmk is not a log management solution. You need to decide whether this massive log activity is still manageable as information. Nobody can handle such a huge number of events. If this information is not needed in monitoring, please switch off the Windows logs in the Windows agent.

Are you sure that this also happens with the 1.6 or 2.0 agent? The screenshot shows an old 1.4/1.5 agent.

@ChristianM the problem is not whether this log is transferred. It already happens as soon as you enable processing of the security log in the agent config.

But it would be good to see some tests with a current agent.

Are you sure that this also happens with the 1.6 or 2.0 agent? The screenshot shows an old 1.4/1.5 agent.

Yes, I’m sure. Same package was rolled out to all of our servers:

PS C:\Program Files (x86)\check_mk> .\check_mk_agent.exe version
Check_MK_Agent version 1.6.0p11