Windows Agent hangs when RAM is Full until Service is restarted manually

CMK version: 2.0.0p24 (CEE)
OS version: Debian 9.13 (CheckMK Server) Microsoft Windows Server 2012 R2 Standard (Monitored Host)

Error message:
2022-08-03 04:32:32.404 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:32:32.405 [srv 11748] [Err  ] queue is overflown

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
n/a

Hello, we have a problem that occurs when the RAM usage of the Windows server gets very high, near 100%. The agent goes into a stale state in which it throws the above error message until it is restarted. Sadly, the issue doesn’t resolve by itself once enough memory is free again.

The agent log looks like this, going from “service is working” to “stale” to “manual restart of the service by an admin”:

2022-08-03 03:52:23.174 [srv 11748] perf: Section 'local' took [1] milliseconds
2022-08-03 03:52:23.200 [srv 11748] Received [128] bytes from 'local'
2022-08-03 03:52:35.483 [srv 11748] perf:  In [16076] milliseconds process 'powershell.exe -NoLogo -NoProfile -ExecutionPolicy Bypass -File "C:\ProgramData\checkmk\agent\plugins\windows_if.ps1"' pid:[32380] SUCCEDED - generated [0] bytes of data in [0] blocks
2022-08-03 03:52:35.485 [srv 11748] [Warn ] Process 'C:\ProgramData\checkmk\agent\plugins\windows_if.ps1' has no data
2022-08-03 03:55:29.190 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 03:58:20.057 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:01:29.392 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:04:20.244 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:05:30.351 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:07:30.398 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:10:21.240 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:11:30.595 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:13:30.899 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:16:21.434 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:17:31.795 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:19:31.779 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:22:21.633 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:23:31.965 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:25:31.938 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:28:22.584 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:29:33.375 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:29:33.377 [srv 11748] [Err  ] queue is overflown
2022-08-03 04:31:32.062 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:31:32.063 [srv 11748] [Err  ] queue is overflown
2022-08-03 04:32:32.404 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:32:32.405 [srv 11748] [Err  ] queue is overflown
2022-08-03 04:34:22.767 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:34:22.768 [srv 11748] [Err  ] queue is overflown
2022-08-03 04:35:23.070 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 04:35:23.072 [srv 11748] [Err  ] queue is overflown
...
2022-08-03 07:47:24.948 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 07:47:24.950 [srv 11748] [Err  ] queue is overflown
2022-08-03 07:48:25.165 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 07:48:25.166 [srv 11748] [Err  ] queue is overflown
2022-08-03 07:49:25.372 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue
2022-08-03 07:49:25.374 [srv 11748] [Err  ] queue is overflown
2022-08-03 07:50:00.483 [srv 11748] Initiating stop routine...
2022-08-03 07:50:00.484 [srv 11748] Stop Service called
2022-08-03 07:50:00.486 [srv 11748] [Trace] Stop request is set
2022-08-03 07:50:00.488 [srv 11748] [Trace] main Wait Loop END
2022-08-03 07:50:00.489 [srv 11748] Shutting down IO...
2022-08-03 07:50:00.490 [srv 11748] [Trace] Stopping execution
2022-08-03 07:51:19.915 [srv 18500] [Trace] Enabled Base
2022-08-03 07:51:19.923 [srv 18500] [Trace] Setting root. service: 'CheckMkService', preset: ''
2022-08-03 07:51:19.924 [srv 18500] [Trace] Try service: 'CheckMkService'
2022-08-03 07:51:19.925 [srv 18500] [Trace] Try registry 'CheckMkService'
2022-08-03 07:51:19.927 [srv 18500] [Trace] Service is found 'C:\Program Files (x86)\checkmk\service\check_mk_agent.exe'

The only thing I could find about the error “[Err ] queue is overflown” is in the code of the agent, in this file in line 244.

I’m not a programmer myself, but judging by the error message and the code snippet that throws it, my wild guesses are:

  • a problem with the connection queue from the cmk server, maybe an open session doesn’t get released so it queues the new connection?
  • the thread of the agent cannot be woken up from a sleep / idle state?

Thanks for any insight you can provide

Hi

I can’t guarantee that the Windows agent will work correctly when free RAM is low.
This use case was never tested or scheduled for testing.

About the overflown queue.

There are two threads that process connections in the Windows agent.
Thread number 1 accepts incoming connections and places each one on a shared queue.
Thread number 2 constantly picks up the first connection from the queue and processes it.
In the log you would normally see something like

2022-08-03 13:36:17.750 [app 40200] Connected from '127.0.0.1:51476' ipv4 -> queue
2022-08-03 13:36:17.750 [app 40200] Connected from '127.0.0.1' ipv4 port: 51476 <- queue

The first line is produced by the first thread,
the second one by the second thread.

In your case, the second thread is simply hung, so nothing is taken off the queue any more and it fills up until new connections are rejected. The cause is most likely an internal error due to the lack of memory.
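
To make the mechanism concrete, here is a minimal single-threaded sketch of the two roles described above. It is not the agent's code: the class and function names are hypothetical, and the limit of 16 is an assumption chosen for illustration (it happens to match the 16 connections logged between 03:55 and 04:28 before the first error, but the real limit may differ).

```python
from collections import deque

MAX_QUEUE = 16  # assumed limit for illustration, not taken from the agent source


class ConnectionQueue:
    """Toy model of the shared queue between the two agent threads."""

    def __init__(self, limit=MAX_QUEUE):
        self.limit = limit
        self.items = deque()
        self.log = []

    def accept(self, peer):
        """Role of thread 1: log the connection, then try to enqueue it."""
        self.log.append(f"Connected from '{peer}' -> queue")
        if len(self.items) >= self.limit:
            self.log.append("[Err  ] queue is overflown")
            return False
        self.items.append(peer)
        return True

    def process_one(self):
        """Role of thread 2: take the oldest connection and handle it."""
        if not self.items:
            return None
        peer = self.items.popleft()
        self.log.append(f"Connected from '{peer}' <- queue")
        return peer


def simulate(healthy, hung, limit=MAX_QUEUE, peer="10.161.101.99"):
    """Run `healthy` normal polls, then `hung` polls with thread 2 stuck."""
    q = ConnectionQueue(limit)
    for _ in range(healthy):
        q.accept(peer)
        q.process_one()
    rejected = 0
    for _ in range(hung):
        # Thread 2 never runs here, so the queue can only grow.
        if not q.accept(peer):
            rejected += 1
    return q, rejected


if __name__ == "__main__":
    q, rejected = simulate(healthy=3, hung=20)
    print("\n".join(q.log))
    print("rejected connections:", rejected)
```

Once the consumer side stops, freeing memory does not help by itself: the queue stays full until the hung thread (in practice, the service) is restarted, which matches what you observed.
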
Just FYI, the Windows agent may require a LOT of RAM, because:

  • Win32 API calls may temporarily require megabytes
  • external plugins (PowerShell or Python) may require hundreds of megabytes
  • the agent’s own spawned process for gathering perf data requires at least 10-20 MB

I would suggest:

  • increase RAM (if possible)
  • increase the pagefile size substantially (this is always possible)
  • if you are using 32-bit Windows, switch to 64-bit ASAP: a lack of address space cannot be worked around
  • check the Windows event log for unusual events
  • enable crash monitoring to see whether the Windows agent crashes
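
If you want to spot this state before an admin notices stale services, the agent log itself can be scanned for the pattern described above. The following is only a rough sketch: the function name is hypothetical, and the line layout it expects is an assumption based on the log excerpts in this thread, so it may need adjusting for other agent versions.

```python
import re

# Assumed layout, taken from the excerpts in this thread:
# "<date> <time> [<tag> <pid>] <message>"
LINE_RE = re.compile(r"^(\S+ \S+) \[\w+\s+\d+\]\s+(.*)$")


def summarize_queue_activity(lines):
    """Count queue-related events in an iterable of agent log lines."""
    enqueued = dequeued = overflows = 0
    first_overflow = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip "..." and anything not in the assumed layout
        timestamp, message = match.groups()
        if message.endswith("-> queue"):
            enqueued += 1      # written by the accepting thread
        elif message.endswith("<- queue"):
            dequeued += 1      # written by the processing thread
        elif "queue is overflown" in message:
            overflows += 1
            if first_overflow is None:
                first_overflow = timestamp
    return {
        "enqueued": enqueued,
        "dequeued": dequeued,
        "overflows": overflows,
        "first_overflow": first_overflow,
    }


if __name__ == "__main__":
    sample = [
        "2022-08-03 04:28:22.584 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue",
        "2022-08-03 04:29:33.375 [srv 11748] Connected from '10.161.101.99' ipv6 :false -> queue",
        "2022-08-03 04:29:33.377 [srv 11748] [Err  ] queue is overflown",
    ]
    print(summarize_queue_activity(sample))
```

Any non-zero overflow count means the processing thread has stopped draining the queue. You can pass the function an open file handle for the agent log and alert on the result.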

Regards,
Sergej Kipnis