Windows Update Check Memory Leak

CMK version: 2.0.0.p23
OS version: Windows Server 2019 Core

Error message: As long as I have the check for Windows updates installed and in use, memory leaks over days. I can see that it is the Windows Update service that keeps pulling more and more memory until the machine kills it or crashes. I looked into many other things before remembering the check for Windows updates and removing it, and now the machine is running normally (with Checkmk still working).

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

As there are no confirming reports, this seems like an isolated issue with your server’s update component. Is this at all reproducible on other servers?

Yes, it happened on all my servers, no matter whether it was the Raw or Enterprise edition.
So I guess there is a configuration that I use that others don’t. In total it is over 50 Windows servers (ranging from 2016 to 2019, Core and Desktop Experience).
The only thing I may use that others don’t:
I use the PowerShell module PSWindowsUpdate for installing the updates, but as I only run this once a month, I can’t see a reason why this should interfere.

We have the same issue, just as the original post describes. Had it happening when using CheckMK version 1.6.0p24, and now that we have CheckMK 2.2.0p3, it is still happening.

Over the past month, I’ve run a test to verify this. I created two Windows 10 virtual machines:

  1. TEST_WITH
  2. TEST_WITHOUT

The machines are nearly identical, being created from the same VM template. After they were created, they got all existing Windows updates, and the CheckMK agent was applied to both… except for one difference: the machine “TEST_WITH” has the Windows update plugin, which TEST_WITHOUT does not. Neither machine has been doing anything, other than nightly backups by the Rubrik VM backup system.

Why do this test? I’ve been seeing for a long time that our Windows servers monitored in CheckMK would run out of RAM, and the memory usage service graph would all show a steady uptick in usage over a matter of days, as if there is a memory leak. If I open Task Manager on one of the problem machines, it will show something like this:

image

Restart the computer, and memory usage is normal – for a while – but creeps upward. This needed to be tested to be sure, so I set up these two VMs with the only differences being their IP addresses (unavoidable but decidedly immaterial to the test) and the presence or absence of the Windows update plugin for CheckMK.

Here is the TEST_WITHOUT memory graph for the past 35 days:

As you can see, it has reached a steady-state of RAM usage and has held there. Task Manager on this machine shows the Windows Update service process like this:
image

This seems pretty normal.

On the TEST_WITH machine over the same 35 day time period, this is how the service graph looks:

I believe the drop-off in memory usage for both systems on 8/9 was due to a restart of both systems that day.

Interestingly, the memory service graph drops out on 8/23, and a check of Task Manager on this machine looks normal, though obviously the CheckMK agent isn’t reporting back. Note: this is unusual behavior. Normally the machines become so burdened with the lack of RAM that they simply stop functioning, to the point they won’t even display video on the VM, requiring a hard power reset.

Here is the check_mk.user.yml file on TEST_WITH:

---
global:
  enabled: true
plugins:
  enabled: true
  execution:
    - pattern: $CUSTOM_PLUGINS_PATH$\windows_udpates.vbs
      run: yes
      cache_age: 900
      async: yes
      timeout: 120
      retry_count: 3
logwatch:
  enabled: false
  sendall: false
  vista_api: no
  skip_duplicated: true
  max_size: 500000
  max_line_length: -1
  max_entries: -1
  timeout: -1
  logfile:
    - "*": off context
    - Parameters: ignore
    - State: ignore

Here is that same file on TEST_WITHOUT (spoiler: the files are identical):

---
global:
  enabled: true
plugins:
  enabled: true
  execution:
    - pattern: $CUSTOM_PLUGINS_PATH$\windows_udpates.vbs
      run: yes
      cache_age: 900
      async: yes
      timeout: 120
      retry_count: 3
logwatch:
  enabled: false
  sendall: false
  vista_api: no
  skip_duplicated: true
  max_size: 500000
  max_line_length: -1
  max_entries: -1
  timeout: -1
  logfile:
    - "*": off context
    - Parameters: ignore
    - State: ignore

I’ve checked with the IT manager at another of our business offices in another state, and he reports seeing the same thing on his VMs there. This isn’t an isolated issue, but it isn’t a well-reported one either.

My suspicion is that the windows_updates.vbs plugin runs but doesn’t free up all the resources it uses, then runs again a few minutes later, and again, and again, each time adding to the resources being consumed: a classic memory leak.

Update:
In order to slow down the windows_updates.vbs memory leak problem, I put in place a procedure to automatically restart the Windows Update service every six hours. This does free up memory being used by this service (good thing), BUT…

In this pic, you see two regions over the red lines. Those show the memory leak prior to auto-restarts of the Windows Update service.
Then there is the region over the yellow line. That shows RAM usage after auto-restarts of Windows Update.

It is better, but unfortunately it is still going up over time, just not as quickly. This is still a problem caused by the windows_updates.vbs plugin running on a machine… our test machine that isn’t running that plugin doesn’t show this problem.

I talked to our developers, and we can say with a huge degree of certainty: This is not a Checkmk issue. We do not know why it affects just some users, and we have not recognized any pattern so far. It is the Windows Update service that is misbehaving and hogging the memory. The Checkmk agent only triggers this issue, but is not its cause.

Fingers crossed that someone will eventually figure it out, or that Microsoft releases a fix for this (if they are even aware of the issue).

These settings can also be, to some extent, the reason for this problem.
Only 15 minutes of cache age, why?

Why not? If this is increased, does it stop the memory leak? Does it cut down on the responsiveness of reporting of the Windows update info? What is the recommended value?

Pat

On most systems there is no need to check more than once or twice a day.

Whether it stops, I don’t know, but you should see a much slower increase.
Two checks per day compared to 96 checks with your config.
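
To make that comparison concrete, here is a minimal sketch of the arithmetic. It assumes the plugin runs at most once per cache_age window; the helper name runs_per_day is made up for illustration and is not part of Checkmk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def runs_per_day(cache_age_seconds: int) -> int:
    """Upper bound on plugin runs per day, assuming one run per cache_age window."""
    return SECONDS_PER_DAY // cache_age_seconds


print(runs_per_day(900))    # cache_age from the posted config (15 minutes): 96
print(runs_per_day(43200))  # a 12-hour cache_age, i.e. twice a day: 2
```

If each run leaves a roughly constant amount of memory behind, the growth rate should scale with these numbers, which is why a longer cache_age would slow the increase rather than stop it.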

damn, i copied this typo and was wondering why my update script doesn’t work :rofl:

So in essence, this is just slowing the memory leak, it isn’t an actual fix, right?

Other than the memory leak, are there any other reasons not to run that plugin as frequently as I am?

I’m really curious as to what is chewing up all that RAM…

Hahahahahaha! I kept reading and reading and reading your post, not understanding what you were talking about… and then I FINALLY noticed it.

Thakns! :wink:

One: The memory leak is on the Windows side of things, not Checkmk.
Two: Well, how often does Microsoft release updates? The last time I checked, the infamous “patch Tuesday” is still there. Of course there are sometimes intermittent updates, but then again: How frequently do you actually update your Windows servers? I think this should be enough food for thought on the interval. :slight_smile:

One: I don’t think the memory leak is on the Windows side (I’ll explain in a moment).
Two: By that logic, why even check for updates with CheckMK at all then? Just apply your updates once a month and be done, right? Well, maybe you initiate your updates from a WSUS server. Once you do, you want to see which machines really require updates. Oops, if you only check for updates twice a day, CheckMK is going to be a poor way of seeing if your machines need an update, unless you’re lucky and the infrequent check just happens to be right after you approve updates from WSUS. And as you apply updates to your computers, you don’t have feedback on which were done or not (or might have additional updates not applied on the first round). Well, you do have feedback, but have to wait 12 hours to get it.

That is, unless you check for updates more frequently. 15 minutes isn’t too bad (which is what we’re using). And yes, I release a bunch of WSUS updates, then look at our server status dashboard and see what servers have updates waiting:

Updates can be applied to a server, rebooted, and sometimes there are still updates waiting. Why? Because some only become visible once the rest are done. But waiting for 12 hours for feedback isn’t efficient. If I see yellow or red in the “updates” column, I know there are updates pending for that system. You can see a couple of machines already that have updates pending.

Now, back to issue one…

As @gerald.endres pointed out, there was a typo in my check_mk.user.yml file… I had misspelled “updates” as “udpates”. Oops. Whatever was going on, this was at the CheckMK level. Once the spelling was corrected, the YML file was pushed out to all the affected computers, and the CheckMK service restarted. Since then…

The frequency of the checks has remained unchanged. I did turn off the auto-restarts of the Windows Update service, since that was really a crutch.

What I don’t know is this:
If the check_mk.user.yml file had a misspelling in the path to the windows_updates.vbs file, then that file shouldn’t have been able to be located, should have never run, and this shouldn’t even be an issue. But it was still running (the function was happening on my dashboard, and the memory leak was using the Windows Update function). So there must be another location in the configuration that has the correct path to the plugin, and that other location must be being used. How? And once the check_mk.user.yml plugin path typo was corrected, the memory leak problems seem to have gone away. Did the typo somehow cause a situation where an instance of Windows Update was accessed but never actually closed, and this repeats every fifteen minutes, causing a loss of free RAM? If so, then yes, Windows has a bunch of copies of one of its components open, but isn’t that at the behest of CheckMK?

If you don’t specify conditions for plugin scripts inside “$CUSTOM_PLUGINS_PATH$”, then the scripts run at every check interval (one minute, or whatever you use on your system).

By default, every script inside "C:\ProgramData\checkmk\agent\plugins" is run.
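
A rough way to spot this kind of mismatch is to compare the configured patterns against the plugin files that are actually installed. The sketch below is only a simplified model: it assumes glob-style, case-insensitive matching on the full path, which is not confirmed anywhere in this thread, and the function name unmatched_plugins and the sample lists are made up for illustration.

```python
from fnmatch import fnmatchcase

# Plugin directory as named in the post above; adjust for your installation.
PLUGIN_DIR = r"C:\ProgramData\checkmk\agent\plugins"


def unmatched_plugins(installed, patterns, plugin_dir=PLUGIN_DIR):
    """Return installed plugin files that no configured pattern matches.

    Assumption: patterns are glob-style and compared case-insensitively
    against the full path. The real agent may behave differently.
    """
    resolved = [
        p.replace("$CUSTOM_PLUGINS_PATH$", plugin_dir).lower() for p in patterns
    ]
    missing = []
    for name in installed:
        full_path = (plugin_dir + "\\" + name).lower()
        if not any(fnmatchcase(full_path, pat) for pat in resolved):
            missing.append(name)
    return missing


# Hypothetical inputs mirroring the thread: the file on disk is spelled
# correctly, the pattern in check_mk.user.yml is not.
installed = ["windows_updates.vbs"]
patterns = [r"$CUSTOM_PLUGINS_PATH$\windows_udpates.vbs"]
print(unmatched_plugins(installed, patterns))
```

Any file this reports would, per the explanation above, fall back to the default behaviour and run at every check interval, ignoring the cache_age and async settings written for the misspelled pattern.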

Most users I know use the update monitoring to catch stray systems, which were not updated for longer periods of time. They do not use it real-time-ish. They have some form of automation or procedure in place, that takes care of patching primarily. Checkmk verifies proper functionality of these procedures.
Of course, one can run the plugin more often at an increased load on the monitored system. But as you say, that might be perfectly fine, if the use-case dictates it.
