CMK version: Checkmk Raw Edition 2.1.0p16
OS version: Ubuntu 20.04 LTS (server), Ubuntu 18.04 LTS (agent)
Error message: N/A
We have an issue with a couple of our monitored systems generating (comparatively) large cache files on the Checkmk server. For most hosts the cache file is a reasonable size (< 100 KiB for Windows hosts, ~150 KiB for regular Linux hosts, and ~400 KiB for Kubernetes nodes), but we have a couple of hosts where the cache file is considerably larger (1.25 MiB for one, 3.71 MiB for the other).
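For reference, those numbers come from the per-host cache files on the server; a quick way to list them by size (assuming a standard OMD layout, with `mysite` as a placeholder for the site name):

```
# On the Checkmk server, as the site user: agent output is cached
# one file per host under the site's tmp directory (standard OMD layout).
sudo su - mysite
ls -lhS ~/tmp/check_mk/cache/ | head    # largest cache files first
```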
On both affected hosts, the bulk of the data is in the `[status]` portion of the `systemd_units` section. The start of the `[status]` portion looks like this:
```
[status]
● someHost.domain.com
    State: running
     Jobs: 0 queued
   Failed: 0 units
    Since: Thu 2023-01-05 23:24:24 CST; 1 months 3 days ago
   CGroup: /
           ├─kubepods
           │ ├─pod505bce0a-b74f-4740-b007-9b43c136fe21
           │ │ ├─dca7e607efc97003e962eb407cde679a4543d90eb8a87ed10ed7c925980d5a44
           │ │ │ └─7657 /server
           │ │ └─6466216afd074fb0a660566a21be1f243fd829e6087ff3a978ce73f9fa46f0cd
           │ │   └─6074 /pause
           │ ├─burstable
           │ │ ├─podb76ecf9f-a80b-4369-8d70-ee8500c996b1
           │ │ ├─podb7640d11-b027-4022-9ea1-0d89839ea691
           │ │ ├─pod2138cf90-4da2-44e9-a8c2-15d6b34b75b3
```
From here on, there are thousands of lines like the last three shown above (~65k of them in the case of the 3.71 MiB file). Has anyone seen anything like this before? We would like to understand why the cache files for the affected hosts are so large.
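In case it helps anyone reproduce the numbers: here is a minimal sketch of how the per-section sizes can be tallied for one cached file (the split on `<<<...>>>` matches the agent's section delimiter; it ignores piggyback markers and skips the header lines themselves, and `affected-host` is a placeholder):

```
# Rough per-section byte count for one cached agent output file.
awk -F'[<>]+' '
    /^<<<[^>]+>>>/ { sec = $2; next }   # section header: remember the name
    { bytes[sec] += length($0) + 1 }    # body line: add its length (+ newline)
    END { for (s in bytes) printf "%12d  %s\n", bytes[s], s }
' ~/tmp/check_mk/cache/affected-host | sort -rn | head
```

This is roughly how we confirmed that the bulk of the data sits in `systemd_units`.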
In case it is relevant: we are currently monitoring two separate Kubernetes clusters, and one host (out of three) in each cluster is affected.
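Judging by the sample above, the `[status]` portion appears to be the output of a plain `systemctl status` call, and its CGroup tree lists every cgroup on the machine, which on a Kubernetes node means one line per pod and per container. If that reading is right, the size should be reproducible directly on an affected node:

```
# On an affected node: if [status] is plain "systemctl status" output,
# this line count should be of the same order as the cache file's.
systemctl status --no-pager --full | wc -l

# The cgroup tree on its own (one line per cgroup), for comparison:
systemd-cgls --no-pager | wc -l
```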
Thanks in advance,
Jason