Large Cache Files

CMK version: Checkmk Raw Edition 2.1.0p16
OS version: Ubuntu 20.04 LTS (server), Ubuntu 18.04 LTS (agent)

Error message: N/A

We have an issue with two of our monitored systems generating comparatively large cache files on the Checkmk server. For most hosts the cache file is a reasonable size (under 100 KiB for Windows hosts, around 150 KiB for regular Linux hosts, and around 400 KiB for Kubernetes nodes), but for these two hosts the cache file is considerably larger (1.25 MiB for one, 3.71 MiB for the other).
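(For anyone who wants to compare against their own site: cached agent output normally lives under the site user's ~/tmp/check_mk/cache/, so something like the following lists the largest cache files first. The path assumes a standard OMD site layout.)

# run as the site user; lists the largest agent cache files first
ls -lhS ~/tmp/check_mk/cache/ | head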

On both affected hosts, the bulk of the data is in the [status] portion of the systemd_units section. The start of the [status] portion looks like this:

[status]
● someHost.domain.com
 State: running
 Jobs: 0 queued
 Failed: 0 units
 Since: Thu 2023-01-05 23:24:24 CST; 1 months 3 days ago
 CGroup: /
 ├─kubepods
 │ ├─pod505bce0a-b74f-4740-b007-9b43c136fe21
 │ │ ├─dca7e607efc97003e962eb407cde679a4543d90eb8a87ed10ed7c925980d5a44
 │ │ │ └─7657 /server
 │ │ └─6466216afd074fb0a660566a21be1f243fd829e6087ff3a978ce73f9fa46f0cd
 │ │ └─6074 /pause
 │ ├─burstable
 │ │ ├─podb76ecf9f-a80b-4369-8d70-ee8500c996b1
 │ │ ├─podb7640d11-b027-4022-9ea1-0d89839ea691
 │ │ ├─pod2138cf90-4da2-44e9-a8c2-15d6b34b75b3

From there, the output continues with thousands more lines like the last three shown above (roughly 65k lines in the case of the 3.71 MiB file). Has anyone seen anything like this before? We would like to understand why the cache files for the affected hosts are so large.
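For what it is worth, the [status] block resembles the output of a plain systemctl status call (no unit argument), which prints the overall system state plus the full control-group tree, one line per cgroup. Since every Kubernetes pod and container gets its own cgroup, a busy node inflates that tree considerably. A rough way to gauge the tree size directly on an affected host (our guess at the mechanism, not confirmed against the agent source):

# plain `systemctl status` prints system state plus the whole cgroup tree
systemctl status --no-pager | wc -l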

In case it is relevant, we are currently monitoring two separate Kubernetes clusters, and exactly one of the three hosts in each cluster is affected.

Thanks in advance,
Jason

Is there really an issue, or have you just stumbled across the file sizes?

Hi Robert,

Fair question.

We are using the agentfiles plugin to monitor file sizes on the Checkmk server. The larger of the two cache files is big enough to trip that check into a WARNING state, so from that perspective there is an issue.

That said, the real issue may simply be a lack of understanding on our part. Why is the systemd_units section so large for the affected hosts? Why is only one host per Kubernetes cluster affected? Is this normal, or an indication of something we should be paying closer attention to?
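In case it helps, here is a rough way to break a cache file down by section, to confirm that systemd_units really dominates. The path assumes a standard OMD site, and the hostname is just the one from the excerpt above:

# approximate size of each <<<section>>> in one agent cache file, largest first
awk '/^<<<[^>]*>>>/ { sec = $0 }
     { bytes[sec] += length($0) + 1 }
     END { for (s in bytes) printf "%10d  %s\n", bytes[s], s }' \
    ~/tmp/check_mk/cache/someHost.domain.com | sort -rn | head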

Any advice is appreciated.

Thank you,
Jason
