Kubernetes Integration Timeout on Checkmk 2.4.0p7 RAW

CMK version: 2.4.0p7 RAW
OS version: Red Hat Enterprise Linux 8
Error message: see below
Output of “cmk --debug -vvn hostname”: see below

Hello everyone,

I’m facing a persistent timeout issue when monitoring a large Kubernetes cluster using Checkmk’s built-in Kubernetes integration on version 2.4.0p7 RAW.


Context

  • Checkmk version: 2.4.0p7 RAW
  • Integration method: agent_kube via rule under Setup > Agents > VM, cloud, container > Kubernetes
  • Cluster size: over 180 nodes and 1,900 pods
  • Selected objects: Only Deployments are collected to reduce load

Issue

When I launch service discovery or run a manual check from the CLI, I get a timeout error such as:

Error running automation call service-discovery-preview: Your request timed out after 110 seconds.
This issue may be related to a local configuration problem or a request which works with a too large number of objects.

Command line output:

OMD[mysite]:~$ cmk --debug -vvn <my_k8s_host>
[special_kube] Success, [piggyback] Success (but no data found for this host), execution time 244.7 sec

The data collection itself completes successfully, but it takes more than 240 seconds, far above the 110-second automation timeout, so service discovery and checking from the web interface always fail with a timeout.
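For reference, the raw data fetch can be reproduced on its own; as far as I know, cmk -d dumps a host’s raw agent output (including special-agent sections), so timing it separates the collection from discovery and check evaluation:

OMD[mysite]:~$ time cmk -d <my_k8s_host> > /dev/null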


What I’ve already tried

  • Enabled the TCP timeout options in the rule (connect timeout = 60 s, read timeout = 400 s)
  • Increased execution timeouts by editing global.mk:
    execution_timeouts = {'check': 600, 'discover': 600}
  • Limited the scope to only Deployments (no Pods, Nodes, etc.)
  • Tried both direct and Cluster Collector endpoints
  • Manually ran agent_kube: returns valid output after ~4 minutes

What I’m looking for

  1. A reliable way to avoid timeouts for this large Kubernetes cluster
  2. Whether it’s better to switch to a caching/piggyback approach (e.g. via an agent-side or cron-based script)
  3. Confirmation on whether this is an expected limitation or if improvements are planned

Thank you in advance for any advice or insight you can provide.

Best regards,

I think this will be the only option with this cluster.
The idea: a local script is executed every 5 minutes and fetches the agent output. You then put this output into the agent’s “spool” folder on the host and give the spool file a name that marks it as valid for a few minutes. On the next regular check_mk agent contact the data is transferred to Checkmk and should produce the piggyback data for all the objects.
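Below is a minimal sketch of that approach, assuming the Checkmk server itself also runs a Checkmk agent, the site is called mysite, and 600 seconds is an acceptable maximum age for the spooled data; the script name and the fetch command are placeholders to adapt to your setup.

# root crontab entry: refresh the Kubernetes output every 5 minutes
*/5 * * * * /usr/local/bin/spool_kube_output.sh

#!/bin/bash
# spool_kube_output.sh (sketch): fetch the Kubernetes agent output out of band and
# drop it into the agent's spool directory, so the normal check_mk agent contact
# delivers it and Checkmk builds the piggyback data from it.
SPOOL_DIR=/var/lib/check_mk_agent/spool
TMP_FILE=$(mktemp)

# Placeholder fetch: use whatever invocation produced your manual ~4-minute output.
# "cmk -d <my_k8s_host>", run as the site user, dumps the raw agent output of the
# host, provided the relevant timeouts allow the long collection to finish.
su - mysite -c "cmk -d <my_k8s_host>" > "$TMP_FILE" || { rm -f "$TMP_FILE"; exit 1; }

# Move the finished file into place. The numeric prefix (600) is the maximum age in
# seconds; if the spool file is older than that, the agent ignores it.
mv "$TMP_FILE" "$SPOOL_DIR/600_agent_kube"

The cluster objects then get their data purely from piggyback, so no live agent_kube call has to finish within the 110-second web interface limit.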