CMK version:
2.4.0p7 RAW
OS version:
Red Hat 8
Error message: see below
Output of "cmk --debug -vvn hostname": see below
Hello everyone,
I’m facing a persistent timeout issue when monitoring a large Kubernetes cluster using Checkmk’s built-in Kubernetes integration on version 2.4.0p7 RAW.
Context
- Checkmk version: 2.4.0p7 RAW
- Integration method: agent_kube via rule under Setup > Agents > VM, cloud, container > Kubernetes
- Cluster size: roughly 180+ nodes and 1,900+ pods
- Selected objects: only Deployments are collected, to reduce load
Issue
When launching service discovery or running a manual check from the CLI, I get a timeout error such as:
Error running automation call service-discovery-preview: Your request timed out after 110 seconds.
This issue may be related to a local configuration problem or a request which works with a too large number of objects.
Command line output:
OMD[mysite]:~$ cmk --debug -vvn <my_k8s_host>
[special_kube] Success, [piggyback] Success (but no data found for this host), execution time 244.7 sec
Data collection itself succeeds, but at more than 240 seconds it is far over the 110-second limit, so service discovery and the related pages in the Checkmk web interface always run into timeouts.
What I’ve already tried
- Enabled the TCP timeout options in the rule (Connect timeout = 60, Read timeout = 400)
- Increased execution timeouts by editing global.mk: execution_timeouts = {'check': 600, 'discover': 600}
- Limited the scope to only Deployments (no Pods, Nodes, etc.)
- Tried both direct and Cluster Collector endpoints
- Manually ran agent_kube: it returns valid output after ~4 minutes (timed roughly as in the snippet after this list)
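For reference, the manual run was timed roughly with the wrapper below. The agent_kube arguments are placeholders; the real command line is the one Checkmk generates from the rule:

#!/usr/bin/env python3
# Rough timing wrapper for the manual special agent run.
import subprocess
import time

# Placeholder: copy the real invocation from the generated special agent call
cmd = ["agent_kube", "--cluster", "my-cluster"]

start = time.monotonic()
proc = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.monotonic() - start
print(f"exit={proc.returncode}, {len(proc.stdout)} bytes of agent output in {elapsed:.1f}s")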
What I’m looking for
- A reliable way to avoid timeouts for this large Kubernetes cluster
- Whether it’s better to switch to a caching/piggyback approach (e.g. via the agent or a cron-based script; a rough sketch of what I have in mind is below)
- Confirmation on whether this is an expected limitation or if improvements are planned
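To make the caching/piggyback question more concrete, this is the kind of cron-driven wrapper I have in mind. The agent command line, cache path and site name are placeholders, not the call Checkmk actually generates; the idea is that cron refreshes a cache file out of band, and the host is switched to something like an "Individual program call instead of agent access" rule that simply cats that file, so discovery never has to wait on the Kubernetes API:

#!/usr/bin/env python3
# Cron-driven cache refresh for the slow Kubernetes special agent.
# All arguments and paths below are placeholders for my setup.
import subprocess
import tempfile
from pathlib import Path

# Placeholder: copy the real invocation from the generated special agent call
AGENT_CMD = ["agent_kube", "--cluster", "my-cluster"]
CACHE_FILE = Path("/omd/sites/mysite/var/check_mk/cache/agent_kube_my_cluster.out")
TIMEOUT = 900  # seconds; a full run currently takes ~4 minutes

def refresh_cache() -> None:
    # Run the special agent and capture its stdout
    result = subprocess.run(AGENT_CMD, capture_output=True, text=True,
                            timeout=TIMEOUT, check=True)
    # Write to a temporary file first, then rename, so Checkmk never
    # reads a half-written cache
    with tempfile.NamedTemporaryFile("w", dir=CACHE_FILE.parent,
                                     delete=False) as tmp:
        tmp.write(result.stdout)
    Path(tmp.name).replace(CACHE_FILE)

if __name__ == "__main__":
    refresh_cache()

The host's datasource program would then just be a cat of that cache file, while cron refreshes it every few minutes. Is that the intended way to handle clusters of this size, or is there a supported caching option I'm missing?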
Thank you in advance for any advice or insight you can provide.
Best regards,