Stale Check_MK Discovery: (Service Check Timed Out)

peterge · June 24, 2024, 6:15am

CMK version: Checkmk Raw Edition 2.2.0p27
OS version: Debian 11

Error message:
Sometimes these services for 20+ hosts show up as CRIT:

If I run “Reschedule Check”, they become OK again.
I do not understand why I have to do this manually, or why they become stale like this.
Rebooting the host does not fix this.
Is there any way I can mass reschedule all of them?

I guess the best for me is to upgrade to 2.3?

peterge · June 24, 2024, 6:57am

Just updated to 2.3, and those aren't showing up anymore. I guess this can be marked as solved.

peterge · June 28, 2024, 6:30am

The same is happening on 2.3.0p6

peterge · June 28, 2024, 8:05am

This is not a solution, but a workaround I just found: By displaying the reschedule icon, I can reschedule the services much faster: How to show "Reschedule check" button for all hosts? - #2 by aeckstein

Norm · June 30, 2024, 2:12pm

Hi @peterge,

can you please share the server specs of the Checkmk server? (CPU, RAM)
And can you share how many hosts and services you are trying to monitor with that affected Checkmk site?

Thanks in advance!

Norm

peterge · July 1, 2024, 9:07am

We have only one site on the server (Proxmox LXC container with Debian 11.10). These are the specs:

RAM: 4GB
SWAP: 512MB
CPU Cores: 8
Disk: 100GB

Norm · July 1, 2024, 9:54am

Hi @peterge,

Running the Checkmk server inside an LXC container is not recommended and is probably the root cause of the issues you are experiencing.

My suggestion would be to move your Checkmk server to a proper VM, or maybe give this HowTo a try.

Regards
Norm

andreas-doehler · July 1, 2024, 10:02am

Why should this be the root cause? It is perfectly fine inside an LXC.
I have some systems running in this configuration.

Norm · July 1, 2024, 10:07am

I had issues in the past with LXC setups. Moving to a VM or bare metal always helped in my cases.

I never tried the HowTo: [any edition] on a (proxmox) LXC container, but maybe it can help.

peterge · July 1, 2024, 10:32am

Weird. We never experienced this problem before upgrading to 2.2.0p27 & 2.3.0p6.
IMHO this can't be a problem caused by LXC…

andreas-doehler · July 1, 2024, 10:53am

Is it only the discovery service with the timeout problem?
If yes, I would go to the command line and run cmk --debug -vvI hostname to check what happens at discovery time.

peterge · July 1, 2024, 10:54am

Yes, only the discovery service is becoming stale; every other service is running fine. I will post the debug output the next time it occurs.

Rocksalt · July 4, 2024, 7:50am

Hi. Just to say, we experienced this problem recently as well. But cmk --debug -vvI would generally work fine for us, and if we rescheduled that check it would go green again.

We fixed it by doing a “Reschedule Active Checks” on all the Check_MK Discovery services, spread over 60 minutes.

My theory is that maybe the checks all got bunched up together somehow, and then were all trying to run simultaneously. Is that possible?
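
A rough sketch of what "spread over 60 minutes" could look like if done by script instead of through the GUI. It only builds Nagios external command lines (SCHEDULE_FORCED_SVC_CHECK) with evenly spaced check times; the function name, the host list, and the 3600-second window are made up for illustration, and nothing here is confirmed as the way Checkmk itself schedules discovery.

```python
import time


def staggered_discovery_commands(hosts, window_seconds=3600, start=None,
                                 service="Check_MK Discovery"):
    """Build one forced-check command per host, evenly spaced over a window.

    Hypothetical helper: it only returns the command strings and does not
    send anything to the monitoring core.
    """
    if start is None:
        start = int(time.time())
    if not hosts:
        return []
    # Gap between two consecutive check times, in seconds.
    step = window_seconds / len(hosts)
    commands = []
    for index, host in enumerate(hosts):
        check_time = start + int(index * step)
        # Nagios external command format:
        # [entry_time] SCHEDULE_FORCED_SVC_CHECK;host;service;check_time
        commands.append(
            f"[{start}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{check_time}"
        )
    return commands


if __name__ == "__main__":
    # Example host names are placeholders.
    for line in staggered_discovery_commands(["host-a", "host-b", "host-c"]):
        print(line)
```

On a Raw Edition site, lines like these can normally be handed to the core through Livestatus by prefixing each one with COMMAND, but try that with a single host first.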

gawainsr2 · January 31, 2025, 6:32pm

I’ve been seeing this as well. Is there a way to soften the timeout threshold a bit to mitigate this?

jjayhill21 · October 29, 2025, 2:55pm

I too am seeing this with CRE version 2.3.0p36. We’ve recently upgraded from 2.0.0, but we’ve also recently added some 400 Windows hosts on one site, and that site is having the problem.

It seems that all discoveries run at the same time, per schedule. It causes load on the site’s server, but the network is probably the more constrained choke point. So, adding more CPU or memory will probably not work around the problem.

What I really need is to spread the discovery checks out over time. I don’t know how to do that.

jjayhill21 · October 29, 2025, 7:23pm

It looks like "auto_reschedule_checks" in the Nagios configuration will solve my problem.
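
For anyone finding this later, a sketch of the related settings in the Nagios core main configuration. The three option names are standard Nagios Core options; the values shown are only illustrative assumptions, the Nagios documentation describes the feature as experimental, and where the setting belongs inside a Checkmk site is not covered in this thread.

```
# Nagios core main configuration (values are assumptions for illustration)

# Let the core try to spread out upcoming checks
auto_reschedule_checks=1

# Seconds between rescheduling attempts
auto_rescheduling_interval=30

# Look-ahead window in seconds: only checks due within it are rescheduled
auto_rescheduling_window=180
```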