Stale Check_MK Discovery: (Service Check Timed Out)

CMK version: Checkmk Raw Edition 2.2.0p27
OS version: Debian 11

Error message:
Sometimes these services for 20+ hosts show up as CRIT:

If I run “Reschedule Check”, they become OK again.
I do not understand why I have to do this manually, or why they become stale in the first place.
Rebooting the host does not fix this.
Is there any way I can mass reschedule all of them?

I guess the best for me is to upgrade to 2.3?

Just updated to 2.3, and those aren't showing up anymore. I guess this can be marked as solved.


The same is happening on 2.3.0p6 :frowning:

This is not a solution, but a workaround I just found: By displaying the reschedule icon, I can reschedule the services much faster: How to show "Reschedule check" button for all hosts? - #2 by aeckstein

Hi @peterge,

can you please share the server specs of the Checkmk server? (CPU, RAM)
And can you share how many hosts and services you are trying to monitor with that affected Checkmk site?

Thanks in advance!

Norm

We have only one site on the server (Proxmox LXC container with Debian 11.10). These are the specs:

RAM: 4GB
SWAP: 512MB
CPU Cores: 8
Disk: 100GB


Hi @peterge,

Running the Checkmk server inside an LXC container is not recommended and is probably the root cause of the issues you are experiencing.

My suggestion would be to move your Checkmk server to a proper VM, or maybe give this HowTo a try.

Regards
Norm

Why should this be the root cause? It is perfectly fine inside an LXC.
I have some systems running in this configuration.


I had issues in the past with LXC setups. Moving to a VM or Bare-metal always helped in my cases. :slight_smile:

I never tried the HowTo: [any edition] on a (proxmox) LXC container, but maybe it can help.

Weird. We never experienced this problem before upgrading to 2.2.0p27 & 2.3.0p6.
IMHO this can't be a problem caused by LXC…

Is it only the discovery service with the timeout problem?
If yes - I would go to the command line and run cmk --debug -vvI hostname to check what happens at discovery time.

Yes, only the discovery service is becoming stale; every other service is running fine. I will post the debug output the next time it occurs.

Hi. Just to say, we experienced this problem recently as well. But cmk --debug -vvI would generally work fine for us, and if we rescheduled that check it would go green again.

We fixed it by doing a “Reschedule Active Checks” on all the Check_MK Discovery services, spread over 60 minutes.

My theory is that maybe the checks all got bunched up together somehow, and then were all trying to run simultaneously. Is that possible?
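
For anyone who wants to script the "spread over 60 minutes" idea instead of clicking through the GUI, here is a minimal sketch. It builds one Nagios external command (SCHEDULE_FORCED_SVC_CHECK) per host in the form Livestatus accepts, with the check times spaced evenly across a window. The host names are placeholders, and the socket path (tmp/run/live under the site user's home) is an assumption about a typical site layout, so please verify both on your own system before sending anything.

```python
import socket
import time


def staggered_discovery_commands(hosts, window_seconds=3600, now=None,
                                 service="Check_MK Discovery"):
    """Build one Livestatus COMMAND line per host.

    The forced check times are spread evenly over window_seconds,
    starting at `now`, so the discoveries do not all run at once.
    """
    now = int(time.time()) if now is None else int(now)
    if not hosts:
        return []
    step = window_seconds / len(hosts)  # seconds between two consecutive hosts
    commands = []
    for index, host in enumerate(hosts):
        check_time = now + int(index * step)
        commands.append(
            f"COMMAND [{now}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{check_time}"
        )
    return commands


def send_commands(commands, socket_path):
    """Write the command lines to a Livestatus UNIX socket.

    socket_path is an assumption; on an OMD site it is usually
    something like /omd/sites/<site>/tmp/run/live.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        for command in commands:
            # Each command is terminated by a single newline.
            sock.sendall((command + "\n").encode("utf-8"))
```

Usage would be roughly: build the list with staggered_discovery_commands(["host1", "host2"]), print it to check the timestamps, and only then pass it to send_commands as the site user. Note that this only fixes the bunching once; it does not change the regular check interval.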


I’ve been seeing this as well. Is there a way to soften the timeout threshold a bit to mitigate this?

I too am seeing this with CRE version 2.3.0p36. We’ve recently upgraded from 2.0.0, but we’ve also recently added some 400 Windows hosts on one site, and that site is having the problem.

It seems that all discoveries run at the same time, per schedule. It causes load on the site’s server, but the network is probably the more constrained choke point. So, adding more CPU or memory will probably not work around the problem.

What I really need is to spread the discovery checks out over time. I don’t know how to do that.

It looks like "auto_reschedule_checks" in the Nagios configuration will solve my problem.
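
For reference, a sketch of what that could look like. The three option names are standard Nagios Core main-configuration settings (the Nagios documentation marks the feature as experimental), and the values shown are the Nagios defaults. Where exactly the fragment belongs in a Checkmk Raw site is an assumption on my part: I would expect a local file under etc/nagios/nagios.d/ in the site directory, but please check that before relying on it.

```
# Let Nagios periodically re-spread upcoming checks (experimental feature)
auto_reschedule_checks=1
# How often (in seconds) Nagios attempts the rescheduling
auto_rescheduling_interval=30
# How far ahead (in seconds) Nagios looks for checks to smooth out
auto_rescheduling_window=180
```

The core has to be restarted for the change to take effect, and the Nagios documentation warns that this option can degrade performance, so it is worth watching the site load afterwards.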