CheckMK Raw Host & Service Check Timeouts

CMK version: 2.0.0p18 (CRE)
OS version: Debian 10

Hello guys,

We have general performance problems with our CheckMK server.

Regularly, about every 3 minutes, many hosts are shown as Down - Flapping (Host Check Timed Out), see attached picture. As a result, the system is not really usable.
This then causes CPU spikes on the server and sometimes even leads to a Nagios crash:
“Caught SIGSEGV, shutting down…”

The configuration has been changed over the years, so we suspect a misconfiguration.

Could someone please advise us whether these settings are okay or problematic?
What would be best practice?

Hardware of the Linux VM:
Debian 10
12 vCPU
12 GB RAM

Our configuration is as follows:

/etc/nagios/nagios.cfg
service_check_timeout=120

WATO Global Settings
Agent TCP connect timeout: 30
Staleness value to mark hosts / services stale: 30
Apache process tuning: 96

Rules
Maximum number of check attempts for host: 3
Normal check interval for host checks: 3 minutes
Retry check interval for host checks: 2 minutes

Maximum number of check attempts for service: 3
Normal check interval for service checks: 5 minutes
Retry check interval for service checks: 2 minutes

Timing settings for SNMP access
Timeout: 45 sec
Retries: 2
Fetch intervals for SNMP sections: not set

Periodic service discovery: Do not perform periodic service discovery check
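
One thing that may be worth checking in the values above is how the SNMP timing relates to the Nagios timeout. The following is a rough sketch, not a statement about Checkmk internals. It assumes that the retries are counted in addition to the first attempt and that every attempt waits for the full timeout; the function name is made up for illustration.

```python
def worst_case_snmp_query_seconds(timeout_s: float, retries: int) -> float:
    """Upper bound for one unanswered SNMP query.

    Assumption: the first attempt plus each retry waits a full timeout.
    """
    return timeout_s * (retries + 1)


SNMP_TIMEOUT_S = 45             # "Timeout: 45 sec" from the settings above
SNMP_RETRIES = 2                # "Retries: 2" from the settings above
SERVICE_CHECK_TIMEOUT_S = 120   # service_check_timeout from nagios.cfg above

worst = worst_case_snmp_query_seconds(SNMP_TIMEOUT_S, SNMP_RETRIES)
print(f"worst case for one unanswered SNMP query: {worst:.0f} s")
print(f"longer than service_check_timeout: {worst > SERVICE_CHECK_TIMEOUT_S}")
```

If the assumption holds, a single unresponsive SNMP device could keep a check busy for longer than Nagios is willing to wait for it.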

I think these are the important settings. If there is something else we should look at, please let me know.

Best regards
Tobias


Hello,
a CEE version may help in all aspects of performance and tuning :)

Is the hard disk type SSD?

12 GB RAM. Is it possible to allocate more RAM?

Ralf


Hello Ralf,

thanks for the fast answer. Yes, we know that the CEE performs much better than the CRE.

But we don't expect too much from the CRE. We just want it to be usable. It did very well in the past, but at the moment we get many “Host Check Timed Out” results and don't know where they come from. It is not easy to recognize a real host problem when these timeouts occur all day.

Chances are high that there is some bad configuration.

  1. Hard disk type is not SSD; we use 10k HDDs for the system (VM).
    At the moment it is not possible to move to SSDs.

  2. I have now increased the RAM to 18 GB, but I guess this is not the problem, because usually only around 6 GB are in use.

Regards

Ok,
is the host system ok?
No other VMs with high load or similar problems?
No messages about disk problems in the controller tools or the hypervisor?
If everything is fine, you should look at your checks in a little more detail.
Perhaps a reorganisation of the checks may help.
SNMP, for example, is often a problem.
Ralf

The host system is fine. No other VMs with the same problems.
No disk problems either.

Yes, we need to look into the checks.

And how about our config settings? Are they fine?

Can we set something like this in nagios.cfg?
host_check_timeout=60

Will this be effective? Everyone is always talking about the service check timeout but what about the host check timeout?
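
For reference, Nagios Core does have a `host_check_timeout` directive in its main configuration file, next to `service_check_timeout`. A minimal sketch for seeing which of the two are set in a given file could look like this. It works on a text fragment rather than a real path, and both the helper name and the sample fragment are made up for illustration.

```python
def read_timeouts(cfg_text: str) -> dict[str, int]:
    """Collect the two check timeout directives from 'key=value' style text."""
    wanted = ("host_check_timeout", "service_check_timeout")
    timeouts: dict[str, int] = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in wanted:
            timeouts[key] = int(value)
    return timeouts


# Illustrative fragment only, not the full file from this thread.
sample = """
# timeouts in seconds
service_check_timeout=120
host_check_timeout=60
"""

print(read_timeouts(sample))
```

A directive that is missing from the result is simply not set in the text that was passed in, so the core would fall back to its built-in default for it.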

By the way: nearly all hosts time out at the same time, not specific ones. It is a mix of normal Windows and Linux hosts and SNMP devices.

Regards
Tobias


The number of hosts and services has hit the limits of Nagios. You have these problems because of all the forking Nagios has to do. If you cannot migrate to CEE, you should increase the check intervals.

An alternative would be to split the site into multiple sites (even on the same server). This way you have multiple Nagios cores running each with a lower number of hosts to check. This is quite easy with checkmk’s distributed monitoring.
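
To get a feeling for such a split, here is a small planning sketch. It only distributes a list of host names evenly across site names, so that each Nagios core would have fewer hosts to check. The host and site names are hypothetical, and the actual assignment of hosts to sites is done in the distributed monitoring setup, not by this code.

```python
def split_into_sites(hosts: list[str], site_names: list[str]) -> dict[str, list[str]]:
    """Distribute hosts round-robin so every site gets a similar share."""
    sites: dict[str, list[str]] = {name: [] for name in site_names}
    for index, host in enumerate(sorted(hosts)):
        target = site_names[index % len(site_names)]
        sites[target].append(host)
    return sites


# Hypothetical inventory: ten hosts, three sites on the same server.
hosts = [f"host{n:03d}" for n in range(1, 11)]
plan = split_into_sites(hosts, ["site_a", "site_b", "site_c"])

for site, members in plan.items():
    print(site, len(members), members)
```

In practice one would probably group by location or device type (for example all SNMP devices in one site) instead of round-robin, but the effect on the number of hosts per core is the same.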


Thanks @r.sander, these are also good options to address this problem.

For now we have just adjusted our configuration, and the monitoring is running fine and stable again.
No more host flapping and no CPU spikes.

If someone has the same problem, this is our current configuration, which fixed the problems:

/etc/nagios/nagios.cfg
service_check_timeout=150
host_check_timeout=120

WATO Global Settings
Agent TCP connect timeout: 30
Staleness value to mark hosts / services stale: 30
Apache process tuning: 96

Rules
Maximum number of check attempts for host: 6
Normal check interval for host checks: 5 minutes
Retry check interval for host checks: not set (Default value)

Maximum number of check attempts for service: 3
Normal check interval for service checks: 10 minutes
Retry check interval for service checks: 2 minutes

Timing settings for SNMP access
Timeout: 45 sec
Retries: 2
Fetch intervals for SNMP sections: 120 minutes

Periodic service discovery: Do not perform periodic service discovery check

Ping and host check parameters:
Ping

This was adjusted as recommended by the Nagios Tuning Guide (item 11):
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html
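
The effect of the longer intervals on the number of check executions can be estimated with a bit of arithmetic. This is only a sketch: the host and service counts are hypothetical placeholders (the real numbers are not given in this thread), and it assumes that checks are spread evenly over their interval and ignores retries.

```python
def executions_per_minute(check_count: int, interval_min: float) -> float:
    """Average executions per minute if checks are spread evenly over the interval."""
    return check_count / interval_min


HOSTS = 1000            # hypothetical number of hosts
ACTIVE_SERVICES = 1000  # hypothetical number of actively executed service checks

# Old intervals: host checks every 3 minutes, service checks every 5 minutes.
old_rate = executions_per_minute(HOSTS, 3) + executions_per_minute(ACTIVE_SERVICES, 5)
# New intervals: host checks every 5 minutes, service checks every 10 minutes.
new_rate = executions_per_minute(HOSTS, 5) + executions_per_minute(ACTIVE_SERVICES, 10)

reduction = 1 - new_rate / old_rate
print(f"old: {old_rate:.0f}/min, new: {new_rate:.0f}/min, reduction: {reduction:.1%}")
```

With equal numbers of hosts and active service checks, the new intervals cut the number of executions, and with it the forking the core has to do, by a bit less than half. The exact share depends on the real ratio of hosts to active checks.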

This is what the difference in CPU utilization of the CheckMK Debian VM looks like:

