All services of one host stale

Hi, I’m still evaluating checkmk, so I’m adding more and more hosts right now. One thing occurred today: a production server (MS Server 2019) lists all services (e.g. CPU, Disk IO) as stale in the web GUI. The exceptions are ‘Check_MK’ and ‘Check_MK Discovery’, which work fine.

After executing ‘cmk -v >>fqdn<<’ on my checkmk server, all services are up to date, also in the web GUI. They then go stale again under the regular automatic check interval.

On a side note: in my test environment I’m already monitoring a host with a more or less identical configuration and the same cmk config, and there checkmk works as intended.

Any idea why this is happening?

Thanks in advance!

Are you working on the CRE or the CEE version? Stale means that the services couldn’t be checked within your defined check period.
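
To make that definition concrete, here is a minimal Python sketch that treats staleness as the age of the last check result divided by the check interval. The function names, the 60-second interval and the 1.5 threshold are assumptions for illustration (1.5 is, as far as I know, the default of the global setting that marks services as stale); this is not checkmk's own code.

```python
def staleness(seconds_since_last_check: float, check_interval: float = 60.0) -> float:
    """Age of the last check result, expressed in check intervals."""
    if check_interval <= 0:
        raise ValueError("check_interval must be positive")
    return seconds_since_last_check / check_interval


def is_stale(seconds_since_last_check: float,
             check_interval: float = 60.0,
             threshold: float = 1.5) -> bool:
    """True if the result is older than `threshold` check intervals.

    The threshold of 1.5 is an assumed default, see the note above.
    """
    return staleness(seconds_since_last_check, check_interval) >= threshold


if __name__ == "__main__":
    # A result that is 45 s old with a 60 s interval is still fresh,
    # one that is 8 minutes old is far beyond the threshold.
    for age in (45, 90, 480):
        print(age, staleness(age), is_stale(age))
```

With these assumptions a service only shows up as stale once no new result has arrived for one and a half intervals, so a single slightly late check does not trigger it.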

If you are on CEE you can activate the micro core statistics and check the latencies and helper usage. If these values are high, you probably need to adjust the number of check helpers in the global settings.

Another possible cause is that fetching and parsing the agent output takes longer than your configured check period. This can happen on systems with a large output and a large number of services to check.
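
One way to check this is to measure how long the agent dump takes and how large it is. The following Python sketch is only an illustration: `measure_command` and `exceeds_interval` are hypothetical helper names, the 60-second interval is an assumption, and the host name in the usage note is a placeholder. It relies on `cmk -d HOSTNAME` dumping the raw agent output when run as the site user.

```python
import subprocess
import time


def measure_command(cmd):
    """Run `cmd` and return (elapsed_seconds, output_size_in_bytes)."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, check=True)
    elapsed = time.monotonic() - start
    return elapsed, len(result.stdout)


def exceeds_interval(elapsed_seconds, check_interval_seconds=60.0):
    """True if fetching alone already takes longer than one check interval."""
    return elapsed_seconds > check_interval_seconds
```

Run as the site user, something like `measure_command(["cmk", "-d", "myhost.example.com"])` returns the fetch time and the size of the agent output. If the time comes close to the check interval, or the output is unexpectedly large, the agent output is a likely suspect.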

Hi, thanks! I forgot: I’m working on CRE. I will keep your advice in mind, but for now the problem seems “resolved”. As a first approach, I removed the newly added host via WATO and added it again. Now everything works fine. The only thing that worries me about this approach: if the error reoccurs, deleting all the monitoring data we gathered up to that point is definitely not an option.

Is it possible that there was some sort of hiccup in the config after I added the host via WATO (Web GUI)?

How long the historical data of this host is kept depends on how you configured the housekeeping in the global settings. So you don’t lose the already gathered data if you recreate a host after such a problem.

The CRE works with the classic Nagios core, which isn’t as performant as the CMC. On CEE the system spawns helper processes that help the core parse monitoring data and perform active checks, which addresses performance problems on larger systems and on hosts with a large number of services.

On a Windows system this staleness could be related to the Microsoft event log, which can transfer a large amount of data within a check period. In most cases it’s a good idea to deactivate such checks for irrelevant information.

Thanks for the tip. How exactly do I set this? In ‘automatic disk space cleanup’ I only find the following information:

During monitoring there are several dedicated files created for each host. There are, for example, the discovered services, performance data and different temporary files created. During deletion of a host, these files are normally deleted. But there are cases, where the files are left on the disk until manual deletion, for example if you move a host from one site to another or deleting a host manually from the configuration.
The performance data (RRDs) and HW/SW inventory archive are never deleted during host deletion. They are only deleted automatically when you enable this option and after the configured period.

This is the housekeeping I mentioned. It means that, after the time you have configured, the system deletes the historical data of a deleted host, such as performance data, inventory data, cached information, SNMP walks, etc.

The explanation seems a bit confusing to me, because the performance data is kept after normal deletion of a host.

Thank you!

An observation, in addition to what I previously wrote: every time a new host gets added, a lot of services on already successfully monitored hosts randomly go stale for some time. My wild guess: could adding new hosts cause problems with the host configuration on the checkmk server?!

Ah, when you activate your changes in WATO, your monitoring core is restarted. During this time no service can be checked, and the system tries to run all overdue checks once the core is online again. Depending on how long the core needs for the restart, the check period of your services expires and the services become stale. This only lasts for a short time, until all services have been checked within one or two check periods.
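
As a rough illustration of that effect, the following Python sketch counts how many services have crossed the stale threshold by the time the core is back. It is a simplified model, not checkmk's scheduler: the function name, the 60-second interval and the 1.5 threshold are assumptions, and it ignores the extra time needed to work through the backlog of overdue checks.

```python
def stale_after_restart(last_check_ages, downtime, interval=60.0, threshold=1.5):
    """Count services that are stale when the core comes back online.

    last_check_ages: seconds since each service's last check at the moment
                     the core stops.
    downtime:        seconds the core is unavailable during the restart.
    """
    return sum(
        1
        for age in last_check_ages
        if (age + downtime) / interval >= threshold
    )


if __name__ == "__main__":
    # Four services whose last checks were spread evenly over the last minute.
    ages = [0, 15, 30, 45]
    for downtime in (10, 45, 90):
        print(downtime, stale_after_restart(ages, downtime))
```

In this model a short restart leaves most services fresh, while a restart of 90 seconds is already enough to push every service with a 60-second interval past the threshold until its next check has run.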

Hi, yeah, exactly. But sometimes it’s up to 8 min and more. Earlier today some services of an unaffected host went stale for as long as it took my colleague to manually(!) start a rescheduled check of the check_mk service on the newly added host.