Why are there so many stale services?

**CMK version:** 2.3.0p22
**OS version:** RHEL8

Our environment with 1758 hosts shows 24,000+ stale services at any given time. I’m having a hard time understanding if or why this information is relevant, or even accurate.
Isn’t Checkmk supposed to be constantly checking in and getting status on monitored services?
When I click on that number of stale services, it takes me to a compiled listing of servers that presumably have the stale services. Under the “age” column I see dates as far back as 3-6 months ago, and under the “checked” column, time frames of within 7 seconds.

I guess my primary questions are:

  1. Is this list of stale services actually correct at all, or is it just showing inaccurate, old, irrelevant data?
  2. Is there a way to clear this data, or refresh?
  3. How should this data (stale services) be interpreted? Management sees it and thinks omg - why are there 24k+ problems or old data in the overview?

Hi @ddobek

If a service is stale, this means that Checkmk cannot get up-to-date information on this service. Read more here: Basic principles of monitoring with Checkmk - Understanding Checkmk terms

In that case, there is most likely something wrong with your data source (agent or SNMP). So you have to go and see what that problem is. Can the agents be reached? Are the plugins on the agent working? And so on.
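As a rough first pass on the "can agents be reached?" question, something like the following could be run from the Checkmk server. It is only a sketch: it assumes the hosts use the Checkmk agent on its default TCP port 6556, and the function name `agent_port_open` is made up for this example. A successful connect only shows that the port is open, not that the agent returns valid data, and it says nothing about SNMP hosts.

```python
import socket
import sys


def agent_port_open(host: str, port: int = 6556, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    6556 is the default Checkmk agent port; adjust it if your setup differs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Host names are taken from the command line (hypothetical usage).
    for name in sys.argv[1:]:
        state = "reachable" if agent_port_open(name) else "NOT reachable"
        print(f"{name}: agent port {state}")
```

Saved under a name of your choice, it could be called with a handful of the affected hosts as arguments to see whether the problem is plain connectivity or something further down (plugins, agent output).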

  1. Is this list of stale services actually correct at all, or is it just showing inaccurate, old, irrelevant data?

If your Checkmk is configured properly, the list should correctly show the number of currently stale services. The fact that there are 24k of them means there is something seriously wrong somewhere.

  2. Is there a way to clear this data, or refresh?

Well, that should happen automatically once Checkmk receives up-to-date info from the hosts in question. The fact that this is not happening means there is something seriously wrong somewhere.

  3. How should this data (stale services) be interpreted? Management sees it and thinks omg - why are there 24k+ problems or old data in the overview?

They should be interpreted as it says above: Checkmk can’t get up-to-date info. That doesn’t mean that the stale system has a problem - it might be totally fine - Checkmk just can’t see it. It’s like when you’re driving with a foggy windshield. You might be going down the road just fine. But you won’t see the oncoming tree. So you really need to get to the bottom of why these services are stale. Looking at the data source (the Check_MK services on the affected hosts) should give you a first clue…
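To decide which hosts to look at first, it may help to count the stale services per host. With 24k stale services across 1758 hosts, the hosts at the top of such a ranking are probably the ones whose data source is failing entirely. The sketch below assumes you already have the stale services exported as `host_name;description` lines (for example from a Livestatus query on the services table filtered on its staleness value; the exact query is left out here because it depends on your site). The function name `stale_services_per_host` is made up for this example.

```python
from collections import Counter


def stale_services_per_host(lines):
    """Count services per host from 'host_name;description' lines.

    Returns a list of (host, count) tuples, highest count first.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        # Only the part before the first ';' is the host name.
        host, _, _service = line.partition(";")
        counts[host] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Made-up sample data, not taken from a real site.
    sample = [
        "web01;CPU load",
        "web01;Memory",
        "db01;Filesystem /",
    ]
    for host, count in stale_services_per_host(sample):
        print(f"{host}: {count} stale services")
```

If a few hosts account for most of the total, start with the Check_MK service on those hosts.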
