We recently upgraded our check_mk from 1.4.0P8 to 1.6.0P11.
Everything is running fine, but I see a strange behaviour: all services are
marked as “This service is stale, no data has been received within the
last 1.5 check periods”, yet the perfdata values and the graphs are fine!
Is this on all your services?
What version do you use - Raw or Enterprise?
If it is the Raw edition and your system is on the larger side, it can take up to 30 minutes after a core restart until all stale services are gone.
Enterprise should fix this after 2-5 minutes.
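To make the "1.5 check periods" in that message concrete, here is a minimal sketch. It assumes staleness is simply the age of the last check result divided by the check interval, and that a service counts as stale once this reaches 1.5. The function names are made up for illustration and are not part of any Checkmk API.

```python
import time

# Threshold taken from the message text ("1.5 check periods").
STALE_THRESHOLD = 1.5


def staleness(last_check, check_interval_s, now=None):
    """Age of the last check result, measured in check periods.

    Assumption: staleness = (now - last_check) / check_interval.
    """
    now = time.time() if now is None else now
    return (now - last_check) / check_interval_s


def is_stale(last_check, check_interval_s, now=None):
    """True once the last result is at least 1.5 check periods old."""
    return staleness(last_check, check_interval_s, now) >= STALE_THRESHOLD


# A service with a 60 second interval, last checked 90 seconds ago:
print(staleness(last_check=0, check_interval_s=60, now=90))  # 1.5
print(is_stale(last_check=0, check_interval_s=60, now=90))   # True
```

Under this assumption, a service can show as stale while its last perfdata and graphs still look fine: the old values are shown, they are just older than expected.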
@andreas-doehler Yes, we are using the Raw edition 1.6.0P11 and we are monitoring 1900+ hosts. If we enable distributed monitoring, will it help us sort out this issue?
The problem is the old Nagios core. If this core runs without restarts, most things are fine, but if you restart it often or it is down for a longer time, the complete schedule becomes invalid and the core needs to reschedule all checks. The default time horizon is, if I remember correctly, around 30 minutes.
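As a rough illustration of why this can take so long, here is a small sketch. It assumes the core spreads the first check of every service evenly over the time horizon; the real scheduling logic is more involved, and the function names are hypothetical. The 30 minute default is the value mentioned above.

```python
def first_check_offsets(num_services, spread_minutes=30.0):
    """Minutes after the restart at which each service gets its first check.

    Assumption: checks are spaced evenly over the spread window.
    """
    step = spread_minutes / num_services
    return [i * step for i in range(num_services)]


def still_waiting(num_services, minutes_since_restart, spread_minutes=30.0):
    """Number of services that have not had their first check yet."""
    offsets = first_check_offsets(num_services, spread_minutes)
    return sum(1 for offset in offsets if offset > minutes_since_restart)


# Four services spread over 30 minutes:
print(first_check_offsets(4))                     # [0.0, 7.5, 15.0, 22.5]
print(still_waiting(4, minutes_since_restart=10)) # 2
```

With this model, the last services only get fresh data shortly before the window ends, which would match stale services lingering for up to 30 minutes.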
Distributed monitoring can also help you in this case, as you then have smaller instances. These smaller instances restart more quickly, config generation takes less time, and so on.
But this is not directly related to the stale problem.
If you monitor around 2k hosts, I would split the system into around 4 or 5 instances, each with 400-500 hosts. This should perform much better, even with one big hardware system underneath.
I have 2 or 3 such systems running multiple big sites on one piece of hardware, and that works better than one big system.
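The sizing above can be checked with a bit of arithmetic. This is only a sketch using the 1900 hosts and the 400-500 hosts per instance mentioned in this thread; the helper names are made up for illustration.

```python
import math


def instances_needed(total_hosts, hosts_per_instance):
    """Smallest number of instances so that none exceeds the given size."""
    return math.ceil(total_hosts / hosts_per_instance)


def split_evenly(total_hosts, instances):
    """Distribute hosts over instances as evenly as possible."""
    base, extra = divmod(total_hosts, instances)
    return [base + 1 if i < extra else base for i in range(instances)]


print(instances_needed(1900, 500))  # 4
print(instances_needed(1900, 400))  # 5
print(split_evenly(1900, 4))        # [475, 475, 475, 475]
```

So for 1900 hosts, 4 instances of 475 hosts each land inside the 400-500 range, and 5 instances would come out at 380 hosts each.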