CheckMK Raw 2.0.0p9 (CRE) - Activating Changes Causes Hosts To Show As Down

AnthonyWingerter · September 1, 2021, 10:18pm

I’m a bit stuck here. Almost any time we activate changes in CMK Raw 2.0.0p9, a number of hosts show as down with a “Null” summary. The number of affected hosts ranges from 1 to 100+.

I’ve not found any performance or configuration issues on the monitoring sites/hosts.
We have 6 total sites with around 1400 total hosts.

The agents on all hosts have been updated to match the server version.

Any idea where I should start looking? I would appreciate any help.

andreas-doehler · September 2, 2021, 6:31am

There are some points you can check.
Do the problem hosts belong to the same site every time, or are they distributed over your complete infrastructure?
Next check: what is used as the host status? Is it a normal ping, or do you use the status of the Check_MK service or something else?
As I see no metric icon on some of the hosts, I think the host check command is the problem.
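
One way to answer the host check command question for many hosts at once is to ask Livestatus directly. The following is a minimal sketch, assuming Livestatus is reachable through the site's Unix socket; the socket path in the comment, the site name and the sample command names are assumptions for illustration, while `name` and `check_command` are columns of the Livestatus `hosts` table.

```python
import json
import socket
from collections import Counter


def build_query(table, columns, filters=()):
    """Build a Livestatus query string that asks for JSON output."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {f}" for f in filters]
    lines.append("OutputFormat: json")
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def run_query(socket_path, query):
    """Send the query to the Livestatus Unix socket and decode the JSON rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the query is complete
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def count_check_commands(rows):
    """Count how many hosts use each host check command."""
    return Counter(check_command for _name, check_command in rows)


# Hypothetical usage on a site called "mysite" (path is an assumption):
# rows = run_query("/omd/sites/mysite/tmp/run/live",
#                  build_query("hosts", ["name", "check_command"]))
# print(count_check_commands(rows).most_common())
```

If most of the affected hosts turn out to share a check command that is based on a service state rather than ping, that would fit the suspicion above.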

dns_es · September 2, 2021, 8:47am

Hi @andreas-doehler, what is best practice or recommended?

andreas-doehler · September 2, 2021, 9:51am

I would say there is no real best practice. The problem I saw on some 2.0 installations was that a Check_MK service goes critical with output “null” if it is checked exactly at activation time.
At the next check interval it is OK again. If you use the status of the Check_MK service as the host status, then your host also has a chance of going critical when you activate changes. You could also say: more hosts means a higher chance of having some critical hosts.

I only use service states as host states if I cannot ping the host.
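
The "more hosts, higher chance" point can be made concrete with a rough back-of-the-envelope model. This is only a sketch under stated assumptions: check times are spread evenly over the check interval, hosts are independent of each other, and the length of the activation window is a guess. None of these numbers come from Checkmk itself.

```python
def chance_of_any_hit(num_hosts, window_s, interval_s):
    """Probability that at least one host is checked during the activation window.

    Assumes each host's check falls inside the window with probability
    window_s / interval_s, independently of all other hosts.
    """
    p = min(1.0, window_s / interval_s)
    return 1.0 - (1.0 - p) ** num_hosts


def expected_hits(num_hosts, window_s, interval_s):
    """Expected number of hosts whose check falls inside the activation window."""
    return num_hosts * min(1.0, window_s / interval_s)


# Assumed values: 60 s check interval, 3 s activation window.
for hosts in (30, 500, 1400):
    print(hosts,
          round(chance_of_any_hit(hosts, 3, 60), 3),
          round(expected_hits(hosts, 3, 60), 1))
```

Under these assumptions the expected number of affected hosts grows linearly with the host count, which is consistent with small installations rarely noticing the problem.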

jairelee · September 2, 2021, 3:58pm

Hi,
I have one main server and six other Check MK proxies for different regions.
I have just tested this, and the (null) issue also affects the other monitoring proxies.
When I apply a change that affects devices monitored by other proxy nodes, I also get the null summary for devices from that remote proxy.
If I apply changes affecting my main server and devices from other proxies, I see this problem for devices from the main server and the affected proxies.
However, I use ping to monitor the host status, so I’m not getting new host down alerts, just the Check_MK service summary (null) issue.
I use Check MK 2.0.0p9 CRE.
Regards.

andreas-doehler · September 2, 2021, 7:39pm

That’s the same as on the system where I first saw it.
It was also p9 CRE.

kavra · September 6, 2021, 5:58am

Good morning, we migrated from 1.6.0p19 to 2.0.0p4 and we also have this problem (the hosts are all checked by ping).
Currently we run a master and 4 slaves, all on CRE version 2.0.0p8, and the problem continues. It is quite annoying.

andreas-doehler · September 6, 2021, 6:33am

What problem exactly? If you use ping you should not have the host down problem, only the Check_MK service shown with output “null”, right?

kavra · September 6, 2021, 6:42am

Hello, the problem is exactly the same as @AnthonyWingerter’s: when applying changes from the MASTER, some hosts of the SLAVE (not all, it is random) remain in null until the check runs again. All hosts are checked by ping.
This never happened in 1.6; it has been happening since we migrated the infrastructure to 2.0. It is not serious, since it does not trigger notifications and the next check is OK again, but it is quite annoying.

geppo · September 6, 2021, 1:58pm

Hi, I have the same problem with 2.0.0p9 (it never happened with our 1.6 deployment). The only difference is that it’s not a “host down” alarm but a “service critical” alarm.
I ran many tests while watching htop output, and I can confirm that the null outputs appear when activation overlaps with host and service checks.
In our environment, “periodic/bulk service discovery” also triggers these errors.

Note that, at the beginning, we had only one site (500 hosts / 6000 services) and we decided to split it into a multisite environment because of the “null issue”. In that deployment, however, it was triggered by “periodic service discovery”, not by change activation.

Now, in the distributed multisite deployment (4 sites with 8/8/8/16 cores), the issue is triggered mainly by change activation rather than by “periodic service discovery”.
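
For anyone who wants to check the overlap on their own site without watching htop, here is a minimal sketch. It assumes you have already exported service rows of the form (host_name, description, last_check, plugin_output), for example from the Livestatus `services` table, and that you noted the start and end time of the activation yourself. The function names and the sample timestamps are hypothetical.

```python
def checks_in_window(rows, activation_start, activation_end):
    """Return (host, service) pairs whose last check ran during the activation."""
    return [
        (host, description)
        for host, description, last_check, _output in rows
        if activation_start <= last_check <= activation_end
    ]


def null_overlap_ratio(rows, activation_start, activation_end):
    """Fraction of services with 'null' output that were checked during the activation.

    Returns None if no service currently shows a 'null' output.
    """
    null_rows = [row for row in rows if "null" in row[3].lower()]
    if not null_rows:
        return None
    inside = [row for row in null_rows
              if activation_start <= row[2] <= activation_end]
    return len(inside) / len(null_rows)


# Made-up sample data: timestamps are Unix seconds.
sample = [
    ("host1", "Check_MK", 1000, "(null)"),
    ("host2", "Check_MK", 1050, "OK - execution time 1.2 sec"),
    ("host3", "Check_MK", 1004, "(null)"),
]
print(checks_in_window(sample, 995, 1010))
print(null_overlap_ratio(sample, 995, 1010))
```

A ratio close to 1.0 would support the observation that the null outputs come from checks that ran while the activation was in progress.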

kavra · September 6, 2021, 2:21pm

Sorry, and thanks @geppo! For me it is also the service that goes CRIT.

AnthonyWingerter · September 7, 2021, 2:03pm

Thanks all. This issue also did not occur for us in v1.6.
This issue also triggers host down notifications for us.

This seems to be a bug in the current version.
Is there a way to open a bug report with the CMK team for this issue?

Thanks and regards,
-Anthony-

dns_es · September 13, 2021, 9:00am

Report a bug…

by mail: feedback@checkmk.com

AnthonyWingerter · September 16, 2021, 1:53pm

Thanks! I’ve submitted a bug report via email today.

geppo · September 17, 2021, 5:31pm

Great! If you get any feedback, please let all of us know!

harikk81 · November 3, 2021, 10:26am

I’m facing the exact same issue after upgrading from 1.6 to 2.0. Did any of you have any luck figuring it out?

AnthonyWingerter · November 4, 2021, 7:19pm

We are still facing the issue in 2.0.0p9 CRE. I reported the issue, but never got any type of response. I’m hoping this is addressed in a newer release. 2.0.0p15 was released today, and I’m hoping to upgrade soon to determine whether this issue has been resolved.

nugzarg · November 8, 2021, 5:09pm

I reported this issue as soon as the first stable version of 2.0 was released. I never received any kind of answer to this report. In the current check_mk RAW version (2.0.0p15) the issue still exists. It seems that the check_mk guys are not interested in releasing a stable free version that can be used to monitor infrastructure with more than 30-40 hosts. In other words, if you have to monitor infrastructure with more than 30-40 hosts (and you don’t want to use an alternative), you have two options:

Buy paid version
Use old stable version 1.6

geppo · January 19, 2022, 10:37am

It is still present in 2.0.0p18, which has also added a big problem in distributed environments (there is a problem in TLS encryption between servers that slows down the GUI).

gulaschcowboy · February 7, 2022, 12:36pm

Hi @geppo. “there is a problem in TLS encryption between servers that slows down the GUI” - this should be fixed in p19.