Hosts shown as down although they are up, and agents shown as not registered

CMK version: 2.2.0 p26
OS version: Ubuntu 20.04

Hi!

Since last night we have had a problem with some agents and some hosts.

A colleague created a rule for "Network interface and switch port discovery", which led to service discovery detecting ALL network interfaces under a different name. Since this was not wanted, I disabled the rule and let the system heal itself. We waited several hours, but some problems are still there.

The following issues appeared at the same time the rule above was created (it is now disabled). I don't know how they are related; maybe someone can shed some light on this.

First issue: two hosts are shown as down, but they are not.
CMK shows in its WebGUI for both: No IP packet received for 15.587090 s (deadline is 15.000000 s).

A cmk -vvv hostname shows:

[agent] Success, [piggyback] Success (but no data found for this host), execution time 2.1 sec | execution_time=2.080 user_time=0.010 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=2.070

So no issue there: the cmk output differs from what the web GUI shows. The connection tests also show no issues at all.

Any ideas?

The second issue is that many hosts have been showing the following since the same time:
[agent] Agent controller not registered CRIT, [piggyback] Success (but no data found for this host), execution time 0.0 sec.

But again, cmk -vvv hostname shows that everything is OK:

[agent] Success, [piggyback] Success (but no data found for this host), execution time 2.7 sec | execution_time=2.720 user_time=0.010 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=2.700

Why the difference?

Any idea?

I already rebooted the CMK server itself.

I created a ticket because other issues popped up. I'll share the results later.

We have distributed monitoring.

It turned out that, for whatever reason, the hosts had changed their responsible site to another one (although the GUI configuration still shows the correct values). I noticed this in the history graphs: they had been reset at the first occurrence of this issue. With that in mind, I opened the Main folder and simply saved it. That put everything back in order for now.

The reason is still unknown, but I am continuing to test…

Okay, we got the cause for that behavior.

We have several administrators. However, a few of them have “Authorized sites” set to Specific.

If such an admin now applies config changes that affect more than their own site, everything breaks: many (if not all) hosts switch to that admin's authorized site, even though they are configured to use the main site.

If I (an admin with permissions for all sites) simply save the Main folder (which writes the config to all sites again), everything goes back to normal.
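For anyone who wants to check for the same symptom, here is a minimal Python sketch of the comparison involved. It only illustrates the idea: the two dictionaries, the host names, the site IDs and the function `find_site_mismatches` are all hypothetical, and how you obtain the configured and the effective site per host from your own setup is not covered here.

```python
# Hypothetical data: host name -> site ID. Neither mapping comes from a
# real Checkmk API; fill them from whatever source fits your setup.

def find_site_mismatches(configured, effective):
    """Return {host: (configured_site, effective_site)} for hosts whose sites differ."""
    mismatches = {}
    for host, wanted in sorted(configured.items()):
        actual = effective.get(host)  # None if the host is not monitored anywhere
        if actual != wanted:
            mismatches[host] = (wanted, actual)
    return mismatches


# Example values, made up for illustration only.
configured = {"hostA": "main", "hostB": "main", "hostC": "remote1"}
effective = {"hostA": "remote1", "hostB": "main", "hostC": "remote1"}

for host, (wanted, actual) in find_site_mismatches(configured, effective).items():
    print(f"{host}: configured for {wanted!r}, monitored by {actual!r}")
```

In our case the GUI config corresponds to the "configured" side and still looked correct, so the mismatch only became visible on the "effective" side (reset history graphs, wrong check results).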

These findings have also been reported to Support (SUP-19659).
