Service graphs show non-null values while the Host graph shows a downtime

CMK version:
Check_MK version 2.1.0p31 CRE

OS version:
Debian bullseye

We had an incident on a physical host that lasted 30 minutes. The server was still up but seemed to be stalled.
During that period, the Host graph is consistent: RTT is null and packet loss is 100%.
But the service graphs show plateaus of constant non-null values, which is odd and confused our technicians during troubleshooting.

When I check another server that was shut down for maintenance, all graphs show null values during the shutdown.

Is there any explanation?
Should we change something in our configuration? Look somewhere else?
Thanks

Hi @fledorze,

if the Checkmk server can still get data from the agent, that could be your explanation, because in the default setup the Host check has nothing to do with the agent.
By default, the Host check is a ping check, while the Checkmk agent is polled via TCP port 6556.
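
To make the distinction concrete, here is a minimal Python sketch that tests only whether TCP port 6556 accepts a connection, independently of ping. The function name `agent_port_reachable` and the timeout value are illustrative assumptions, not part of Checkmk.

```python
import socket


def agent_port_reachable(host: str, port: int = 6556, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout.

    This only tests that the port accepts connections; it does not read or
    validate any agent output.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Comparing the result of `agent_port_reachable("myhost")` (hypothetical host name) with a ping to the same host shows whether the two kinds of check agree at a given moment.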

Can you maybe share the graphs here to understand your issue better?

Thanks in advance!
Norm

Here are some graph screenshots:
Ping : image

Memory: image

SSH : image

Hi @fledorze,

thanks for the screenshots.

This behavior is completely normal. The services are passive services that get their data from the Checkmk agent. During the downtime, Checkmk was not able to get new data from the agent, so the last known data was presented. But the monitoring shows a cobweb icon to indicate that the services are stale and the data may be outdated.

You can read more about the core principle here:

Checkmk warned about this at multiple points:

  1. The host was down.
  2. The Check_MK service should have been critical because it timed out.
  3. The cobweb icon and the gray services showed that they were stale and had outdated data.
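
The idea of a stale service can be sketched in Python as follows. This is only an illustration under stated assumptions: staleness is taken to be the time since the last check result divided by the check interval, and the `threshold` default of 1.5 is an assumed value used for the example. The function name `is_stale` is hypothetical.

```python
def is_stale(seconds_since_last_result: float,
             check_interval: float,
             threshold: float = 1.5) -> bool:
    """Return True if the last result is older than threshold check intervals.

    Assumption: staleness = age of the last result / check interval.
    """
    if check_interval <= 0:
        raise ValueError("check_interval must be positive")
    return seconds_since_last_result / check_interval >= threshold
```

Under these assumptions, a service checked every 60 seconds whose last result is 30 minutes old is reported as stale, while one updated a minute ago is not.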

Hope this will explain the situation a bit better.

Regards
Norm

Hello Norman
Thanks a lot for your answer.
The logs show exactly what you say:
HOST ALERT and SERVICE ALERT for Check_MK and SSH. OK
I understand your explanation for the passive services provided by the agent.
But it does not apply to the SSH check, which is an active check and yet shows the same plateau.
What do you think?

I ran rrdtool dump on the RRD files, and Checkmk really did fill them with the last known value until the end of the downtime, which I think it should not do.

Hi @fledorze

It is the same for the SSH check. You are correct that it is an active check, but while the host was DOWN, the SSH check was also not able to gather data from it. The service went critical as it should, and no performance data was retrieved that could have updated the graph.

Just putting in a 0 when there is no data would make things even worse in my opinion, because that would be wrong data. Checkmk just uses the last known data until new data is retrieved, and it alerts the user in numerous ways that something should be looked into.
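
The difference between keeping the last known value and recording an unknown value can be compared with a small Python sketch. It only illustrates the two policies and is not Checkmk's or RRDtool's actual code; the function name `fill_series` and the use of `None` for a check that returned no performance data are assumptions.

```python
import math
from typing import List, Optional


def fill_series(samples: List[Optional[float]], carry_forward: bool) -> List[float]:
    """Fill missing samples (None) using one of two policies.

    carry_forward=True  -> repeat the last known value (produces a plateau)
    carry_forward=False -> store NaN, i.e. "unknown" (produces a gap)
    """
    filled: List[float] = []
    last_known = math.nan  # nothing known before the first real sample
    for sample in samples:
        if sample is not None:
            last_known = sample
            filled.append(sample)
        elif carry_forward:
            filled.append(last_known)
        else:
            filled.append(math.nan)
    return filled
```

With the input `[18.0, None, None, 20.0]`, the first policy yields `[18.0, 18.0, 18.0, 20.0]`, while the second leaves the two middle entries as NaN. Neither policy writes a 0.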

What would be your suggestion for this issue? :slight_smile:

Thanks in advance!
Norm

Indeed, in the RRD file it should be NaN; see the example below. That is the behaviour I experienced with Naemon, a Nagios fork.
As explained in the RRDtool rrdinfo documentation, NaN stands for UNKNOWN data, which is the truth.

                    <!-- 2023-10-03 22:00:00 CEST / 1696363200 --> <row><v>1.8000000000e+01</v></row>
                    <!-- 2023-10-03 22:30:00 CEST / 1696365000 --> <row><v>1.8000000000e+01</v></row>
                    <!-- 2023-10-03 23:00:00 CEST / 1696366800 --> <row><v>1.8000000000e+01</v></row>
                    <!-- 2023-10-03 23:30:00 CEST / 1696368600 --> <row><v>1.8000000000e+01</v></row>
                    <!-- 2023-10-04 00:00:00 CEST / 1696370400 --> <row><v>NaN</v></row>
                    <!-- 2023-10-04 00:30:00 CEST / 1696372200 --> <row><v>NaN</v></row>
                    <!-- 2023-10-04 01:00:00 CEST / 1696374000 --> <row><v>NaN</v></row>
                    <!-- 2023-10-04 01:30:00 CEST / 1696375800 --> <row><v>NaN</v></row>
                    <!-- 2023-10-04 02:00:00 CEST / 1696377600 --> <row><v>NaN</v></row>
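
Rows like the ones above can be inspected with a short Python sketch. It assumes the output of rrdtool dump is available as text and that each data row has the comment-plus-`<row>` layout shown in the excerpt. The names `parse_rows` and `longest_constant_run` are hypothetical helpers written for this illustration.

```python
import math
import re
from typing import List, Tuple

# One archive row: "<!-- <date> / <epoch> --> <row><v>..</v>...</row>"
ROW_RE = re.compile(
    r"<!--\s*(?P<label>.*?)\s*/\s*(?P<ts>\d+)\s*-->\s*<row>(?P<cells>.*?)</row>"
)
VALUE_RE = re.compile(r"<v>\s*([^<\s]+)\s*</v>")


def parse_rows(dump_text: str) -> List[Tuple[int, List[float]]]:
    """Return (epoch timestamp, values) for every data row in the dump text."""
    rows = []
    for match in ROW_RE.finditer(dump_text):
        # float() accepts both "1.8000000000e+01" and "NaN".
        values = [float(v) for v in VALUE_RE.findall(match.group("cells"))]
        rows.append((int(match.group("ts")), values))
    return rows


def longest_constant_run(rows: List[Tuple[int, List[float]]], column: int = 0) -> int:
    """Length of the longest run of identical, non-NaN values in one column."""
    best = current = 0
    previous = None
    for _, values in rows:
        value = values[column]
        if math.isnan(value):
            # An unknown value ends any plateau.
            current = 0
            previous = None
            continue
        current = current + 1 if previous is not None and value == previous else 1
        previous = value
        best = max(best, current)
    return best
```

A long constant run that coincides with a known outage window suggests the archive holds repeated values rather than NaN for that period. Counting rows where `math.isnan` is true gives the number of unknown samples.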
