Why do I have a question mark for certain service notifications?

CMK version: 2.4.0p14
OS version: Debian 12

Hello,

Why do I have some service checks showing a question mark inside the event field?

This service check isn’t new, as you can see in the graph. I’m not sure what to investigate to resolve this issue. All I can say is that I’ve been experiencing this problem since upgrading to version 2.4 and changing the option `maximum number of check attempts for service` from 1 to 3.

Here is the log from notify.log:

2025-11-05 01:05:56,046 [20] [cmk.base.events] Previous service hard state not known. Allowing all states.

I have this problem on multiple instances of Checkmk running the same version and config, and it appears for the same service check: filesystem /var/log.

Hi.

Looks like this is the first status change that raises this alarm. The log says there is no previous service state.

RG, Christian

Hi @ChristianM

Unfortunately, I receive this notification every day:

These are all the notifications from the last two days on the same host.

The service goes critical at 9 a.m., then changes from critical to warning at 12 p.m. due to log rotation. I don’t understand why Checkmk is unable to retrieve the previous status.

Do you use RAW edition with Nagios core or Enterprise with CMC core?

If it is the Nagios core, it can be related to state files not being saved correctly, so that the core does not know the previous state.

I’m using the Raw edition, so with the Nagios core.

How can I check if the previous state is saved? Where is this file?

Cheers

Another example on a different host, for another service:

There is definitely something going wrong :frowning:

The file is the Nagios state file that is saved at clean restarts, at log file rollover, and from time to time.
At these times a complete state dump happens for all hosts and services. If it is a hard state, then this is normally the hard state used by the system.

The configuration of this behavior is done inside the “retention.cfg” file, but if no one has touched it before, it should have the right configuration entries.
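For reference, state retention in the Nagios core is controlled by a few directives in the main `nagios.cfg` (the values below are the usual defaults; the exact file path and retention file name depend on your site layout, so treat this as an illustrative fragment rather than your actual config):

```cfg
# Fragment of nagios.cfg -- state retention settings (illustrative defaults)

# Save host/service state across restarts and write periodic state dumps
retain_state_information=1

# File the state dump is written to (path varies per installation)
state_retention_file=retention.dat

# Minutes between automatic state dumps while the core is running
retention_update_interval=60
```

If `retain_state_information` were 0, the core would forget all previous states at every restart, which would also produce notifications without a known previous hard state.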

What can also happen is that the state changed from OK → WARN → CRIT but had no hard state at WARN, so at the notification for CRIT you don’t get a previous hard state. It can also happen if the service was flapping before, I think.


Bingo, I can confirm that this is indeed the problem.

Let me summarize the situation:

When `maximum number of check attempts for service` is set higher than 1, I am able to reproduce this issue on any instance (using version 2.4; I haven’t tested older versions yet).

For services with warning and critical thresholds, if the state switches from OK → WARN → CRITICAL without having a HARD state recorded for each, this problem occurs.

However, is this the intended behavior? In my opinion, the notification should show OK → CRIT, i.e., displaying the last two known HARD states.

Any updates please ?

still searching for help :frowning:

I checked on one of my RAW systems and saw that the whole problem with the last hard state is a little bit more complicated, as the value is calculated inside the notification system. The Nagios core does not directly report such a value.

The CMC core sends these macros directly. For Nagios they are calculated in this function:

        if (
            enriched_context["WHAT"] == "SERVICE"
            and "PREVIOUSSERVICEHARDSTATE" not in enriched_context
        ):
            prev_state = enriched_context["LASTSERVICESTATE"]
            if prev_state == enriched_context["SERVICESTATE"]:
                prev_state = "OK"
            elif "SERVICEATTEMPT" not in enriched_context or (
                "SERVICEATTEMPT" in enriched_context and enriched_context["SERVICEATTEMPT"] != "1"
            ):
                if raw_context["SERVICESTATE"] != "OK":
                    prev_state = "?"
                logger.info("Previous service hard state not known. Allowing all states.")
            enriched_context["PREVIOUSSERVICEHARDSTATE"] = prev_state

And here I can already see the bug for the ? (UNKN) service state.

The code part that sets `prev_state` to ? makes no sense. But the problem is that the Nagios core does not send any information about the last state if you use multiple check attempts. Here it is better to notify directly at the first attempt and then use the “delay service notification” rule to define a delay, like it was done before with the check attempts.
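For what it’s worth, the fallback quoted above can be reproduced in isolation. This is a simplified, hypothetical re-implementation for illustration only, not the actual Checkmk module:

```python
# Simplified sketch of the PREVIOUSSERVICEHARDSTATE fallback quoted above
# (illustrative only; the real logic lives in Checkmk's notification code).

def previous_hard_state(context: dict) -> str:
    """Derive a previous hard state when the Nagios core did not supply one."""
    prev_state = context["LASTSERVICESTATE"]
    if prev_state == context["SERVICESTATE"]:
        # No state change at all: assume the previous hard state was OK.
        return "OK"
    if context.get("SERVICEATTEMPT") != "1":
        # Not the first check attempt (or attempt unknown): with multiple
        # check attempts the last *hard* state is unknown, so the code
        # falls back to "?" for any non-OK current state.
        if context["SERVICESTATE"] != "OK":
            return "?"
    return prev_state

# OK -> WARN -> CRIT with max check attempts 3: the CRIT notification fires
# on attempt 3, LASTSERVICESTATE is WARN, and the result is "?".
print(previous_hard_state(
    {"SERVICESTATE": "CRIT", "LASTSERVICESTATE": "WARN", "SERVICEATTEMPT": "3"}
))  # prints "?"
```

With a single check attempt (`SERVICEATTEMPT` == "1") the same function simply returns `LASTSERVICESTATE`, which is why the problem only shows up once you raise the check attempts above 1.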

Overall it is complicated :wink:

Thanks for your answer , @andreas-doehler

So if I understand correctly, it is better to configure “delay service notification” instead of using check attempts with soft states?

I could test this, but in that case it makes this feature useless from my point of view. In which situations is it better to use delay service notification rather than check attempts?

cheers

Normally I would prefer the check attempts, as I can also filter dashboards and views for hard and soft states with this method.

But for your use case the delay service notification is better.

In the end I would prefer some other approach here: change the code to something other than the “?”. For example, if the state is now critical, automatically select WARN as the previous state; if it is WARN, select OK.
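As a sketch of that idea (purely illustrative; the state names and the mapping are my assumption, not Checkmk code):

```python
# Hypothetical replacement for the "?" fallback discussed above: when the
# real previous hard state is unknown, assume the state one severity below
# the current one instead of reporting "?".

ASSUMED_PREVIOUS = {
    "CRIT": "WARN",     # a service rarely jumps straight from OK to CRIT
    "WARN": "OK",
    "UNKNOWN": "OK",
}

def assumed_previous_hard_state(current_state: str) -> str:
    # Fall back to "OK" for any state not in the mapping (including "OK").
    return ASSUMED_PREVIOUS.get(current_state, "OK")

print(assumed_previous_hard_state("CRIT"))  # prints "WARN"
```

The trade-off is that the notification would then show a plausible but possibly wrong transition (e.g. WARN → CRIT when the service actually went OK → CRIT), instead of an honest “unknown”.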
