Backup completed... *BUT* (erroneous) CRIT - maybe time-related bug?

CMK version:
2.0.0p23

OS version:
Ubuntu 20.04.4 LTS

Error message:
Backup completed, it was running for 5 m from 2022-05-15 23:02:01 till 2022-05-15 23:07:04, Size: 210.69 MB, Next run: 2022-05-16 00:02:00CRIT

Every time the checkmk backup runs on our site we get the above status: The backup itself seems to have completed so I wouldn’t expect the CRIT status; viewing backup log (Setup > Maintenance:Backups > icon for “Show current / last state of this backup job”),

2022-05-15 23:02:01 --- Starting backup (Check_MK-ssfdb5-sheffieldsolar-checkmk_backup0 to checkmk_target0) ---
2022-05-15 23:07:04 Verifying backup consistency
2022-05-15 23:07:04 Cleaning up previously completed backup
2022-05-15 23:07:04 --- Backup completed (Duration: 0:05:03, Size: 210.69 MB, IO: 740.66 kB/s) ---

(we can also see the resultant backup file in the configured backup directory) but despite this checkMK leaves the backup in a CRIT alert state.

I notice a few other posts in this forum that may be the same problem; their threads all look to have automatically closed before finding a resolution to the issue posted (such as CRIT - Backup completed or Check_Mk backup daily : it has succed in the backup section but the plugins says it has failed)

The site is set to the Europe/London timezone, currently in BST (UTC+1).

I speculate that the issue is something to do with time arithmetic or comparison. Backups are currently set for 00:02 every day. Note how checkmk’s report describes the backup as running for 5 m from 2022-05-15 23:02:01 till 2022-05-15 23:07:04, but then mentions the Next run: 2022-05-16 00:02:00. The former are the UTC times, the latter is a BST (UTC+1) time. The latter is in fact not the start of the next backup, but the start of the backup just completed (e.g. 2022-05-15 23:02 UTC == 2022-05-16 00:02 BST), which will lead mkbackup to think that the next backup is due to start before ‘now’ (more explanation below).

Looking at the program reporting the backup results (and I believe triggering the CRIT status) in /opt/omd/sites/sitename/share/check_mk/checks/mkbackup, the lines 92-101 look particularly relevant,

    next_run = job_state["next_schedule"]
    if next_run == "disabled":
        yield 1, "Schedule is currently disabled"

    elif next_run is not None:
        if next_run < time.time():
            state = 2
        else:
            state = 0
        yield state, "Next run: %s" % get_timestamp_human_readable(next_run)

so assuming that a state of 2 equates to a CRIT, this can be caused by next_run being set to the time of the just-completed backup rather than the next one. I suggest that this could be why these erroneous CRIT alerts occur. next_run comes from job_state["next_schedule"] but I haven’t fully understood where job_state comes from yet.

I tried digging further and downloaded the source of check_MK but started to get lost reading the code; cmk/utils/schedule.py was looking like it might be relevant.

Update: I’ve found that the server is set to Europe/London in /etc/timezone (so BST) but UTC in $OMD_ROOT/etc/environment, so if check_MK is doing any date-time checks ‘naively’ - i.e. without considering the timezone, or without having normalised date-times to (say) UTC - then this might explain the behaviour in case the dates being compared had different timezones connected to them. If so that would still be a bug.

Any thoughts? Anyone else seeing this issue?

On changing the system timezone to match checkmk’s timezone (/etc/timezone, $OMD_ROOT/etc/environment) the erroneous CRIT alerts have stopped, so that looks to be the root of the problem. It does however suggest a bug in checkmk in that it’s probably handling dates “naïvely”, by not taking into account their attached timezone (or discarding this information), so there’s likely to be other places where this could cause unexpected errors (when comparing datetimes derived from underlying system commands and datetimes derived from checkmk calls). Of course on most systems it’s fair to assume that the timezones configured will match, but since the option exists to specify checkmk’s timezone as being different from the system’s timezone it’s disappointing that this can then trigger bizarre checkmk errors.

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed. Contact an admin if you think this should be re-opened.