CMK version:
2.0.0p23
OS version:
Ubuntu 20.04.4 LTS
Error message:
Backup completed, it was running for 5 m from 2022-05-15 23:02:01 till 2022-05-15 23:07:04, Size: 210.69 MB, Next run: 2022-05-16 00:02:00CRIT
Every time the checkmk backup runs on our site we get the above status: The backup itself seems to have completed so I wouldn’t expect the CRIT status; viewing backup log (Setup > Maintenance:Backups > icon for “Show current / last state of this backup job”),
2022-05-15 23:02:01 --- Starting backup (Check_MK-ssfdb5-sheffieldsolar-checkmk_backup0 to checkmk_target0) ---
2022-05-15 23:07:04 Verifying backup consistency
2022-05-15 23:07:04 Cleaning up previously completed backup
2022-05-15 23:07:04 --- Backup completed (Duration: 0:05:03, Size: 210.69 MB, IO: 740.66 kB/s) ---
(we can also see the resultant backup file in the configured backup directory) but despite this checkMK leaves the backup in a CRIT alert state.
I notice a few other posts in this forum that may be the same problem; their threads all look to have automatically closed before finding a resolution to the issue posted (such as CRIT - Backup completed or Check_Mk backup daily : it has succed in the backup section but the plugins says it has failed)
The site is set to the Europe/London timezone, currently in BST (UTC+1).
I speculate that the issue is something to do with time arithmetic or comparison. Backups are currently set for 00:02 every day. Note how checkmk’s report describes the backup as running for 5 m from 2022-05-15 23:02:01 till 2022-05-15 23:07:04, but then mentions the Next run: 2022-05-16 00:02:00. The former are the UTC times, the latter is a BST (UTC+1) time. The latter is in fact not the start of the next backup, but the start of the backup just completed (e.g. 2022-05-15 23:02 UTC == 2022-05-16 00:02 BST), which will lead mkbackup to think that the next backup is due to start before ‘now’ (more explanation below).
Looking at the program reporting the backup results (and I believe triggering the CRIT status) in /opt/omd/sites/sitename/share/check_mk/checks/mkbackup
, the lines 92-101 look particularly relevant,
next_run = job_state["next_schedule"]
if next_run == "disabled":
yield 1, "Schedule is currently disabled"
elif next_run is not None:
if next_run < time.time():
state = 2
else:
state = 0
yield state, "Next run: %s" % get_timestamp_human_readable(next_run)
so assuming that a state
of 2 equates to a CRIT, this can be caused by next_run
being set to the time of the just-completed backup rather than the next one. I suggest that this could be why these erroneous CRIT alerts occur. next_run
comes from job_state["next_schedule"]
but I haven’t fully understood where job_state
comes from yet.
I tried digging further and downloaded the source of check_MK but started to get lost reading the code; cmk/utils/schedule.py
was looking like it might be relevant.
Update: I’ve found that the server is set to Europe/London in /etc/timezone (so BST) but UTC in $OMD_ROOT/etc/environment, so if check_MK is doing any date-time checks ‘naively’ - i.e. without considering the timezone, or without having normalised date-times to (say) UTC - then this might explain the behaviour in case the dates being compared had different timezones connected to them. If so that would still be a bug.
Any thoughts? Anyone else seeing this issue?