Strange behaviour livestatus query statehist duration

Hello everybody,

I am trying to use the livestatus API to create a daily report of how many seconds a service has spent in the OK, WARN or CRIT state over the last 24 hours. For a reason I cannot determine, livestatus sometimes returns the number of seconds the service has been OK, WARN or CRIT since midnight instead of over the last 24 hours.

I have reproduced this issue on both a clean installation of Check_mk raw 1.6.0p18 on Red Hat Enterprise Linux 8.3 using the RPM installation AND the Check_MK Docker image.

To determine the time window in UNIX time format I use the following in Bash:

# Size of the window in hours
historywindow=24

# Current time in UNIX time format (end of the window)
currenttime=$(/bin/date +%s)

# The time $historywindow hours ago in UNIX time format (beginning of the window)
historytime=$((currenttime - historywindow * 3600))

This works fine: $((currenttime - historytime)) evaluates to 86400 (24 hours). So far so good.

I run the following query using the lq binary as the user the OMD site is running as:

lq "GET statehist\nColumns:host_name service_description\nFilter: time >= $historytime\nFilter: time < $currenttime\nFilter: name = $host\nStats: sum duration_ok\nStats: sum duration_warning\nStats: sum duration_critical\nOutputFormat: json\n"
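For readability, the same query can also be assembled one header per line. This is only a minimal sketch: build_statehist_query is a hypothetical helper name, the headers are copied from the query above (with a space added after "Columns:"), and piping the result into lq assumes that lq reads a query from standard input when called without arguments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the statehist query with one header per line.
# Arguments: host name, window start (epoch seconds), window end (epoch seconds).
build_statehist_query() {
    local host=$1 from=$2 to=$3
    printf '%s\n' \
        "GET statehist" \
        "Columns: host_name service_description" \
        "Filter: time >= $from" \
        "Filter: time < $to" \
        "Filter: name = $host" \
        "Stats: sum duration_ok" \
        "Stats: sum duration_warning" \
        "Stats: sum duration_critical" \
        "OutputFormat: json"
}

# Example values taken from the second log entry further down in this post.
query=$(build_statehist_query "my-host.local" 1605003603 1605090003)
printf '%s\n' "$query"

# Assumption: lq accepts the query on standard input, e.g.
#   build_statehist_query "$host" "$historytime" "$currenttime" | lq
```

Printing the query before sending it makes it easy to confirm that both time filters really carry the intended epoch values.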

I am monitoring only a single host of which all services have been OK since I started monitoring it. Using the Check_MK GUI I can see the history and graphs for the services showing an uptime over 24 hours.

I am running the lq query every minute and logging the output to a logfile. At some point the "sum duration_ok" value, which has stayed at 86399 (24 hours minus 1 second) for a long time, drops to 40802 in the example below (the number of seconds since midnight).

My logging of the point where it changes looks like this:

Wed Nov 11 11:19:03 CET 2020
currenttime: 1605089943
historytime: 1605003543
currenttime-historytime: 86400
$lqpath "GET statehist\nColumns:host_name service_description\nFilter: time >= $historytime\nFilter: time < $currenttime\nFilter: name = $host\nStats: sum duration_ok\nStats: sum duration_warning\nStats: sum duration_critical\nOutputFormat: json\n"
[["my-host.local","",86399,0,0],
["my-host.local","CPU load",86399,0,0],
["my-host.local","CPU utilization",86399,0,0],
["my-host.local","Check_MK Discovery",86399,0,0],
["my-host.local","Check_MK",86399,0,0],
["my-host.local","Disk IO SUMMARY",86399,0,0],
["my-host.local","Filesystem /",86399,0,0],
["my-host.local","Filesystem /boot",86399,0,0],
["my-host.local","Interface 2",86399,0,0],
["my-host.local","Interface 3",86399,0,0],
["my-host.local","Kernel Context Switches",86399,0,0],
["my-host.local","Kernel Major Page Faults",86399,0,0],
["my-host.local","Kernel Process Creations",86399,0,0],
["my-host.local","Memory",86399,0,0],
["my-host.local","Mount options of /",86399,0,0],
["my-host.local","Mount options of /boot",86399,0,0],
["my-host.local","Number of threads",86399,0,0],
["my-host.local","Systemd Service Summary",86399,0,0],
["my-host.local","TCP Connections",86399,0,0],
["my-host.local","Temperature Zone 0",86399,0,0],
["my-host.local","Uptime",86399,0,0]]

Wed Nov 11 11:20:03 CET 2020
currenttime: 1605090003
historytime: 1605003603
currenttime-historytime: 86400
$lqpath "GET statehist\nColumns:host_name service_description\nFilter: time >= $historytime\nFilter: time < $currenttime\nFilter: name = $host\nStats: sum duration_ok\nStats: sum duration_warning\nStats: sum duration_critical\nOutputFormat: json\n"
[["my-host.local","",40802,0,0],
["my-host.local","CPU load",40802,0,0],
["my-host.local","CPU utilization",40802,0,0],
["my-host.local","Check_MK Discovery",40802,0,0],
["my-host.local","Check_MK",40802,0,0],
["my-host.local","Disk IO SUMMARY",40802,0,0],
["my-host.local","Filesystem /",40802,0,0],
["my-host.local","Filesystem /boot",40802,0,0],
["my-host.local","Interface 2",40802,0,0],
["my-host.local","Interface 3",40802,0,0],
["my-host.local","Kernel Context Switches",40802,0,0],
["my-host.local","Kernel Major Page Faults",40802,0,0],
["my-host.local","Kernel Process Creations",40802,0,0],
["my-host.local","Memory",40802,0,0],
["my-host.local","Mount options of /",40802,0,0],
["my-host.local","Mount options of /boot",40802,0,0],
["my-host.local","Number of threads",40802,0,0],
["my-host.local","Systemd Service Summary",40802,0,0],
["my-host.local","TCP Connections",40802,0,0],
["my-host.local","Temperature Zone 0",40802,0,0],
["my-host.local","Uptime",40802,0,0]]
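The "seconds since midnight" reading of the second log entry can be cross-checked with plain integer arithmetic. The sketch below assumes the server clock is on CET (UTC+1, no daylight saving time in November), as the log timestamps suggest; seconds_since_local_midnight is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: seconds elapsed since local midnight for an epoch
# timestamp, given a fixed UTC offset in seconds.
seconds_since_local_midnight() {
    local epoch=$1 utc_offset=$2
    echo $(( (epoch + utc_offset) % 86400 ))
}

currenttime=1605090003   # from the log entry at 11:20:03 CET
historytime=1605003603   # from the same log entry
utc_offset=3600          # assumption: CET = UTC+1

window=$(( currenttime - historytime ))
since_midnight=$(seconds_since_local_midnight "$currenttime" "$utc_offset")

echo "window length:          $window"
echo "seconds since midnight: $since_midnight"
```

This prints 86400 for the window and 40803 for the seconds since local midnight. The reported 40802 is one second short of 40803, in the same way that the earlier 86399 is one second short of the 86400-second window, so the value really does line up with local midnight rather than with historytime.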

I am unsure why this is happening. One minute everything is fine, and a minute later the livestatus query seems to replace my 24-hour time window with the time since midnight. Am I doing something stupid? I hope somebody can shed some light on this behaviour.

Kind regards,
Stephan

Would think this might help avoid some unnecessary calcs…

date +%s -d 'now - 1 day'
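As a rough check that the date-based form gives the same window as the manual subtraction, the sketch below evaluates both ends from one fixed reference time in UTC. It assumes GNU date (for -d and relative items such as "1 day ago"); the reference time is the 11:20:03 CET log entry expressed in UTC.

```shell
#!/usr/bin/env bash
# Assumption: GNU date. -u forces UTC so no clock change can affect the result.
reference='2020-11-11 10:20:03'

end=$(date -u -d "$reference" +%s)
start=$(date -u -d "$reference 1 day ago" +%s)

echo "end:    $end"
echo "start:  $start"
echo "window: $(( end - start ))"
```

In UTC the two approaches agree on an 86400-second window. As far as I know, GNU date treats "1 day" as a calendar day in the local time zone, so around a daylight saving change the date-based window may be 23 or 25 hours long, whereas subtracting 24*3600 always gives exactly 24 hours.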
