I have a few local checks that often go UNKNOWN in the dashboard when the server's workload is momentarily high. I am looking for a way to keep the service from going orange for a specified number of minutes; if it is still UNKNOWN after that, it should go ahead and turn orange and send notifications.
Is there a way to do this?
Pseudocode/logic I'm sort of looking for:
if UNKNOWN:
    dont_go_orange()
    state = check_in_n_minutes()
    if state is 'UNKNOWN':
        go_UNKNOWN_in_dash()
Yeah, that's what I thought, but I wanted to be sure, since that is a feature we were looking for. Our operations team doesn't focus much on the email notifications; they watch the dashboard for alerts.
Well, perhaps we can tell Checkmk to wait longer for specific local checks to return data to the server, instead of going UNKNOWN so quickly?
You could experiment with caching for the local checks, as described in the documentation.
That way you might get slightly outdated data, but depending on your specific use case that might be better than no data.
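To make the caching idea a bit more concrete, here is a minimal sketch of a local check that prints its result with a cache prefix. It assumes the `cached(TIMESTAMP,MAX_AGE)` prefix that I recall from the local checks documentation for recent Checkmk versions, so please verify the exact format against the docs for your version. The helper name `cached_local_line` and the service name are made up for illustration.

```python
import time


def cached_local_line(state, name, details, max_age, metrics="-", now=None):
    """Build one local check output line with a cache prefix.

    Assumed format (verify against your Checkmk version):
        cached(GENERATED_AT,MAX_AGE) STATE "NAME" METRICS DETAILS
    """
    # Unix timestamp of when the result was produced; overridable for testing.
    generated_at = int(time.time()) if now is None else int(now)
    return f'cached({generated_at},{max_age}) {state} "{name}" {metrics} {details}'


if __name__ == "__main__":
    # Hypothetical service: the result is allowed to be reused for 300 seconds.
    print(cached_local_line(0, "Nightly backup", "Last run finished normally", max_age=300))
```

As far as I know, the Linux agent also lets you put the script into a subdirectory of the local checks directory named after the cache interval in seconds (for example `300`), in which case the agent handles the caching itself. Check the documentation before relying on that.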
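If they only watch the web application, you could try using “Maximum number of check attempts” and then use filters to show only service problems in hard state. The service then stays in soft state until it has returned a non-OK result that many times in a row.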
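To illustrate how the check attempts setting gives you the delay from the pseudocode in the question, here is a simplified simulation of the soft/hard state logic. This is only a sketch of the general idea, not how the monitoring core is actually implemented: the `ServiceState` class is hypothetical, and real cores handle more cases (for example, changes between different non-OK states).

```python
from dataclasses import dataclass

OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3


@dataclass
class ServiceState:
    """Simplified model: a problem only becomes 'hard' after max_attempts
    consecutive non-OK results."""
    max_attempts: int
    attempt: int = 0
    hard_state: int = OK

    def update(self, result: int) -> int:
        """Feed one check result; return the hard state a filtered view would show."""
        if result == OK:
            # Any OK result resets the counter and clears the problem.
            self.attempt = 0
            self.hard_state = OK
        else:
            self.attempt = min(self.attempt + 1, self.max_attempts)
            if self.attempt >= self.max_attempts:
                # Enough consecutive failures: the problem becomes hard.
                self.hard_state = result
        return self.hard_state


if __name__ == "__main__":
    svc = ServiceState(max_attempts=3)
    results = [UNKNOWN, OK, UNKNOWN, UNKNOWN, UNKNOWN]
    print([svc.update(r) for r in results])  # [0, 0, 0, 0, 3]
```

In this model a single UNKNOWN during a load spike never reaches the hard state view, while three UNKNOWN results in a row do. With a one-minute check interval and three attempts, that works out to a delay of roughly two minutes before the service shows up as a hard problem.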