Is it possible to auto acknowledge problems (specific offline hosts)?

mgutt · January 17, 2023, 4:11pm

I am monitoring some hosts which aren’t online all the time. It should still be directly visible when they are in the state DOWN, so I don’t want to add a rule which sets them permanently to UP. Instead, I’d like to avoid producing problems which need to be acknowledged.

Is it possible to “Auto-acknowledge all host problems of hosts …” or something like “Ignore problems of hosts …”?

At the moment I have solved it by putting them into a 10-year downtime. This works, but does not feel right.

gstolz · January 18, 2023, 10:41am

Hi,

so should the hosts ever notify anyone or is the goal that they be set to auto-acknowledge/downtime anyway?

The 10-year downtime doesn’t sound too wrong to me so far.
Gerd

mike1098 · January 18, 2023, 4:30pm

I never tested that:

You can add a time period 00:00 - 00:00 and then use it in the rule “Notification period for hosts”.

Maybe give it a try and let us know.

You can also always exclude your hosts in the notification rules, but I guess you know that.

mgutt · January 19, 2023, 7:33am

No

Yes. The target is: “Ok there are hosts DOWN, but don’t treat them as an unhandled problem.”

mgutt · January 19, 2023, 7:54am

I tried this, but it’s still treated as a problem although it is now “Out of notification period”:

Translating the state to “UNREACHABLE” doesn’t work either:

“Out of service period” doesn’t work either:

mgutt · January 19, 2023, 9:30am

I’m not sure, but I think I found the correct way, although CheckMK interprets it wrongly:

I changed the “Normal check interval for host checks” to 44 seconds
I changed the “Maximum number of check attempts for host” to 25000

This way, a host state change is triggered every 110 seconds (44 seconds × 2.5 intervals), and as the maximum is 25000, the host state only changes from “SOFT (DOWN)” to “HARD (DOWN)” after around one month (25000 attempts × 110 seconds = 2,750,000 seconds ≈ 31.8 days).
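As a quick sanity check of that arithmetic, here is a minimal Python sketch. The interval and attempt count are the values from the two rules above; the factor of 2.5 intervals per attempt is only what I observed, not a documented constant, and the function name is made up for illustration.

```python
# Values from the two rules above.
CHECK_INTERVAL_S = 44          # "Normal check interval for host checks"
MAX_CHECK_ATTEMPTS = 25_000    # "Maximum number of check attempts for host"
# Assumption: observed factor between check interval and state change, not documented.
INTERVALS_PER_ATTEMPT = 2.5


def seconds_until_hard_state(interval_s, intervals_per_attempt, max_attempts):
    """Return how long a host stays SOFT (DOWN) before it turns HARD (DOWN)."""
    return interval_s * intervals_per_attempt * max_attempts


total_s = seconds_until_hard_state(
    CHECK_INTERVAL_S, INTERVALS_PER_ATTEMPT, MAX_CHECK_ATTEMPTS
)
days = total_s / 86_400  # seconds per day
print(f"{total_s:.0f} s = {days:.1f} days")  # prints "2750000 s = 31.8 days"
```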

This way the host check attempt counter counts upwards as expected:

But although the Host state is not “HARD (DOWN)” it is treated as a problem:

I mean why is a “SOFT (DOWN)” host a problem which needs action by the user? Sounds like a bug to me.

mike1098 · January 19, 2023, 9:47am

It’s the standard Nagios behavior. First it’s a soft alert, and after reaching the maximum number of check attempts it changes to a hard state. A hard state triggers a notification.

That’s what it is doing. What are you missing?

Yes, you can do this with an alert handler and livestatus.

Can you explain in detail what you expect? What do you want to “ignore”?
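To illustrate the Livestatus part: an acknowledgement can be sent as a Nagios-style external command (`ACKNOWLEDGE_HOST_PROBLEM`) wrapped in a Livestatus `COMMAND` line. The following is only a rough Python sketch that builds that line; the function name, host name, and author are placeholders, and how the script gets triggered (for example from an alert handler) and how the line is written to the site’s Livestatus socket are left out.

```python
import time


def acknowledge_host_command(host_name, author, comment,
                             sticky=True, notify=False, persistent=False,
                             now=None):
    """Build a Livestatus COMMAND line that acknowledges a host problem.

    Field order follows the Nagios external command
    ACKNOWLEDGE_HOST_PROBLEM;<host>;<sticky>;<notify>;<persistent>;<author>;<comment>
    """
    timestamp = int(time.time() if now is None else now)
    sticky_flag = 2 if sticky else 1  # 2 = keep the acknowledgement until the host is UP again
    return (
        f"COMMAND [{timestamp}] ACKNOWLEDGE_HOST_PROBLEM;{host_name};"
        f"{sticky_flag};{int(notify)};{int(persistent)};{author};{comment}\n"
    )


# Hypothetical host and author, fixed timestamp for a reproducible example.
line = acknowledge_host_command(
    "printer01", "automation", "expected offline", now=1674120000
)
print(line, end="")
```

A script like this would have to run whenever one of the “sometimes offline” hosts goes DOWN, so it acknowledges the problem instead of preventing it.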

mgutt · January 19, 2023, 10:22am

Nothing. I only wanted to be clear that this is the desired state and I don’t want to translate it through a rule to always UP.

CheckMK should not treat a DOWN host as a problem. Not sure how to explain it differently ^^

Or maybe the workflow as an example:

Step 1: Check if there are any unhandled problems

Step 2: Check problems, solve them or acknowledge them with a comment

Yes, but the soft state already triggers an unhandled problem, which I think is incorrect.

Hmm… So I need to create a BASH script which sends a request to the CheckMK API? That wouldn’t be my favorite solution, as not all team members could do it.
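For reference, such a request would not have to be BASH. Here is a rough Python sketch that only builds the acknowledgement request and does not send it. I am assuming the REST API endpoint `domain-types/acknowledge/collections/host`, the JSON field names, and the `Bearer <user> <secret>` header format here, so all of that should be checked against the REST API documentation of the site; the server name, site name, user, and host name are placeholders.

```python
import json
import urllib.request


def build_ack_request(api_base_url, user, secret, host_name, comment):
    """Build (but do not send) a request that acknowledges a host problem.

    Assumption: endpoint path and field names as in the site's REST API docs.
    """
    body = {
        "acknowledge_type": "host",
        "host_name": host_name,
        "comment": comment,
        "sticky": True,       # keep the acknowledgement until the host is UP again
        "persistent": False,
        "notify": False,
    }
    return urllib.request.Request(
        url=f"{api_base_url}/domain-types/acknowledge/collections/host",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {user} {secret}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


# Placeholder server, site, and credentials.
req = build_ack_request(
    "https://monitoring.example.com/mysite/check_mk/api/1.0",
    "automation", "secret", "printer01", "Host is expected to be offline",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```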

mike1098 · January 19, 2023, 10:35am

OK, I understand you better now.

In the dashlet Host Statistics you can add a filter to show only hard states:

Just make a clone of your main dashboard and share it with your colleagues.

I currently don’t know a way to modify the Host Statistics in the sidebar.
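The same “hard states only” idea can also be written as a Livestatus query, which makes it clearer what the filter does. This is just a sketch in Python that builds the query text; the function name is made up, and sending the query to the site’s Livestatus socket is not shown.

```python
def unhandled_hard_down_query():
    """Build a Livestatus query for hosts that are DOWN in a HARD state
    and are neither acknowledged nor in a scheduled downtime."""
    lines = [
        "GET hosts",
        "Columns: name state state_type",
        "Filter: state = 1",                     # 1 = DOWN
        "Filter: state_type = 1",                # 1 = HARD, 0 = SOFT
        "Filter: acknowledged = 0",              # not acknowledged
        "Filter: scheduled_downtime_depth = 0",  # not in a downtime
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


query = unhandled_hard_down_query()
print(query, end="")
```

Hosts that are still in “SOFT (DOWN)” do not match `state_type = 1`, so they would not show up in a view built on this filter.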

mgutt · January 19, 2023, 11:06am

Ok, this means I could replace the default “hostproblems” view with a custom view which contains an additional filter. If this is possible, it’s worth a try.

Hmm, this could cause confusion in the team: “Ok, there are still problems left … loading list … no problems found”. “Hey mgutt, solve this bug in the sidebar”

Do you now agree with my opinion that “the soft state already triggers an unhandled problem, which I think is incorrect”? Then I would open a feature request regarding this.

mike1098 · January 19, 2023, 12:49pm

As a remedy you may disallow this sidebar snapin in the user role. This way you force the user to use your custom dashboard. Maybe another community member knows a way to customize the snapin.

mike1098 · January 19, 2023, 12:52pm

No, I do not agree.
The behavior is exactly what we have had for more than 20 years in Nagios, and the community is used to it that way. Nevertheless, in Checkmk you can customize the views and dashboards to make it work as you expect.

martin.schwarz · January 20, 2023, 10:45am

I’d say … instead of fiddling with views, periods, soft/hard states, etc.:
Just put these hosts in an “everlasting” downtime and all is fine. Keep it simple.

On the Enterprise edition, instead of a downtime for the next, say, 10 years, you could also create a “recurring downtime” rule for these hosts with, say, a downtime of 1 month, repeating monthly.
Same result, but you (or your future colleagues) won’t have a surprise 10 years from now in their then “legacy” monitoring system.
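If you prefer to script the plain long downtime instead of clicking it, it can be expressed as the Nagios-style external command `SCHEDULE_HOST_DOWNTIME` sent via Livestatus. A minimal Python sketch that only builds the command line; the function name, host name, and author are placeholders, and writing the line to the Livestatus socket is not shown.

```python
import time

SECONDS_PER_DAY = 86_400


def schedule_host_downtime_command(host_name, author, comment, days, now=None):
    """Build a Livestatus COMMAND line for a fixed host downtime of `days` days.

    Field order follows the Nagios external command
    SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
    """
    start = int(time.time() if now is None else now)
    end = start + days * SECONDS_PER_DAY
    duration = end - start
    # fixed = 1 (runs exactly from start to end), trigger_id = 0 (not triggered by another downtime)
    return (
        f"COMMAND [{start}] SCHEDULE_HOST_DOWNTIME;{host_name};"
        f"{start};{end};1;0;{duration};{author};{comment}\n"
    )


# Roughly 10 years (3650 days) for a hypothetical host, starting now.
print(schedule_host_downtime_command(
    "printer01", "automation", "only online on demand", days=3650
), end="")
```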

tosch · January 20, 2023, 12:56pm

From my perspective, I would just disable notifications for this host/service and exclude it from the default view of host/service problems. That is nearly as simple as an endless downtime but, in my opinion, more the right way. This also enables you to search for these hosts/services, especially via some host tags. I guess they are handily built in, if I am not wrong.
