No new Alerts for second service failing in systemd service summary check

bart.versluis · April 8, 2024, 11:44am

2.1.0p10.cme
RHEL8 & 9 | Ubuntu 20.04

Hi all, because we monitor many sites for different customers, we use email notifications and the Opsgenie plug-in to receive alerts for our services.

I noticed that as soon as one service has failed, the systemd service summary check stays in the critical state. As a result, we receive no new alerts when further services fail while we are still addressing the first problem. I was wondering if there is a way around this that does not involve managing every single systemd service as its own check. That could be a solution, but it would require maintenance every time a new service is added.

Does anyone have any experience addressing a similar issue?
Thanks in advance,
Bart

ChristianM · April 8, 2024, 1:41pm

Hi.

This is an operational problem. Checkmk reads the information from systemd and shows any issue it finds there. You need to reset the failed state in systemd on the target system. To do that, run “systemctl reset-failed”.
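
Before clearing the failed state it can help to record which units were actually failed. A minimal sketch in Python, assuming the plain output of “systemctl list-units --state=failed --plain --no-legend” with the unit name in the first column; the function names parse_failed_units and failed_units are made up for illustration:

```python
import subprocess
from typing import List


def parse_failed_units(output: str) -> List[str]:
    """Return the unit names (first column) from plain list-units output."""
    units = []
    for line in output.splitlines():
        fields = line.split()
        if fields:  # skip empty lines
            units.append(fields[0])
    return units


def failed_units() -> List[str]:
    """Ask systemd for the failed units; needs systemctl on the target host."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    return parse_failed_units(result.stdout)
```

Calling failed_units() on the target system and logging the result before the reset keeps a record of what was failed.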

RG, Christian

bart.versluis · April 8, 2024, 1:55pm

Hi Christian, thanks for your reply.

My problem lies more in how notifications are generated: only a change in the state of the check triggers one. In that regard your solution does work, because it resets the services from failed to stopped. However, it would mean that a problem which still needs solving is no longer visible. That is even less desirable than not receiving new notifications.

I have also experimented with excluding the failed service by regex, which is a built-in function. However, this too results in a monitoring state that does not reflect the actual state of the system.

Do you have any suggestions for getting further notifications while the check remains critical but additional services show up as failed?

LaSoe · April 8, 2024, 3:14pm

Hi Bart,

so you also want Checkmk to trigger an alarm if other systemd services fail in the meantime and are added to the list of failed services. Unfortunately, Checkmk cannot handle cases where the cause of the alert changes (e.g. if an additional systemd service fails) but the state does not change.

Basically, in such a case Checkmk would have to reset the service internally to OK and immediately back to Critical, which would then result in a corresponding alarm.

Perhaps you could use an alert handler to reset the service to OK when the list of failed services changes. Checkmk would then alert again on the next check run if the problem persists. Not a nice solution, but at least one that alerts you.
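
The "list of failed services changed" detection could be sketched roughly like this in Python. It only illustrates the comparison between two runs; the state file and all function names are hypothetical, and how the service is actually reset to OK in Checkmk is left open here:

```python
import json
from pathlib import Path
from typing import Iterable, Set


def newly_failed(previous: Iterable[str], current: Iterable[str]) -> Set[str]:
    """Units that are failed now but were not failed at the last run."""
    return set(current) - set(previous)


def load_previous(state_file: Path) -> Set[str]:
    """Read the unit list stored by the last run; empty if none exists."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def save_current(state_file: Path, current: Iterable[str]) -> None:
    """Persist the current unit list for the next comparison."""
    state_file.write_text(json.dumps(sorted(set(current))))


def check_for_new_failures(state_file: Path, current: Iterable[str]) -> Set[str]:
    """Compare against the stored list, then store the current one."""
    current_set = set(current)
    added = newly_failed(load_previous(state_file), current_set)
    save_current(state_file, current_set)
    return added
```

A non-empty result would be the trigger for the reset and re-notification. Note that a unit which recovers and fails again between two runs is not caught by this comparison.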

RG, Lars
PS: That would be something for the ideas portal. You’ve already got my vote.