Systemd check alarming activating services

SvenS · August 27, 2021, 10:29am

Hi,

We’ve updated our Checkmk Raw instance from 1.6.0p25 to 2.0.0p9.

Now we have many alarms from the Systemd Service Summary check, complaining about services activating for too long:

Total: 160
Disabled: 6
Failed: 0
Service 'dnf-makecache' activating for: 10 h (warn/crit at 30.0 s/60 s) CRIT
Ignored: 5

All affected services have one thing in common: they are oneshot services triggered by systemd timers. My guess is that Checkmk doesn’t reset the runtime once these services have finished running.
I was not able to get a debug output as the check only fails a single time.

For now I have added these services to the exclude list. I still want to know when they fail to run, but I also don’t want flapping alerts that lead to nobody caring about the check anymore.

Do I have to live without alerting for these services?

thorian93 · August 30, 2021, 2:04pm

I think @moritz was working on some systemd related issues, maybe there are already fixes in the pipeline.

moritz · August 31, 2021, 7:04am

I’m (still) working on issues when deploying the Checkmk agent itself as a systemd service; this is unrelated. Am I right in assuming that the systemd services in question do not have ‘RemainAfterExit=’ set?
From the systemd man page: “Note that if this option [oneshot] is used without RemainAfterExit= the service will never enter “active” unit state […]”.
My guess would be the Checkmk agent only ever sees this unit “activating”.
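
To find out which units fall into this category, one could look at the properties printed by `systemctl show`. The following is only a sketch that parses such output; the function name and the sample text are made up for illustration and are not part of Checkmk.

```python
def is_plain_oneshot(show_output: str) -> bool:
    """Return True for a oneshot unit that does not set RemainAfterExit=yes.

    `show_output` is expected to contain KEY=VALUE lines as printed by
    `systemctl show <unit> --property=Type,RemainAfterExit`.
    """
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip anything that is not a KEY=VALUE line
            props[key.strip()] = value.strip()
    return props.get("Type") == "oneshot" and props.get("RemainAfterExit") != "yes"


# Illustrative sample text, not captured from a real host.
sample = "Type=oneshot\nRemainAfterExit=no\n"
print(is_plain_oneshot(sample))
```

Units for which this returns True would be the ones that show up as “activating” while they run and never as “active”.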

SvenS · August 31, 2021, 7:40am

You are correct, RemainAfterExit= is not set. As these services are triggered regularly via timers, setting RemainAfterExit= is not an option. As the man page states: “Due to this, services with RemainAfterExit= set (which stay around continuously even after the service’s main process exited) are usually not suitable for activation via repetitive timers, as they will only be activated once, and then stay around forever.”
This affects OS-provided services such as dnf-makecache on RHEL or apt-daily on Debian.

thorian93 · September 2, 2021, 6:00am

I think this is kind of a stalemate situation. We do not want to change the OS-provided services for obvious reasons, but for monitoring this causes problems.
Maybe it could be made possible to define services that are expected to be in an Activating state, @moritz? This way one could configure it globally for the mentioned services while others would still alert. One would also still be notified if these kinds of services fail.

moritz · September 2, 2021, 8:59am

Looking at the code: Currently the default state for “activating” is CRIT, and you cannot configure it. That seems quite wrong to me. Additionally: The feature of imposing levels on the time services are activating is pointless. I think you should file a support ticket (or write to feedback@checkmk.com).

thorian93 · September 2, 2021, 9:41am

I disagree: There might be situations where you want to be notified if a service is stuck in “Activating” although it should be started after a certain amount of time. Might be a corner case, but I would not want to see it removed.

moritz · September 3, 2021, 6:25am

Oh, yes! But as long as any activating service is critical anyway, there’s no point in a rule telling the service to go to warning if it is activating for, say, more than 2 minutes (as the criticality will win). This feature only makes sense if “activating” is OK in principle, but not indefinitely…
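
To spell out the argument: if the overall result is the worse of a base state and a state derived from the levels, a CRIT base state hides the levels completely. A minimal sketch of that reasoning follows. All names are hypothetical, and using `max` as “worst state” is an assumption that only holds for OK/WARN/CRIT; this is not the actual check code.

```python
OK, WARN, CRIT = 0, 1, 2


def level_state(seconds: float, warn: float, crit: float) -> int:
    """Map the time spent activating onto a state using warn/crit levels."""
    if seconds >= crit:
        return CRIT
    if seconds >= warn:
        return WARN
    return OK


def summary_state(base_state: int, seconds: float, warn: float, crit: float) -> int:
    """Combine the base state for 'activating' with the level result."""
    return max(base_state, level_state(seconds, warn, crit))


# With CRIT as the base state the levels never matter:
print(summary_state(CRIT, 5, 30, 60) == summary_state(CRIT, 500, 30, 60))
# With OK as the base state they do:
print(summary_state(OK, 5, 30, 60), summary_state(OK, 45, 30, 60))
```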

moritz · September 3, 2021, 12:43pm

I just realized I misread the code. The default for “activating” is not “critical”, sorry. We also just spotted a bug that prevented the correct computation of the time period a service was activating for.

Coming back to your original question, @SvenS: This is particularly unfortunate, as the Checkmk agent service itself will be affected (at least in 2.1)…
We’re planning to make some adjustments.

For now, I think the bugfix will help you. The service is not really in the “activating” state for 10 hours, is it? If you are familiar with local changes to check plugins you can test the bugfix yourself by unindenting the line containing “set_item_state” in “checks/systemd_units” by one level (thus moving it out of the scope of the “for” loop). After that, “activating” should be considered OK when first encountered (on a per service basis). The levels on how long “activating” is considered OK can be configured.
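
To illustrate why the position of that one line can matter, here is a simplified sketch. It is not the code from “checks/systemd_units”; the function, the `store` dictionary and the key name are assumptions made for the example. The idea is that the first-seen timestamps are written back once, after the loop, so that a service which has stopped activating is forgotten even in a cycle where the loop body never runs.

```python
def activating_durations(activating_now, store, now):
    """Return {service: seconds spent activating} and update `store`.

    `store` stands in for the state a check keeps between two runs.
    """
    previous = store.get("first_seen", {})
    current = {}
    for name in activating_now:
        # Keep the original timestamp if the service was already activating
        # during the previous run, otherwise start counting now.
        current[name] = previous.get(name, now)
    # Written once, outside the loop: services that are no longer
    # activating are dropped, even when `activating_now` is empty.
    store["first_seen"] = current
    return {name: now - first_seen for name, first_seen in current.items()}


store = {}
print(activating_durations(["dnf-makecache"], store, now=0.0))
print(activating_durations([], store, now=100.0))
print(activating_durations(["dnf-makecache"], store, now=36000.0))
```

If the write happened inside the loop instead, the second call in this sketch would leave the old timestamp in place, and the third call would report the service as activating for the whole 10 hours.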

SvenS · September 3, 2021, 1:03pm

The service is not really in the “activating” state for 10 hours, is it?

Correct, the service is activating for seconds, sometimes a minute, but definitely not hours.

I’ve applied the bugfix as mentioned. I’ll watch it over the weekend and report back on Monday.

SvenS · September 6, 2021, 6:36am

The bugfix works great. Didn’t see any false alarms over the weekend.

xavierstarwin28 · November 12, 2021, 5:11am

I am facing a similar issue with “activating” alerts and want to apply the bugfix. Since I am new to Checkmk, could you please provide detailed steps for applying it?

moritz · November 12, 2021, 6:34am

Hi @xavierstarwin28 ,
This is one of the legacy checks, which means it’s residing in share/check_mk/checks/systemd_units.
If you copy it to local/share/check_mk/checks/systemd_units, that file will be used instead of the shipped one. You can then modify the file as you wish. If things go wrong, just remove the file.
The fix to this particular problem (systemd_units_services_summary: incorrect activating/reloading period shown in service), however, has been released with version 2.0.0p10.
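
For those who prefer a script, the copy step described above could look roughly like this. It is only a sketch: `site_root` is an assumed parameter that has to point to the home directory of your site, and the function name is made up.

```python
from pathlib import Path
import shutil


def make_local_override(site_root, plugin="systemd_units"):
    """Copy a shipped legacy check into the local hierarchy of a site.

    `site_root` is an assumption: the directory that contains both
    share/check_mk/checks and local/share/check_mk/checks.
    """
    site_root = Path(site_root)
    shipped = site_root / "share" / "check_mk" / "checks" / plugin
    local = site_root / "local" / "share" / "check_mk" / "checks" / plugin
    local.parent.mkdir(parents=True, exist_ok=True)
    if not local.exists():  # never overwrite an existing local modification
        shutil.copy2(shipped, local)
    return local
```

Deleting the returned file brings back the shipped version of the check.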

xavierstarwin28 · November 12, 2021, 3:00pm

Thank you very much !!!

flo · June 10, 2022, 11:42am

From what I can see, the proposed bugfix and its integration into version 2.0.0p10 do not solve the original problem. I have an installation with 2.0.0p20 here, and the Systemd Service Summary still reports my service (oneshot, triggered by a systemd timer) as activating for 71 m (warn/crit at 30.0 s/60 s), which results in a wrong CRIT state.

moritz · June 13, 2022, 7:18pm

The original problem as I understood it was that a service was considered ‘activating’ for over 10 hours, when in fact it wasn’t. That is fixed.

Your problem (if I understand correctly) is that your service really is “activating” (which for a oneshot service basically means “running”) for a long time, but that is fine.
In this case you could either exclude it entirely from the monitoring, or configure the long activating period to be OK.
