Acknowledging problems fails sometimes

CMK version:
Checkmk Enterprise Edition 2.2.0p33

OS version:
Debian 11 Bullseye

Hi all,
we have a distributed monitoring setup with multiple CMK slaves in different regions and one master.
We noticed that acknowledging a problem sometimes doesn’t work, and we have to set the acknowledgement multiple times until it finally takes effect and is shown in the GUI as ACKed.

I checked the cmc.log on the slave and only the successful ACK is shown there. The earlier, unsuccessful attempts are not logged at all.

OMD[sta2]:~$ cat  var/log/cmc.log | grep ACKNOWLED
2024-09-27 10:39:01 [5] [core 3290599] Executing external command: ACKNOWLEDGE_SVC_PROBLEM;hostnameXYZ;HTTP ssl-cert;2;0;0;mlbz;Debug test;1727685541
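
For comparing what actually reached the core against what was clicked in the GUI, it can help to split such log lines into fields. The following is only a minimal sketch: the function name `parse_ack_line` is made up for illustration, the field order follows the usual `ACKNOWLEDGE_SVC_PROBLEM` external command layout, and treating the trailing number as an expiry timestamp is an assumption. It also assumes the comment contains no semicolons.

```python
import re
from datetime import datetime, timezone

# Sample line copied from the cmc.log excerpt above.
LINE = (
    "2024-09-27 10:39:01 [5] [core 3290599] Executing external command: "
    "ACKNOWLEDGE_SVC_PROBLEM;hostnameXYZ;HTTP ssl-cert;2;0;0;mlbz;Debug test;1727685541"
)

PATTERN = re.compile(
    r"^(?P<logged_at>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"Executing external command: (?P<command>[A-Z_]+);(?P<args>.*)$"
)


def parse_ack_line(line):
    """Return the fields of an ACKNOWLEDGE_SVC_PROBLEM log line, or None."""
    match = PATTERN.match(line.strip())
    if not match or match["command"] != "ACKNOWLEDGE_SVC_PROBLEM":
        return None
    fields = match["args"].split(";")
    if len(fields) < 7:
        return None
    host, service, sticky, notify, persistent, author, comment = fields[:7]
    # Assumption: an optional 8th field is the expiry as a Unix timestamp.
    expires = None
    if len(fields) > 7 and fields[7].isdigit():
        expires = datetime.fromtimestamp(int(fields[7]), tz=timezone.utc)
    return {
        "logged_at": match["logged_at"],
        "host": host,
        "service": service,
        "sticky": int(sticky),
        "notify": int(notify),
        "persistent": int(persistent),
        "author": author,
        "comment": comment,
        "expires": expires,
    }


if __name__ == "__main__":
    print(parse_ack_line(LINE))
```

Running this over the whole cmc.log and listing `logged_at` per host and service makes it easier to match log entries against the times the ACK was attempted in the GUI.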

The same happens when you try to remove an existing ACK. It does not always work on the first try and you need to repeat it multiple times.

So it looks like not all ACKs, or removals of ACKs, are forwarded from the master to the slave.
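
One way to test that theory would be to send the same command directly on the slave, bypassing the master and the Livestatus proxy. If it always arrives that way, the loss is more likely somewhere on the forwarding path. This is a rough sketch only: the function names are hypothetical, the socket path `tmp/run/live` under the site directory is assumed, and the fixed values `2;0;0` simply mirror the log line above.

```python
import socket
import time


def build_ack_command(host, service, author, comment, now=None):
    """Build a Livestatus COMMAND line for a service acknowledgement."""
    timestamp = int(time.time() if now is None else now)
    # Same argument order as in the log line: sticky=2, notify=0, persistent=0.
    return (
        f"COMMAND [{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;"
        f"{host};{service};2;0;0;{author};{comment}\n"
    )


def send_command(command, socket_path):
    """Write one command line to the site's Livestatus UNIX socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("utf-8"))
        # Signal end of request; commands do not return a response body.
        sock.shutdown(socket.SHUT_WR)


if __name__ == "__main__":
    # Run as the site user on the slave; path is an assumption, adjust as needed.
    cmd = build_ack_command("hostnameXYZ", "HTTP ssl-cert", "mlbz", "Debug test")
    send_command(cmd, "tmp/run/live")
```

Each command sent this way should show up as an "Executing external command" line in cmc.log on the slave.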

Has anyone faced this issue, too?

Cheers,
mlbz

Hi Mlbz,

are there any connection problems to the remote sites via Livestatus, such as timeouts? And are you using the Livestatus proxy?

Thanks for your fast reply! The master and the slave are in the same subnet and I am not aware of any timeouts or other connection issues. And yes, we use the livestatus proxy.

OK, strange.

When you say it does not work, does that mean the GUI idles and nothing happens, the page loads endlessly, or what exactly happens?

Actually, everything seems to work fine. There is no visible issue in the GUI. You get asked if you really want to ACK the problem, you confirm, and then you click “back to view”.
There you can see that nothing has changed, meaning there is no ACK icon next to the service.

I just tested, and the behavior is actually the same when I try to schedule a downtime. So there is no difference between acknowledgements and scheduled downtimes. Both work only sometimes and I can’t detect any pattern.
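
To check whether an ACK or downtime really landed on the slave, independent of what the GUI view shows, the state can be read straight from the slave's core via Livestatus. This is a minimal sketch under assumptions: the helper names are invented for illustration and the socket is assumed to be at `tmp/run/live` under the site directory. The columns `acknowledged` and `scheduled_downtime_depth` come from the Livestatus `services` table.

```python
import json
import socket


def build_state_query(host, service):
    """Build a Livestatus query for the ACK and downtime state of one service."""
    return (
        "GET services\n"
        "Columns: acknowledged scheduled_downtime_depth\n"
        f"Filter: host_name = {host}\n"
        f"Filter: description = {service}\n"
        "OutputFormat: json\n"
        "\n"
    )


def parse_state(raw):
    """Turn the JSON response into a small dict, or None if no service matched."""
    rows = json.loads(raw) if raw.strip() else []
    if not rows:
        return None
    acknowledged, downtime_depth = rows[0]
    return {"acknowledged": bool(acknowledged), "in_downtime": downtime_depth > 0}


def query_site(query, socket_path):
    """Send a query to the Livestatus UNIX socket and return the raw response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    # Run as the site user on the slave; path is an assumption, adjust as needed.
    raw = query_site(build_state_query("hostnameXYZ", "HTTP ssl-cert"), "tmp/run/live")
    print(parse_state(raw))
```

Running this right after each attempt in the GUI would show whether the command was lost on the way or whether only the view failed to update.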

Does nobody else have this kind of issue? We updated to 2.2.0p36, but nothing changed.

Are you still facing this issue?

Yes, but not as often as before.

2.2 is at best borderline out of support, if not already fully out of support. My personal suggestion is to upgrade (with proper backups, of course) to the latest patch of 2.2, and then to at least the latest patch of 2.3. There can be breaking changes between major versions, especially if you use third-party/Exchange MKPs or plugins.

It might be best to turn off notifications on the site, take a backup of the site without history or logs, turn notifications back on for the main site, and restore the backup under a new name (_qa, for example). Then test the upgrade path/process on the new instance, so that you don’t risk prod and can identify possible issues.

It still happens after the update to 2.3, but not as often as before.