The new stable release 2.3.0p28 of Checkmk is ready for download.
This stable release ships with 14 changes affecting all editions of Checkmk,
7 changes for the Enterprise editions, 0 Cloud Edition specific and
2 Managed Services Edition specific changes.
Changes in all Checkmk Editions:
BI
17596 FIX: Remove double confirmation when deleting a BI rule…
Checks & agents
17370 Ship python package “oracledb” with omd… NOTE: Please refer to the migration notes!
17567 FIX: Fix predictions calculation for predictive levels…
What I see is that it is complaining on all monitored hosts about a failed service.
These hosts are a mixture of Rocky Linux 9.5, Debian, Proxmox, Ubuntu and SUSE Linux Enterprise Server.
When examining a host, I discovered a failed service:
systemctl list-units --failed
  UNIT                          LOAD   ACTIVE SUB    DESCRIPTION
● oes-telemetry-agent.service   loaded failed failed OES Telemetry agent for OES
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
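To repeat this check across a mix of hosts, the output above can be parsed with a small script. This is only a minimal sketch, assuming the column layout shown above (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION); the function name `failed_units` and the sample text are illustrative and not part of Checkmk or systemd.

```python
def failed_units(output: str) -> list[str]:
    """Return unit names whose ACTIVE column reads 'failed'.

    Assumes the layout of `systemctl list-units --failed` as shown above.
    """
    units = []
    for line in output.splitlines():
        # Strip the status marker systemd prints in front of failed units
        # (a bullet or cross, depending on the systemd version).
        fields = line.lstrip("●×* \t").split()
        # Unit rows carry at least UNIT, LOAD, ACTIVE and SUB columns;
        # header and legend lines never have 'failed' in the third column.
        if len(fields) >= 4 and fields[2] == "failed":
            units.append(fields[0])
    return units


# Sample input copied from the output above (shortened legend).
sample = """\
  UNIT                          LOAD   ACTIVE SUB    DESCRIPTION
● oes-telemetry-agent.service   loaded failed failed OES Telemetry agent for OES

LOAD   = Reflects whether the unit definition was properly loaded.
1 loaded units listed. Pass --all to see loaded but inactive units, too.
"""

print(failed_units(sample))
```

On a live host the input text could come from `subprocess.run(["systemctl", "list-units", "--failed"], capture_output=True, text=True).stdout`; the parsing itself does not depend on how the text was obtained.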
As a test, I disabled it to see if it would drop the reported CRIT:
+1 to Glowsome’s issue. I upgraded today from version 2.3.0p15.cre to 2.3.0p28.cre and all my servers showed up with critical status (failed services) in the monitoring dashboard. It appears Checkmk stopped honoring the “Setup > Services > discovery rules > Disabled services” rules I had in place to not alert on lm_sensors, ntpdate, etc.
I also have the Azure metric-sourcers service on several systems. While I don’t have a rule in place for it, Checkmk did not alert on the service before the upgrade. Thanks!
If possible, I would like to be included in pre-release tests, to verify the issue is no longer present.
As this issue has now persisted for 2 releases, it is becoming hard to justify updating over here.
Not to be/sound like a party-pooper or ranter, but I do think that quality control of releases needs more than ‘internal testing’ to ensure a proper landing in the field.
I just need some additional validation to justify upgrades, as there is still a sentiment of “if it works, don’t fix it” over here.
First of all, thanks for finding the issue. I agree on the testing part. Here is an MKP, which simulates the fix that will land in p29.
Also, sorry for any inconvenience this has caused and thanks for the help!
Let me know if there are any new issues, or if the current one persists.