[Release] Checkmk stable release 2.3.0p15

Dear friends of Checkmk,

the new stable release 2.3.0p15 of Checkmk is ready for download.

This stable release ships with 8 changes affecting all editions of Checkmk,
1 change for the Enterprise editions, 0 Cloud Edition specific and
0 Managed Services Edition specific changes.

Changes in all Checkmk Editions:

Checks & agents

  • 17243 USV UPS: discover devices with .1.3.6.1.4.1.43943 as sysObjectID…
  • 17148 SEC: Persist known host keys for checks that use SSH…
  • 16892 FIX: agent_kube: requests.SSLError raised on connection using self signed certificates…

Notifications

  • 17167 FIX: HTML Email: Add from address to log on SMTP error
  • 17166 FIX: HTML Email: Handle SMTP return code 554 as permanent error…

Setup

  • 17261 Support Diagnostics: Include information about the Checkmk Appliance…
  • 17262 Support Diagnostics: More detailed list of site’s files…

Site management

  • 17133 FIX: Fix hanging ‘Creating temporary filesystem…’ during update process…

Changes in the Checkmk Enterprise Edition:

Agent Bakery

  • 17093 Use SHA256 digest when baking RPMs…

Changes in the Checkmk Cloud Edition:

NO CHANGES

Changes in the Checkmk SaaS Edition:

NO CHANGES

Changes in the Checkmk Managed Services Edition:

NO CHANGES

You can download Checkmk from our download page: Download Checkmk for free | Checkmk

List of all changes: Werks

We thank you for using Checkmk and wish you successful monitoring,

Your Checkmk Team

I updated from 2.3.0p12 to p15 (CRE) a few days ago. Since then, CPU utilization outside of spikes (which I think correlate with cmk_discovery becoming active) is up by around 13%, without any other real changes such as host or service additions or modifications. Is this a common issue, and is there a particular change between these two versions that might explain it?

As far as I can tell, there are no other reports of this behaviour, so it looks like this is something local to your environment.

Hmm. I’ve just done a jump upgrade to p19 after keeping p12 around for a fair while after rollback, and…


… while this looks notably spikier, and the lower-level, more granular hypervisor graphs confirm that as well*, it’s still in a usable state. (The ~13% increase on p15, by contrast, had made enough of a difference to render the web GUI frequently unresponsive and painful to use, and to cause fairly frequent check timeouts that generated false-positive alerts.)

As expected, the CPU is brought to its knees when discovery rolls around (this happened shortly after that screenshot). Other than that, it tends to reach 100% briefly when multiple check_mk checks are active (each process takes a bit over 10%), and they tend to come in large packs.

Besides this, sending bulked notifications seems to be quite the CPU hog when it occurs; maybe that can be defused a bit by using a plainer format.

* The interesting thing there is that the spikes aren’t taller: the amplitude (it never reaches actual 100% outside of discovery, and seems to be a rounded average over each minute) is actually identical on the hypervisor graphs. They’re just notably wider.
At first look by a factor of 2, but at second look not quite. Previously, 3 minutes would pass between each “valley” in a very regular, straightforward pattern. Now there are recurring intervals of 10 minutes, but with a more complex pattern: minutes ending in 0 and 6 always have valleys. The large peak always happens on a minute ending in 2 or 3 (this varies, but it’s fat, and even minutes 1 and 5 generally still show somewhat high values), and a much smaller peak (low amplitude and always 4 minutes wide) happens on the minute ending in 7, sometimes 8.
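For anyone who wants to check a pattern like this on their own graphs, here is a rough sketch of how per-minute CPU samples could be folded onto a 10-minute cycle. The function name `fold_by_minute` and the sample values are made up for illustration; they are not measurements from my system and not anything Checkmk provides.

```python
from collections import defaultdict
from statistics import mean


def fold_by_minute(samples, period=10):
    """Average CPU values per slot of a repeating cycle.

    samples: iterable of (minute, cpu_percent) pairs, one per minute.
    Returns {minute % period: average cpu_percent}.
    """
    buckets = defaultdict(list)
    for minute, cpu in samples:
        buckets[minute % period].append(cpu)
    return {slot: mean(values) for slot, values in sorted(buckets.items())}


# Hypothetical data: valleys on minutes ending in 0 and 6,
# a large peak on minutes ending in 2 and 3, medium load otherwise.
samples = [
    (m, 20 if m % 10 in (0, 6) else 90 if m % 10 in (2, 3) else 50)
    for m in range(60)
]

profile = fold_by_minute(samples)
lowest = min(profile.values())
valleys = [slot for slot, cpu in profile.items() if cpu == lowest]
print(profile)
print("valley slots:", valleys)
```

With real data, the `samples` list would come from whatever export the hypervisor offers; a cycle length other than 10 can be tried via `period`.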

The strange thing? This actually fits the currently configured rules for check_mk intervals better than the 3-minute peak pattern in the graphs from p12. No rule with a 3-minute interval is associated with a service bulky enough to cause a CPU spike; most hosts have it set to either 5 or 10 minutes.
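To spell out the arithmetic behind that, here is a minimal sketch. It assumes, purely for simplicity, that all checks start at minute 0, which a real scheduler would not do since it spreads checks out; the helper names are hypothetical.

```python
from functools import reduce
from math import gcd


def combined_period(intervals_min):
    """Least common multiple of the check intervals (in minutes)."""
    return reduce(lambda a, b: a * b // gcd(a, b), intervals_min)


def busy_minutes(intervals_min, horizon):
    """Minutes in [0, horizon) at which at least one check is due.

    Simplifying assumption: every check is first due at minute 0.
    """
    return [m for m in range(horizon) if any(m % i == 0 for i in intervals_min)]


intervals = [5, 10]  # the intervals most hosts use here
print(combined_period(intervals))   # overall pattern repeats every 10 minutes
print(busy_minutes(intervals, 20))  # minutes on which checks fall due
print(3 in intervals)               # no 3-minute interval among them
```

So with 5- and 10-minute intervals, the load pattern can be expected to repeat every 10 minutes, which matches what p19 shows, while nothing in these intervals would produce a 3-minute rhythm.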

It remains bizarre to me so far, but as long as it works, I’m not sure I want to roll back to the now quite old p12 version again to investigate this 3-minute-interval behaviour further. :neutral_face: