Checkmk 2.5 Beta and Community Call

Hello Community! :waving_hand:

We are pleased to announce that we are ready to launch the Beta Testing phase for Checkmk 2.5!

This release introduces a number of significant improvements, including enhanced Azure monitoring, OTel integration for cloud-focused editions, new dashboards, and various interface refinements.

We have dedicated the past months to thorough testing and checking of all changes, and we expect that all major issues have been identified and resolved.

However, since we cannot replicate your unique environment, we invite you to put our work to the test. By reporting any bugs you discover, you will help us ensure these issues are fixed before the official release, guaranteeing that Checkmk 2.5 works flawlessly in your specific environment!

Here is the process for participating in the beta:

  1. Prepare a test environment: This environment should be as close to your production setup as possible. Please ensure that your production environment remains secure, as a beta version is not final and could cause unexpected issues.
  2. Download the edition of Checkmk 2.5 that you normally use in production.
  3. Have fun testing: explore different processes and scenarios – whatever you usually do with Checkmk, including corner cases.
  4. Report any incorrect behaviour:
  • If you have a support contract, please report bugs through the designated category in our support portal.
  • If you do not have support portal access, we have created a category for beta testing on the forum for your reports. It will be monitored by our QA engineers, who will initiate action as needed. Please ensure you do not share any sensitive data there.

To provide you with all necessary information and answer any questions, we will host a community call about the Checkmk 2.5 beta testing with Gregor and Nadia, members of our QA team. Join us via Zoom on the 26th of March at 3 PM Berlin time.

We look forward to seeing you there!

5 Likes

Anyone who runs a simple “omd cp mysite newsite” or “omd restore newsite”, updates the newly cloned site, and then starts it to verify that everything still works after the update and to explore the new version, without taking proper precautions first, can quickly run into serious trouble. After all, no one wants to end up with two sites that both query the same hosts, send alerts, and interfere with each other.

Given Checkmk’s extensive experience, it would be highly beneficial to provide a clear and practical guide for setting up production-like test sites. Such a guide should cover essential topics such as disabling alerts, avoiding agent queries and configuration pushes, refraining from updating agents, and using simulated agent data. This would help prevent common mistakes and ensure stable, realistic testing conditions.

4 Likes

You can (and probably should with any test upgrade) run the copied site in simulation mode. Check the following:

Notifications need to be disabled once the site starts up, but simulation mode should be able to reduce the noise of any false/duplicate alerts between startup and when you get into the master control to disable them.

One option to use:

omd start
lq "COMMAND [$(date +%s)] DISABLE_NOTIFICATIONS"

That should disable all notifications on startup without having to get into the UI.
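
A minimal sketch of the same idea wrapped in a small helper, in case it is useful. The function names `livestatus_command` and `start_quiet` are made up for illustration; only `omd start`, `lq` and `DISABLE_NOTIFICATIONS` come from the lines above. It assumes it is run as the site user, where `lq` is available.

```shell
# Build a Livestatus external command line stamped with the current Unix time.
livestatus_command() {
    printf 'COMMAND [%s] %s\n' "$(date +%s)" "$1"
}

# Start the site and disable notifications right away.
# Hypothetical helper: does nothing unless lq is on the PATH.
start_quiet() {
    if command -v lq >/dev/null 2>&1; then
        omd start
        lq "$(livestatus_command DISABLE_NOTIFICATIONS)"
    else
        echo "lq not found; run this as the site user" >&2
        return 1
    fi
}
```

There is still a short window between startup and the command being processed, so this reduces the risk of duplicate alerts rather than removing it.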

Beta releases are never tested with production hosts. Beta releases are deployed with Ansible just like our production environments, but only have a few test hosts of various OS types where we check if basic stuff breaks (as it always does).

Once “stable” is released, it's upgraded into our dev environment, then to UAT and finally Prod.
UAT and Prod have CMDB connections, and UAT is large enough to mimic production.

Simulation mode in our case does not really do much, as it's too far away from production-like monitoring with all the special agents, distributed monitoring, exports to InfluxDB, etc.

Now it will be even harder if exploring the OTel path…

1 Like

Even with thorough testing and by skipping the first few releases, we’ve still encountered unexpected issues after major updates - problems that only appeared in the live environment. Thankfully, the Checkmk team has always provided quick and effective fixes. Many thanks for the great support!

Since most issues and edge cases only become apparent under real operating conditions, testing should take place as early as possible and in environments that closely mirror production. In a complex MSP setup, that’s often easier said than done.

Simulation mode is a great starting point. It would be safer and more efficient if it could be enabled directly via an OMD command before site startup - automatically disabling notifications - instead of requiring edits to global.mk or manual adjustments in the GUI afterward.
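
Until something like that exists, the global.mk edit can at least be scripted so that it happens before the first start of the copied site. This is only a sketch under assumptions: the helper name `enable_simulation_mode` is made up, and the file path has to be passed in by the caller (for a site user that would typically be the site's own etc/check_mk/conf.d/wato/global.mk).

```shell
# Set simulation_mode = True in the given global.mk.
# Replaces an existing assignment, or appends one if none is present.
enable_simulation_mode() {
    file=$1
    if grep -q '^simulation_mode[[:space:]]*=' "$file" 2>/dev/null; then
        # Rewrite the existing line via a temporary file (portable, no sed -i).
        sed 's/^simulation_mode[[:space:]]*=.*/simulation_mode = True/' "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf 'simulation_mode = True\n' >> "$file"
    fi
}
```

Usage would be along the lines of `enable_simulation_mode "$OMD_ROOT/etc/check_mk/conf.d/wato/global.mk"` while the site is still stopped. The helper is idempotent, so running it twice leaves a single assignment. It does not disable notifications; that still has to be done separately.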

One possible enhancement could be to allow simulation mode to periodically copy cache data from the original site on demand. This would make it easier to replicate changing agent data and other dynamic system states.

The better Checkmk supports users in building test environments that reflect production more accurately, the earlier we can help make Checkmk stable and reliable throughout its development process.

A lot has already improved, but realistic testing remains one of the biggest ongoing challenges for us.

2 Likes

What we did in our CI/CD pipeline was to write a Checkmk agent simulator. We wanted end-to-end tests, as networking is a big part of this. The simulator would listen on a bunch of different ports and was configured with Ansible. It worked quite well, but was ditched down the road when TLS on the agent, push agents etc. came into play.

This would also only solve agent-based monitoring.
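
For anyone curious what the payload side of such a simulator can look like, here is a rough sketch, not the simulator described above. The function name `fake_agent_output`, the hostname and the service name are placeholders; the output follows the plain-text agent format with `<<<section>>>` headers and one local check line.

```shell
# Print a minimal agent-style payload: a check_mk section plus one local check.
# Hostname and service name are placeholders for illustration only.
fake_agent_output() {
    host=${1:-testhost}
    printf '<<<check_mk>>>\n'
    printf 'Version: simulated\n'
    printf 'AgentOS: linux\n'
    printf 'Hostname: %s\n' "$host"
    printf '<<<local:sep(0)>>>\n'
    # Local check line: state, quoted service name, "-" for no metrics, summary.
    printf '0 "Simulated service" - OK from fake agent on %s\n' "$host"
}
```

Serving this over TCP (the agent's default port is 6556) needs a separate listener, which is left out here, and it covers neither TLS nor push agents.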

2 Likes

As mentioned in the Community Call: I have been using Simulation Mode for the last two years on a copy of my production environment. For the last year or so, this part of the documentation

All active network queries (ping, HTTP, etc.), will be ‘bent’ to 127.0.0.1.

is no longer true for me. Certificate checks or HTTPS checks are tried against the actual host, usually failing due to network constraints.

I am setting

simulation_mode = True

in /etc/check_mk/conf.d/wato/global.mk on all involved sites. The corresponding element in the GUI confirms that Simulation Mode is on.

Is there something more to configure? Or can someone confirm this problem?

Hi @joernc !
I believe this is the bug we discussed during the call – the QA team has already created a ticket for it.
Thank you for reporting!

Yes, that was me :slight_smile:

I wasn’t sure if I would hear the outcome of that ticket - especially if this is a local problem you can’t replicate. So I wanted to involve some more people and check if they have the same problem or if I am seeing pink elephants.

1 Like

I strongly suggest not only chasing that minor issue with the simulation mode - please fix it once and for all and make it usable.

This has been pointed out for at least 5 major releases. It would make things so much easier if it worked “on the fly” for all your customers, and it would let you gather much more valuable feedback quickly, without taking up valuable time from all of us.

https://ideas.checkmk.com/suggestions/338091/omd-start-site-rescue-the-ability-to-launch-a-site-in-all-off-mode

All the best with 2.5

Cheers

2 Likes

@Sara at first it felt like déjà vu yesterday - turned out we already had this chat 7 months ago

Hi @foobar,

Unfortunately, I have no updates on this for now.