Voicing your questions about Checkmk Development

I have only been using Checkmk for a few years, starting with 1.6, but some things that I find really strange are:

Checkmk monitoring of Checkmk
A monitoring platform that can’t monitor itself and take action is strange. Have you ever heard of a SQL server that can’t report its own issues?
This is a particular concern if you want to run some kind of H/A setup. If I disable Apache or one of the services in Checkmk, not a single alert is triggered and no service goes CRIT. Even the OMD Performance checks could show 100% helper usage and nothing…
The built-in “internal” dashboards in Checkmk 2.1 are a joke. Most of the time they don’t even work, as they rely on the Checkmk agent. That is fair, but if the agent stops, the site will still run (in a distributed setup).
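
What I would expect out of the box can be sketched as a small local check that turns the state of each site service into its own Checkmk service. This is only a minimal sketch in Python. It assumes that `omd status --bare`, run as the site user, prints one `<service> <exit code>` pair per line plus an `OVERALL` line; that format, the `OMD` name prefix and the "anything non-zero is CRIT" mapping are my assumptions, so verify them on your own version.

```python
import subprocess

def parse_bare_status(text):
    """Parse lines of the assumed form '<service> <code>' into a dict."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            states[parts[0]] = int(parts[1])
    return states

def to_local_check_lines(states, prefix="OMD"):
    """Build local check lines of the form '<state> "<name>" - <text>'."""
    lines = []
    for name, code in sorted(states.items()):
        state = 0 if code == 0 else 2  # assumption: anything but 0 is CRIT
        text = "running" if code == 0 else f"not running (code {code})"
        lines.append(f'{state} "{prefix} {name}" - {text}')
    return lines

def main():
    # Hypothetical entry point: call this at the bottom of the script
    # when deploying it as a local check.
    result = subprocess.run(["omd", "status", "--bare"],
                            capture_output=True, text=True)
    print("\n".join(to_local_check_lines(parse_bare_status(result.stdout))))
```

The parsing is kept separate from the `omd` call so it can be tested without a running site. In a distributed setup the script would have to run somewhere that survives the monitored site going down, otherwise it has the same blind spot as the dashboards.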

Not all customers are monitoring 5 hosts
The definition of a trivial change is the sole decision of Tribe. As an example, let’s say you need to re-create or change 1000 rules because you have 100.000 hosts. That’s not a trivial change: it has to pass change management, get approval, be done during a maintenance window, etc. There are many examples. If you restart OMD and the SW/HW inventory starts, that is not a problem on 5 hosts, but a huge problem on 100.000 hosts.
This also relates to the way Checkmk works on the filesystem. In our case SSD drives are too slow. You might have to read that twice. This is due to the structure of hosts.mk, rules.mk and other files that are constantly read and written.

CI/QA
We had to set up our own CI/CD pipeline that automatically builds a test environment whenever there is a new Checkmk version, as we know it’s not “stable” (using that word for Checkmk is just a joke).

Some of my best examples are from the new REST API. WERK after WERK states that the API returns HTTP 500. Sure, that happens in beta releases, but these are PRODUCTION stable releases and Tribe has not even tested the API. (It’s also funny that all the docs say it’s “versioned” according to REST API best practices. Well, the API has been at “API version” 1.0 since the first 2.0 release …) We can’t upgrade to 2.x until this is fixed.
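
The kind of smoke test a pipeline can run against every new version looks roughly like this. It is a minimal sketch, not our actual pipeline: the base URL, the list of paths and the credentials are placeholders, and the `Bearer <user> <secret>` header is what I understand the REST API documentation to describe, so check it against your version.

```python
import urllib.error
import urllib.request

def http_status(url, user, secret, timeout=10):
    """Return the HTTP status code of a GET request, error codes included."""
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {user} {secret}",  # assumed auth scheme
        "Accept": "application/json",
    })
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

def find_server_errors(base_url, paths, fetch):
    """Return (path, status) for every path that answers with a 5xx code."""
    failures = []
    for path in paths:
        status = fetch(base_url.rstrip("/") + "/" + path.lstrip("/"))
        if 500 <= status <= 599:
            failures.append((path, status))
    return failures
```

In a pipeline you would pass something like `lambda url: http_status(url, user, secret)` as `fetch` and fail the build if the returned list is not empty. Taking `fetch` as a parameter keeps the 5xx logic testable without a live site.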

There are also a lot of things that work strangely but should work OOB. For example, there have been issues with how something as trivial as memory consumption is presented. Yes, it was incorrect.

Tribe is so proud to say they support 2000 devices or something like that, but perhaps none of these have been tested before a new release. I actually expect them to be tested if Tribe says they are supported. Otherwise, remove them and add them to Checkmk Exchange with no guarantee.

Security
I don’t want to go into too much detail, but Tribe is missing some fundamentals of how you run security in today’s connected world. You don’t issue self-signed certificates for something as critical as monitoring, and if you do, you don’t let them expire in 1000 years (yes, a thousand years).
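
Flagging such a certificate is not hard. Here is a minimal sketch in Python that takes the `notBefore` and `notAfter` strings of a certificate (in the format `ssl.cert_time_to_seconds` understands) and checks the validity period against a policy limit. The 398-day default is only a hypothetical policy value that I picked for illustration, not something Checkmk defines.

```python
import ssl

def validity_days(not_before, not_after):
    """Whole days between two timestamps like 'Jan  1 00:00:00 2020 GMT'."""
    seconds = (ssl.cert_time_to_seconds(not_after)
               - ssl.cert_time_to_seconds(not_before))
    return seconds // 86400

def is_overlong(not_before, not_after, max_days=398):
    """True if the validity period exceeds the (hypothetical) policy limit."""
    return validity_days(not_before, not_after) > max_days
```

How you obtain the two timestamps (from the certificate file or from a live connection) is left out on purpose, as it depends on where the certificate lives in your setup.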

You also don’t FORCE your users to embed certificates in the agent! You rely on the OS to handle certificate chains. You also ensure the certificate process is consistent. You don’t have one process for the agent controller and another one for the Agent Bakery with automatic updates.

If you allow the user to run the agent as non-root (which should be the default), you need a process that allows the Agent Bakery to update the agent (this is not possible today). It’s also not possible to change the user on Windows (only on Linux).

Local Checks / MRPE / Classical / Nagios
We expect that all types of checks can only be installed by adding extensions. LCM on these is a nightmare, and as you can do practically what you want under ~/local, there is no audit trail of who did what. There is also no way to identify what service names a check might produce.
You can write your checks in any language. I have written checks in Bash, PowerShell, PHP, Python and Perl. That is of course a great advantage from a developer’s point of view, but a nightmare from an LCM point of view.
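
As a workaround for the service name problem, you can at least parse what a local check prints. This is a minimal sketch assuming the usual local check line layout of `<state> <service name> <metrics> <text>`, with the name optionally quoted. The function name is made up for illustration, and it only sees names the script happens to print on that particular run.

```python
import re

# State (0-3 or P), then a service name that is double-quoted,
# single-quoted, or a single bare word.
_LINE = re.compile(
    r"""^[0-3P]\s+(?:"(?P<dq>[^"]+)"|'(?P<sq>[^']+)'|(?P<bare>\S+))(?:\s|$)"""
)

def service_names(local_output):
    """Return the service names found in local check output, in order."""
    names = []
    for line in local_output.splitlines():
        match = _LINE.match(line)
        if match:
            names.append(match.group("dq") or match.group("sq")
                         or match.group("bare"))
    return names
```

Feeding it the captured output of every script in the local check directory gives a rough inventory, but it does not solve the missing audit trail.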

Roadmaps
I would like to see more frequent updates on roadmaps, and insights into roadmaps just like most software companies provide, especially for partners and key customers.


I think Checkmk has made really great progress with 2.x, and I feel that the changes from 2.0 to 2.1 were more significant than the ones from 1.6 to 2.0. There are a lot of new features and I’m looking forward to 2.2.
