Voicing your questions about Checkmk Development

mike1098 · January 13, 2023, 1:43pm

We have all support options and still do the QA for each release…
Even more clever.

foobar · January 13, 2023, 9:07pm

I would say that right now every co-worker on our team spends about 20% of their time debugging and reporting bugs, opening feature requests (because “it’s not a bug”), and finding workarounds related to CMK. So that’s about one (1!) headcount out of 5 people.
And @mike1098 is absolutely right: that is time missing from daily business and projects, and reading between the lines of the posts in the forum, it looks like we are not the only ones spending so much time on this.

foobar · January 13, 2023, 9:12pm

huge upvote for this!

LaSoe · January 16, 2023, 11:11am

tosch · January 16, 2023, 4:19pm

Is there already a feature request? We should vote for it!

foobar · January 17, 2023, 10:27am

MKPs

This is another topic where we think it’s moving in the wrong direction.

Tribe29 announced at the last conference that they plan to free up more resources for core development, and that customers should therefore go to the business partners for check development in the future. We understand the intention behind it and don’t think it’s a bad idea in principle. However, we think that MKPs are the wrong way to achieve this goal in the long term:

developer guidelines
Many details are still missing in the guidelines for coding check plug-ins, for example the whole inventory part. For an experienced developer who reads the git commits and can reverse engineer the code, this may work. But for an ordinary customer/programmer, it remains a challenge even after attending your course “Programming extensions for Checkmk”. In many places check developers depend on the way Tribe29 implemented general functions (cluster, for example), but you don’t get support if they don’t work as expected or when important functions are missing.

Partners who develop the same things without knowing about each other (same as without bug tracking, where several of your customers open and report the same bug)
Now that customers are forced to go to a partner to get new checks, it’s also clear that most customers don’t have 2-3 partners, and that partners don’t talk to each other about everything.
This leads to partners developing the same check for different customers at the same time, and at least one of them is “wasting” time. Sure, from the partners’ perspective everything is fine: as written above, more revenue, which is good. But for the community and for customers it’s bad.

the time of one partner could be used to develop something else that doesn’t exist yet
it seems that not all partners release their MKPs (for different and understandable reasons), but again that’s bad for the community and for CMK itself, as they are not available to the public

Not all customers have a partner they trust and not all partners have the resources or are willing to accept new customers only to develop new checks.
When different customers talk to different partners, you end up with a lot of individual solutions fitting only the need of one customer and no common development.

MKPs and CMK Updates
Since you have sent us many, many MKPs in the past, for example because it was possible to extract a feature we couldn’t wait another 9 months for, we also know very well the problems MKPs can come with and, over time, WILL bring.

1. code changes
If Tribe29 no longer has the checks under its wings, you can no longer evaluate the impact of your changes, and you lose the connection to your customers and what they really need.
It will also be extremely burdensome for all your customers to test all MKPs before implementing any patch release, because you can no longer ensure that everything works as expected.
This will cost some of your customers months of preparation before a major release, and a lot of extra time before a patch release.
And again, the number of MKPs is expected to grow in the future due to the changes regarding feature requests.
2. losing track of whether an MKP is still needed
Sometimes you receive a patch as an MKP, but it’s still not implemented in the next Checkmk patch release, so you need to track it somehow.
The same goes for major releases, where you might have received a workaround for the current release, for whatever reason, but you have no clue whether it’s implemented or still needed.
Again, this means a huge amount of our time, and some of yours as well, in terms of communication and testing. The same applies to the workaround MKPs we sometimes received for the current release.
We as customers never know whether it will still work properly or as expected when the next major is released, because you didn’t have it on your radar when working on the new release, and there seems to be no process for notifying the customers you developed and sent MKPs to when something related to them changes.
3. CMK Support
As a paying Enterprise customer, it’s important for us that we have full support for everything, including the QA. We are counting on you to make sure that everything works after a new release, and if not, that you take care of it.
Who takes over support for all the MKPs that stop working after an upgrade?
4. 3rd-party developers and Enterprise customers
This is related to point 3.
Many are not allowed to install 3rd-party “addons”, mostly because of the lack of support. So if you install 30 MKPs from 20 different people, you can imagine where this ends up in terms of support, major upgrades, or changing “developer guidelines”.
It’s the same with other “self-written” code/projects: people move on, get a different job, and no longer maintain their stuff, which most of the time was written in their free time anyhow.
5. MKPs vs. CMK
Sometimes it starts with an MKP and, by coincidence, CMK develops the same thing, with a similar or even the same name.
Problems
5.1. Same name, but not the same features
Speaking from our experience and that of some of our customers, the problem is that parameters/details are often missing. On the one hand it’s clear you want to use what’s fully supported by CMK, but on the other hand you can’t wait another half a year or even longer to see IF your feature request to add the missing “features” gets developed.
5.2. Amount of time to spend
For merging/reworking your own checks.

Don’t get me wrong, the idea of MKPs in general is great - really! It’s a way to include the community and to easily extend Checkmk with simple things.
BUT I believe a majority of people would love to have more of them included and supported by CMK, for the many reasons written above. It’s clear that it’s a lot of extra work and nearly impossible to add and support all of them. On the other hand, it would also create more jobs? We had the discussion about code quality in the past. Proper developer skills (from the users’ perspective) and proper developer guidelines will for sure help to reduce the time needed to align with the main branch. We did this with you in the past, and you were surprised by the code quality and even adopted some of the unit-testing ideas our developer had.
Maybe there is a way to meet in the middle. Let’s say, picking 1-3 MKPs per month to include in the main release?
Just braindumping - they could be the top voted (but please not again like the feature portal) or the top downloaded, for example. That would reduce the number of MKPs by 12+ per year.

And if you add a better way for partners as well, who should know more about code and development guidelines anyhow, then we could add a huge number of new checks, at least, and get back the same dynamics that made CMK so great in the first place!
And with the right tools, partners could work directly on bug tickets opened by customers not related to them. Of course this would imply that partners also have an interest in bug fixing and in further developing their checks (this could be written into new contracts). But I think at least some of them, from what can be seen on GitHub, could fall into that category.

One idea could be to certify Partners as CMK Code Developer?

goal for Partners to get the status
more revenue for partners
it ensures they have the knowledge to develop fully under your guidelines, which cuts down the time you have to invest (presupposing there are developer guidelines without any gaps)

And it’s not only partners. A huge boost for the success of CMK came in the first years, when bits and pieces were added from every corner of the internet. Family - everyone contributed here and there to CMK and made it this huge conglomeration of checks - which was great!
We fully understand that in terms of QA and maintaining code it’s a huge problem, but there are ways to handle and deal with it. You might not even know about it, but chances are high that some of your customers with more than 20k employees have faced it already.

LaSoe · January 18, 2023, 3:56pm

Anders · January 18, 2023, 6:15pm

I have only been using Checkmk for a few years, starting with 1.6, but some things that I find really strange are:

Checkmk monitoring of Checkmk
A monitoring platform that can’t monitor itself and take action is strange. Have you ever heard of a SQL server not being able to show issues?
This is a particular concern for running some kind of H/A setup. If I disable Apache or one of the services in Checkmk, not a single alert is triggered and no services go CRIT. Even the OMD Performance checks could show 100% helper usage and nothing…
The built-in “internal” dashboards in Checkmk 2.1 are a joke. Most of the time they don’t even work, as they rely on the Checkmk agent. Fair, but if the agent stops, the site still runs (in a distributed setup).

Not all customers are monitoring 5 hosts
What counts as a trivial change is decided solely by Tribe. As an example, let’s say you need to re-create or change 1000 rules because you have 100.000 hosts. That’s not a trivial change: it has to pass change management, get approval, be done during a maintenance window, etc. There are many examples. If you restart OMD and the SW/HW inventory starts, that’s not a problem on 5 hosts, but a huge problem on 100.000 hosts.
This also relates to the way Checkmk works on the filesystem. In our case SSD drives are too slow. You might have to read that twice. It’s due to the structure of hosts.mk, rules.mk and other files that are constantly read and written.

CI/QA
We had to set up our own CI/CD pipeline to automatically build and test an environment whenever there is a new Checkmk version, as we know it’s not “Stable” (using that word for Checkmk is just a joke).

One of my best examples is the new REST API. WERK after WERK states that the API returns HTTP 500. That would be fine for beta releases, but these are PRODUCTION stable releases, and Tribe has not even tested the API. (It’s also funny that all docs say it’s “versioned” according to REST API best practices. Well, since the first 2.0 release the API has been at “API version” 1.0 …) We can’t upgrade to 2.x until this is fixed.

There are also a lot of things working strangely that should work OOB. For example, there have been issues with how something as trivial as memory consumption is presented. Yes, it was incorrect.

Tribe are so proud to say they have support for 2000 devices or something like that, but perhaps none of these have been tested before a new release. I actually expect that if Tribe says they are supported. Otherwise remove them and add them to Checkmk Exchange with no guarantee.

Security
I don’t want to go into too much detail, but Tribe is missing some fundamentals of how you run security in today’s connected world. You don’t issue self-signed certificates for something as critical as monitoring, and if you do, you don’t let them expire in 1000 years (yes, a thousand years).

You also don’t FORCE your users to embed certificates in the agent! You rely on the OS to handle certificate chains. You also ensure the certificate process is consistent. You don’t have one process for the agent controller and another one for the Agent Bakery with automatic updates.

If you allow the user to run the agent as non-root (which should be the default), you provide a process that allows the Agent Bakery to update the agent (this is not possible today). It’s also not possible to change the user on Windows (only on Linux).

Local Checks / MRPE / Classical / Nagios
We expect that all types of checks can only be installed by adding extensions. LCM for these is a nightmare, and as you can do practically what you want under ~/local, there is no audit trail of who did what. There is also no way to identify which service names a check might produce.
You can write your checks in any language. I have written checks in bash, PowerShell, PHP, Python and Perl. That is of course a great advantage from a developer’s point of view, but a nightmare from an LCM point of view.

Roadmaps
I would like to see more frequent updates on roadmaps, and insights into roadmaps, just like most software companies provide, especially for partners and key customers.

I think Checkmk has made really great progress with 2.x, and I feel that the changes from 2.0 to 2.1 were more significant than the ones from 1.6 to 2.0. There are a lot of new features and I’m looking forward to 2.2.

mike1098 · January 19, 2023, 10:11am

Maybe I misunderstood you, but basically Checkmk monitors itself out of the box. I recommend opening a dedicated thread to discuss this. Possibly the community can help you find a solution.

I agree to that.

In several domains we are badly missing CI/QA. Testing the code is a real burden for us.

The attempt to make the agent highly secure failed because it introduced other security risks.
See also:

At least because of that we have to go on with the legacy agent. Hopefully it will be supported in future versions.

It depends on what your change management process and your organization look like. We have segregated duties between monitoring architecture/development and the monitoring run team, and with the help of change management we have every deployment documented. I can understand that this may be different for different customers. If you want to discuss this in detail, I recommend a dedicated thread.

Yes that’s the part about transparency.

Sara · January 19, 2023, 12:43pm

Thank you for all your feedback, we appreciate it.

There are quite some points to think about and analyse. I will discuss them in the respective teams, and we will come up with plans to improve upon the feedback.

I will share next steps with you by end of next week.

Sara

foobar · January 19, 2023, 3:22pm

Feature Portal

@sara as you asked for it

1. Votes disappear

It happened many times, and we are not the only ones who noticed. You end up having 40 votes on a feature request and then suddenly you go down to 29. It didn’t happen only once; after we noticed, we tracked it. It seems to be mostly on Wednesdays that the magic happens.
The disappearance of votes is like saying “our own customer survey showed something different”. Very non-transparent.

2. Votes not counting:

2.1 from the same IP not counting

Of course related to 1., but it shows a big problem. Let’s assume we have 500 users (internal only), and 50 of them try to be involved, asking for and demanding changes/feature requests to ensure a smoother, more stable and more automated daily business while working with CMK. So, 50 people who use CMK heavily and would vote for their features.

To quote the company behind the voting portal:

We throttle voting by IP address combined with the “User-Agent” text sent by your browser with each HTTP request. You may be surprised to hear that we allow multiple votes per IP address. The limit is very low. We allow this so that a person can ask one or two relatives, friends, or colleagues to vote up their submission. This lets the user feel they are gaming the system while in fact, they are having little impact on the final vote.

I like the last sentence especially.

So imagine we are a company with over 500 active CMK users, with dozens of departments and only three of them can vote or would count?

As a CME user, we have, in addition to that, dozens of customers who also request a lot of the features we create requests for. The same goes here as well: only 3 votes, no matter whether we have 50 customers?
They are buying monitoring from us, expecting it to work, and asking us to make sure more features are implemented – should we really bring their attention to the portal and even to the forum?
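
The per-client throttling quoted in 2.1 can be sketched roughly as follows. This is only an illustration of the described behaviour, not the portal vendor’s code: the limit of 3 is the figure assumed in this post (the vendor only says “very low”), and `SuggestionVotes`, `simulate_office`, the IP address and the User-Agent string are made-up names and values.

```python
from collections import Counter

# Assumption: the vendor only calls the limit "very low"; 3 is the number
# used in this thread, not a documented value.
MAX_VOTES_PER_CLIENT = 3


class SuggestionVotes:
    """Counts votes for one suggestion, throttled per (IP, User-Agent) pair."""

    def __init__(self, limit=MAX_VOTES_PER_CLIENT):
        self.limit = limit
        self.per_client = Counter()  # (ip, user_agent) -> counted votes
        self.counted = 0

    def vote(self, ip, user_agent):
        """Return True if the vote counted, False if it was silently ignored."""
        key = (ip, user_agent)
        if self.per_client[key] >= self.limit:
            return False
        self.per_client[key] += 1
        self.counted += 1
        return True


def simulate_office(voters):
    """All voters sit behind one NAT address and use the same browser build."""
    votes = SuggestionVotes()
    for _ in range(voters):
        votes.vote("203.0.113.10", "Mozilla/5.0 (corporate build)")
    return votes.counted
```

Under these assumptions, `simulate_office(50)` counts only 3 of the 50 votes, which is the effect described above for a company voting from a single office network.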

2.2 “Bulk voting”

Imagine you are in a meeting, presenting the feature portal to your coworkers as a way for us as a company to have a “chance” to get a feature we desperately need.
Now everyone opens up the website and starts voting - but hey, there is a limit, as written by the company behind it.

We throttle voting by suggestion. If a suggestion is receiving far more votes in a short timespan than is normal, it is probably manipulation. So we will silently ignore votes. Our throttling has several levels, including per minute, per hour, and per day limits.

Yeah, time wasted again for all involved.

And you only find out if you start questioning what you noticed over weeks - see 1.
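
The per-suggestion throttling quoted in 2.2 could look something like the sketch below. The vendor names the levels (per minute, per hour, per day) but not the numbers, so the limits here are purely hypothetical, as are the names `SuggestionThrottle` and `meeting`.

```python
from collections import deque

# Hypothetical limits: (window in seconds, max counted votes in that window).
LIMITS = ((60, 5), (3600, 20), (86400, 50))


class SuggestionThrottle:
    """Silently ignores votes once any time window has reached its limit."""

    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.accepted = deque()  # timestamps of counted votes, oldest first

    def vote(self, now):
        """Return True if the vote at time `now` (seconds) is counted."""
        longest = max(window for window, _ in self.limits)
        # Forget votes that are older than the longest window.
        while self.accepted and now - self.accepted[0] >= longest:
            self.accepted.popleft()
        for window, max_votes in self.limits:
            recent = sum(1 for t in self.accepted if now - t < window)
            if recent >= max_votes:
                return False  # ignored without any feedback to the voter
        self.accepted.append(now)
        return True


def meeting(voters, seconds_apart=1.0):
    """Everyone in the room votes for the same suggestion, one after another."""
    throttle = SuggestionThrottle()
    return sum(throttle.vote(i * seconds_apart) for i in range(voters))
```

With these made-up limits, 30 colleagues voting within half a minute leave 5 counted votes, and even spreading the votes one minute apart only gets 20 through before the hourly limit kicks in.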

3. Balance/value between votes from Enterprise Customers vs. RAW Edition

Related to 1. & 2.

The point was already made directly after the conference – I think it was @mike1098 in the community call

Paying customers, who are making sure CMK keeps alive and will move on in the future, have the same impact on votes as the ones “getting it for free”.
From a paying customer perspective, this does not feel fair.
Some more examples

Enterprise vs. RAW
DAX (Deutscher Aktienindex) / SMI (Swiss Market Index) companies vs. Handelsschule Kirchhellen (no offence!)
Heavy User (1mio+ Services) vs small User (5k Services)
Infrastructure driven vs. Application driven

To name another example, as there have been a couple of UX topics lately:
A customer with 10 users has the same weight as a customer with 500 users, where a GUI improvement has a much greater impact.

4. Fake / manipulated votes

Also mentioned and tested in the forum before, that it’s possible to manipulate the votes – Thanks @simonm

So how can we really rely on the outcome of these votes? Furthermore, as tribe29 duly noted half a year ago:

“this is a problem we will have to think of and should solve.”

But where are we standing? When will it be solved? Is it on the Roadmap?

5. expectations vs. reality

Mentioned at the conference, and also clearly by tribe29’s PM.

“Ensure we work on features most important for the entire customer base”
“top voted = “candidate” for next release”

5.1. deciding what’s going to be implemented next

Of course it’s your right to decide what to implement next, but it also creates an illusion for many of us.
Let’s take the first page of the Feature Portal: 50 features, all with 30+ votes, but only one implemented and only 6 scheduled.

12% = planned
2% = implemented
86% = waiting feature requests by customers
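
The percentages above follow directly from the counts on the first page. A quick check, where the variable names are only illustrative:

```python
def share(part, total):
    """Percentage of `part` in `total`, rounded to a whole number."""
    return round(100 * part / total)


TOTAL = 50        # feature requests on the first page
planned = 6       # scheduled
implemented = 1
waiting = TOTAL - planned - implemented

breakdown = {
    "planned": share(planned, TOTAL),
    "implemented": share(implemented, TOTAL),
    "waiting": share(waiting, TOTAL),
}
```

The three shares come out as 12, 2 and 86 and add up to 100.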

For a customer, it does not really feel like “Ensure we work on features most important for the entire customer base” comes to life.
So to quote tribe29’s PM:

From my experience, there will be a long tail of request, which did get a single vote from the creator. Those will be closed. After 2 years, requests with 2-3 votes will be closed, or votes from a single company. It is all about keeping the portal healthy and useful

This sounds more like we are going to have a LONG tail of highly voted and needed features, where the community, which is willing to help you with these features (and many even to pay) to make CMK even greater, will wait years for implementation, if it happens at all.
(Side note: imagine you want to move on with your monitoring. What do you do if you can’t accomplish it with the tool you prefer? You start working with the ones who can help you solve the problems in your daily business. That is especially hard when you know it would all be possible with CMK. It still has the potential to get close to an “Eierlegende Wollmilchsau” of monitoring tools.)

5.2 fully understand the need of a feature request.

“Enable us to interact with customers to fully understand the underlaying need”
“PM clarifies requests”

14/50 (28%) of the feature requests show interactions with customers. I only considered the first page of the feature requests, and it gets worse the fewer votes a feature has (this matches 6.). Correct me if I’m wrong, but most likely that’s because you are not even considering them? The patterns are there.

6. Human interaction vs. features to be recognized and upvoted.

Also recognized by tribe29 about half a year ago.

Also the interface of the portal is also quite influential, e.g. in the default view the highest voted suggestions are listed at the top. Guess what gets the most attention ;-).

Plenty of books have been written about human interaction and how simple we actually are, and the same goes for our browsing behavior. There are plenty of papers about it, but to keep it short: nearly nobody looks at search result “page 2” on Google. The same goes for the feature portal, as the number of features keeps growing and growing. 1 implemented vs. 40+ new.

To quote your PM:

After 2 years, requests with 2-3 votes will be closed, or votes from a single company.
It is all about keeping the portal healthy and useful

If there are no people actively checking the newest features and upvoting them so that they end up on the “trending” page, they will just vanish among hundreds of other, sometimes really creative, feature requests on the last pages, with nearly no chance of being upvoted.
With 2-3 pages, not a problem, but with 12 pages already?

7. Time consuming

Who has the time for that? As mentioned already in this thread, it feels like we are already doing a lot of QA and losing a lot of time with bugs, debugging and finding workarounds. So to make sure we as a CME customer can move forward and open up new markets, we need to check all feature requests and vote for everything that will make CMK better for our needs and, in the end, for our customers? And what percentage of our feature requests actually gets implemented as a result?

8. Transparency about the process/status of feature requests

Here we jump back to the post from @PhilippL regarding the Public bug tracker and transparency.
@PhilippL also mentioned missing transparency regarding the Feature Portal. Feature requests since May with many votes are still “under consideration”
Nobody knows what that means (and feels nothing is happening). Some assumptions:

It does not fit in your Roadmap
tribe29 does not have time for that
tribe29 thinks it’s not relevant
It’s too time consuming to implement, you are focusing on a small thing (quick win)
…

Here we cross point 5.2 - first impression, nothing happening.
In fairness, IF you look into the details and actually open up the feature request, sometimes you find at least some feedback from the PM. But the status still stays the same, and often, even with an engaged PM, we don’t know much more about it: whether it’s going to be built in the first place and, IF so, in which release it will be implemented.

9. When do we know it’s the time you decide, what’s going into the next release?

This goes straight back towards point 8. - transparency
There is no “last call for votes for the 2.2 release”, and as far as I know, it seems the call for the 2.3 release has already started. We as customers have expectations and hopes that some of our feature requests will be implemented in the 2.2 release or even in the 2.3 release, but we have no feeling, understanding or insight into how, when, and IF - nobody knows.

10. Reach - How many customers are using it really?

Isn’t it more or less similar to the forum, where just a minority of your customer base is actually using it? I’m not sure, was there a newsletter advertising it? If so, super! because you will reach definitely more.

Let’s assume we had the conference where you announced it with 500 customers (on site/stream) and then the Community Call with about 300 views on YT (so let’s assume 50 who hadn’t heard about it before). Then we add some extra for the forum, so let’s say you reached 800. For many feature requests we then maybe have the same problem as with the “deprecation of service tags”, which “nobody” uses? Just some thoughts to think about.

For example: https://features.checkmk.com/suggestions/299422/show-netapp-volume-efficiency#comment472115

I heard there is already a partner who developed this, long before the feature request was made, but nobody knows – it’s not available to the public.
So imagine how much interaction from your customers you are obviously missing in the feature portal.
Possible in this case:

Not one of those who are using it was willing to share the information
They don’t even know/care about the feature portal and the request itself

From a customer perspective the Feature Portal is for sure (!) an upgrade from the forum solution “we had” before. And even if most likely not even 20% of your customer base is using it, it shows how much potential CMK has and how many fantastic feature requests users come up with to help CMK get ahead of its competitors and make it an even more sophisticated monitoring tool than it already is.

But I’m not sure what you expected from it? I already said at the conference that I expected 300 votes by the end of the year (boy, was I wrong!) and that it feels more like a marketing measure to engage customers again and make them feel part of the product, as many of us did when we saw CMK growing. So far not much has convinced us otherwise. For every implemented feature, 40+ new feature requests open up. Huge potential, clearly, but as with everything, people lose trust and interest if nothing happens.

I truly hope that “Ensure we work on features most important for the entire customer base” will become reality again in 2023!

simonm · January 23, 2023, 10:39am

Also mentioned and tested in the forum before, that it’s possible to manipulate the votes – Thanks @simonm

Without wanting to get into the rest of your post, I would like to point out that mine was not related to faking/manipulation of votes, but addressed a different problem.
I just want to be understood properly…

lars.getwan · January 24, 2023, 6:08am

Of course, you don’t pay for the automatic ticket closing. Whether credits are charged at all depends on the type of contract you have. In that case, the credit was charged for the time the support guys were working on your ticket:

1-15 minutes => 1 credit
16-30 => 2 credits
…

When the ticket is then set to resolved, you get a summary on how many credits were charged during the ticket’s lifetime.
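
The two steps listed above read like one credit per started 15 minutes of support work. A minimal sketch under that assumption: the list is cut off after 30 minutes, so continuing the pattern beyond that is a guess, and `credits_for` is a hypothetical helper, not part of any support portal API.

```python
import math

# Assumption: billing continues in 15-minute steps beyond the two listed tiers.
STEP_MINUTES = 15


def credits_for(minutes_worked):
    """Credits charged for the time support actually worked on a ticket."""
    if minutes_worked <= 0:
        return 0  # e.g. automatic ticket closing, where no time was spent
    return math.ceil(minutes_worked / STEP_MINUTES)
```

For the tiers given in the post this yields 1 credit for 1 to 15 minutes and 2 credits for 16 to 30 minutes.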

mike1098 · January 26, 2023, 2:45pm

We have been a customer since 2016, and the handling of our tickets with regard to payment, here credits, was always more than fair. If anything, Tribe29 delivered more than they charged us. I also cannot remember that we were ever charged for closing a ticket, and if so, I am sure Tribe29 is open to explain and negotiate. It’s always a matter of asking.
The only thing I can complain about is that it’s not so easy anymore to figure out which credits are consumed for which ticket. In the former customer portal it was clearly visible, but that vanished. Looking forward to having that back.

So it’s not all bad

foobar · January 26, 2023, 4:03pm

@mike1098 there was a column in the “advanced requests report” for charged credits in your support portal

mike1098 · January 27, 2023, 9:52am

This column in Jira is empty!
I only see it at the very far bottom of the ticket

It’s a real pain to browse through all the tickets to see if we are charged for something.

Please, Tribe29, bring back an overview page showing which credits have been consumed for what, as we had it initially in our

Sara · January 27, 2023, 12:05pm

Dear community members,

We have discussed your feedback within the team and would like to create a space, in which we can discuss our proposals and next steps.
We will set up a series of closed community calls in English and German, so that we all have the option to speak openly and directly with each other.

We would like to be as transparent as possible in these calls, thus these calls will not be recorded and the calls will only be open to active forum members or Checkmk customers.
Thus you will have to register for it with either your real name, email and provide your forum name, if you are a Raw Edition user.

We will of course also share a high-level summary here in the forum.
We need some time to decide on some technical details: to set up registration form and the call details. We hope to provide those early next week.

Three dates will be available, so everyone who wants to join, could already block that:
German: 13th of February, 11:00 (CET)
English: 14th of February, 15:00 (CET)
German: 16th of February, 10:00 (CET)

Sara and the team

mike1098 · January 30, 2023, 3:17pm

Besides all the criticism of Tribe29, I also want to send an appeal to all customers and partners:

If you find a bug in Checkmk, it’s OK to apply a workaround as a quick fix, but please take the time to report the bug to Tribe29 to give them a chance to fix it.

After the upgrade from 1.6 to 2.0 we ran into a couple of issues, even though we had done intense testing upfront. For at least one issue we know that other customers had the same issue and just applied a workaround but never reported back to Tribe29.
I know it’s some work to collect all the data for a ticket and help with debugging, but it’s worth it for all of us to keep the software running correctly, and you will probably be paid back by other customers whose reports protect you from faulty code.

I trust that Tribe29 handles such bug reports with the necessary care and does not just provide a workaround as well.

best regards

Michael

foobar · January 30, 2023, 3:25pm

So time for another support ticket?

bitwiz · January 30, 2023, 3:44pm

Given that there is no public bugtracker, this does not make much sense. Sending anything to feedback@ is akin to throwing something into a black hole: no one knows if or how often the same issue has already been reported there. And pretty much every time I sent something there, there was no response and also no fix. So why would I spend time creating a PoC and sending steps to reproduce there? No, I fix it in my environment and am on my way…