CheckMK Raw Host Limit

Good Day All,

Question: Is there a hard limit for the number of hosts that a single CheckMK server can support? The documentation online states 100+ hosts, but I am looking for a more specific answer. I am moving to a new job soon, and I was told in my interview that the new place has around 900 servers.

Thanks
Steve

Hi @shandy4473,

To give a quick and simple answer: yes, a single Checkmk server can handle around 900 hosts. It’s smooth sailing for Checkmk (if you use the Enterprise edition). Just make sure that your Checkmk server has enough resources to handle the workload properly.

A good sizing guide for that is the official Checkmk appliance → The Checkmk Appliance

In case you don’t have the enterprise edition, it’s a bit more tricky. But to have the best experience, the enterprise edition will save you some tears. :wink:

If you are interested in even more extreme use cases, I can highly recommend the following talk from the last Checkmk Conference. You will learn how it’s possible to monitor 100,000+ hosts:

Hope this helps. :slight_smile:

Best Regards
Norm

Short answer: no.
I have had Raw systems with 1.5k hosts running without problems.
The more important question is how many hosts should be in one site.
Here I would say that for Raw, 1k is a soft maximum.
If you have enough hardware resources, you can define multiple sites on one machine and work with large numbers of hosts.
It all depends on the hardware resources available.
At some point it becomes worthwhile to compare the hardware/runtime costs with the licence costs of an Enterprise edition.
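
A rough sketch of that rule of thumb, for anyone who wants to play with the numbers. The function name and the constant are made up for illustration; 1,000 hosts per site is the soft maximum mentioned above, not a Checkmk setting or a hard limit, and real sizing depends on the hardware and the kind of checks.

```python
import math

# Soft maximum per Raw site quoted in this thread (a rule of thumb, not a hard limit).
SOFT_LIMIT_PER_SITE = 1000


def sites_needed(total_hosts: int, hosts_per_site: int = SOFT_LIMIT_PER_SITE) -> int:
    """Return how many sites keep every site at or below hosts_per_site."""
    if total_hosts < 0 or hosts_per_site <= 0:
        raise ValueError("total_hosts must be >= 0 and hosts_per_site must be > 0")
    return math.ceil(total_hosts / hosts_per_site)


for hosts in (900, 1500, 4000):
    print(f"{hosts} hosts: {sites_needed(hosts)} site(s)")
```

With these assumptions, the 900 servers from the original question would fit into a single site.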

Hi Steve,

I think @Norm and @andreas-doehler have already summarized it well.

Additional wrinkles are:

  • What are you monitoring? Hosts monitored using Checkmk agents need a lot fewer resources than SNMP-based hosts or active checks.
  • What is your monitoring interval? Are you on the default 1 min, or do you deviate (up or down) from that?

I recently talked to a Raw user running ~20k hosts with 500k services (almost all agent-monitored Linux machines). They are spread across ~20 sites. The larger of these sites have significantly more than our recommended ‘soft’ limit of ~1000 hosts. But those machines are also really beefy bare metal machines (much larger than even our top-of-the-line physical appliance), with 32-core CPUs and tons of memory. So this company is currently doing the calculation @andreas-doehler mentioned: whether it would be cheaper to upgrade to the Enterprise edition and get rid of 75% of these machines.
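
To put those figures into perspective, here is a back-of-the-envelope sketch. It assumes the load is spread evenly across sites and that every host runs on the default 1-minute interval, which is a simplification; the function and key names are purely illustrative.

```python
def load_profile(hosts: int, services: int, sites: int, interval_s: int = 60) -> dict:
    """Rough averages, assuming an even spread and one common check interval."""
    return {
        "hosts_per_site": hosts / sites,
        "services_per_host": services / hosts,
        # For agent-monitored hosts, one fetch per host per interval
        # feeds all of that host's services.
        "host_fetches_per_second": hosts / interval_s,
        "service_results_per_second": services / interval_s,
    }


profile = load_profile(hosts=20_000, services=500_000, sites=20)
for name, value in profile.items():
    print(f"{name}: {value:.1f}")
```

On average that is about 1,000 hosts per site and 25 services per host, or roughly 333 agent fetches and 8,333 service results per second across the whole setup. Doubling the interval halves the last two numbers, which is why the interval question matters.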

That would be a fun little talk at the conference :smiley: Any chance you can convince them to talk about it?

@gstolz I am trying for sure… for the US Conference :wink:

Maybe to continue this topic:
I have a distributed setup in the Raw version here (1 main server, and a host per datacenter to keep monitoring “locally”).

The question is, is the soft limit of 1000 also applicable for the enterprise version?

Meaning, some of my VMs have 16 vCPUs and 24 GB of memory, but I still need to add about 4k-ish hosts with about 5-10 service checks per host. I have enough resources at my disposal, but as mentioned before: is the Enterprise version worth the cost and, most important, will it solve my issue?

The implementation is mainly done with Nagios checks, and those will be migrated gently and very slowly. But the as-is migration will be Nagios checks.

The question is, is the soft limit of 1000 also applicable for the enterprise version?

No, definitely not.

The microcore is an order of magnitude more efficient. I recently worked with a customer here in the US (~20k hosts, ~400k services) who took their infrastructure from 15 beefy bare metal machines down to 3 going from CRE to CEE. Could’ve gone down to 2 just resources-wise, but then geography kicked in, as they have one DC in the US, one in Amsterdam, and one in Singapore.

In that specific case, going to the CEE almost paid for itself just by the reduced resource footprint.
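
For the curious, the density change in that example works out as follows. This is plain arithmetic on the numbers above; the function name is made up, and an even spread of hosts across machines is assumed.

```python
def consolidation(hosts: int, machines_before: int, machines_after: int) -> dict:
    """Compare average host density before and after consolidating machines."""
    return {
        "hosts_per_machine_before": hosts / machines_before,
        "hosts_per_machine_after": hosts / machines_after,
        "machine_reduction": 1 - machines_after / machines_before,
    }


result = consolidation(hosts=20_000, machines_before=15, machines_after=3)
print(f"before: {result['hosts_per_machine_before']:.0f} hosts per machine")
print(f"after:  {result['hosts_per_machine_after']:.0f} hosts per machine")
print(f"machines saved: {result['machine_reduction']:.0%}")
```

That is roughly 1,333 hosts per machine before and 6,667 after, with 80% fewer machines.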

Here comes a small “but”: if you have a highly volatile environment, then it is better to split it into some smaller sites, like one site for network, one site for servers, and one extra site for Kubernetes.
The biggest problem, also with CEE, is the config activation, and this is something you need to consider in your planning.
These smaller sites can be on the same bare metal machine without problems.

Hi Andreas and Elias. The environment is mainly VMware and hardware devices. The idea is to use other monitoring products for Kubernetes/cloud/… (company decision), so it is pretty ‘stable’.
Config activation via the web is a struggle, reaching 110-ish seconds. I ran into that timeout multiple times, but it seems from other forum posts that you can extend this value in the CEE version too. The number of hosts was the culprit here.

SNMP timeouts on storage devices are another issue we run into, next to the number of hosts.
So all these issues seem to be resolvable with the CEE version, but my main concern was the number of VMs configured in the system. Testing it in my distributed setup was not very handy, hence my question here on the forum.

Many thanks for all the feedback so far! I’ll have a chat internally and reach out via the correct channels.

br

Then you already have a configuration problem in your site. Some of my bigger sites (around 2k hosts) don’t need more than 30-40 seconds for config activation.

No - if you have real SNMP problems, then you will have these problems with CEE as well.
Inside CEE, the SNMP queries are done with better performance for the monitoring host. But if the answer from the queried device is too slow, then it will be slow with CEE too.

Hi Andreas (and Elias of course), thanks for taking the time to respond. Very much appreciated!

Regarding the config problem: no clue where I might have screwed up the config.
2k hosts with the Nagios backend is indeed doable.
My main site and the largest distributed site (which needs to go to 4-5k) are in the same network/location, so no clue where my config issue might be. Anyhow, I already discussed it internally, and since it seems we can find some budget, we will refactor to CEE and probably involve support to go over the config and provide some feedback/best practices. Providing the entire config here seems a bit too much (the topic was also about the Raw host limit and I don’t want to hijack it).

About the SNMP part:
The main issue is SNMP with Brocade switches. They are not the fastest, but certainly not slow either.
Our engineering department designed everything in the most cost-efficient way. Some of the switches have about 10 Virtual Fabrics. That means 500+ service checks for 1 host (which I can imagine is a lot).
I fixed this for the moment by splitting them up into 2-3 VFs per host. This works perfectly.
Once CEE is in place too, we can probably club a bit more together.
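
A minimal sketch of that kind of split, in case it helps someone with a similar setup. The fabric IDs and the even spread of about 50 services per Virtual Fabric are assumptions for illustration; this only shows the grouping arithmetic and does not touch any Checkmk configuration.

```python
def split_fabrics(fabric_ids: list, per_host: int) -> list:
    """Group fabric IDs into chunks of at most per_host, one chunk per monitored host."""
    if per_host <= 0:
        raise ValueError("per_host must be > 0")
    return [fabric_ids[i:i + per_host] for i in range(0, len(fabric_ids), per_host)]


fabrics = list(range(1, 11))       # 10 hypothetical Virtual Fabric IDs
services_per_fabric = 500 // 10    # assumes services are spread evenly across fabrics

groups = split_fabrics(fabrics, per_host=3)
for number, group in enumerate(groups, start=1):
    print(f"host {number}: VFs {group}, ~{len(group) * services_per_fabric} services")
```

With 3 VFs per host, one switch becomes 4 monitored hosts with at most about 150 services each, instead of one host with 500+.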

Many thanks already for all the replies and insights on the limits.

BR

What a relief, that Enterprise version. :smiley:

A VM with 16 vCPUs went from at least 80-90% in use (with about 900 hosts, max 3-5 checks per host) to, and I’m not lying, 4% CPU load.
Now let’s follow the proper channels…

Many thanks again!!!
