Basic Checkmk monitoring with ESXi and VMs

fig_wright · June 23, 2025, 1:29pm

Hi all, new user here.

I am trying to use Checkmk to do very basic system monitoring of a handful of ESXi servers with a couple of dozen VMs on. All I want to monitor is average CPU and network usage, so I can set alerts for unusually high levels - e.g. CPU >95% or >10GB network usage in a period of 12 hours.
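For reference, a budget like "10GB in 12 hours" can be turned into an average throughput, which is the form bandwidth levels are usually given in. This is only a rough sketch: it assumes decimal gigabytes (10^9 bytes) and perfectly even traffic, and the function name is made up for illustration.

```python
from typing import Tuple


def average_rate(total_bytes: float, period_hours: float) -> Tuple[float, float]:
    """Return the average rate as (bytes per second, megabits per second)."""
    seconds = period_hours * 3600
    bytes_per_s = total_bytes / seconds
    mbit_per_s = bytes_per_s * 8 / 1_000_000
    return bytes_per_s, mbit_per_s


# 10 GB (decimal gigabytes, an assumption) spread evenly over 12 hours
bytes_per_s, mbit_per_s = average_rate(10 * 10**9, 12)
print(f"{bytes_per_s:.0f} B/s, about {mbit_per_s:.2f} Mbit/s")
```

So the budget works out to an average of roughly 231 kB/s, or about 1.85 Mbit/s, sustained over the whole 12 hours.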

I’m finding the web interface quite difficult to understand and more complicated than I was expecting. I have found the following two guides:

Using them I have added an ESXi server, created a connection rule, and connected successfully. I now see in the services: my VMs on the server, and some other stuff like filesystems and “Object count”. The two guides above then go into details on “piggy-backing” all the VMs, which it isn’t clear that I need if all I want to measure is VM CPU and network use…

Can someone please tell me:

1. Do I need to do all the piggy-backing and set up the list of VMs just to monitor CPU and network use of VMs?
2. Can anyone point to a guide that will show me how to set up CPU and network monitoring and simple alerts based on them? Do I need to install Checkmk plugins?

paulosantanabr · June 23, 2025, 6:21pm

Create a host in Checkmk with the same name as one of the VMs that you want to monitor, then run the discovery and see if the data appears. To get more precise data you might need to install the Checkmk agent on the VMs that you want to monitor.

fig_wright · June 24, 2025, 1:44pm

Actually, I’ve just realised that Dynamic host management is not available in Checkmk Raw, only in the Enterprise version anyway. It’s not a big deal to create the VMs manually, as I only have a dozen. If I manually create a host with the name of a VM, it does detect it and shows e.g. the following services being monitored:

ESX CPU
ESX Datastores
ESX Guest Tools
ESX Heartbeat
ESX Hostsystem
ESX Memory
ESX Mounted Devices
ESX Name
ESX Snapshots

I don’t see network of any kind in there, but there is CPU. How do I create an alert if CPU is over 95% for 12 hours?

fig_wright · June 24, 2025, 1:47pm

Also, I notice all the VMs seem to have the following error:

[API/agent]: host configuration requires a datasource but none configure

I can’t see anywhere to add this “datasource” ?

andreas-doehler · June 24, 2025, 8:31pm

For the data source error you need to set the host to “no agent” inside the host configuration.

Network statistics are not available for individual VMs, only for ESX hosts.

For the CPU notification, first create a rule that goes critical at 95%, then create a notification delay rule for the desired time. A notification is then only sent if the service stays WARN/CRIT for the time selected in the rule.
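To make the delay behaviour concrete, here is a small Python model of it. It is only a sketch of the idea, not how Checkmk implements it: the function name, the hourly samples and the state strings are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

PROBLEM_STATES = {"WARN", "CRIT"}


def first_notification(
    samples: Iterable[Tuple[datetime, str]], delay: timedelta
) -> Optional[datetime]:
    """Return the time of the first sample at which a delayed notification
    would be due, or None if the service never stays bad for long enough."""
    problem_since: Optional[datetime] = None
    for timestamp, state in samples:
        if state not in PROBLEM_STATES:
            problem_since = None  # a recovery resets the timer
            continue
        if problem_since is None:
            problem_since = timestamp  # start of a new problem period
        if timestamp - problem_since >= delay:
            return timestamp
    return None


start = datetime(2025, 6, 24, 0, 0)
hourly = [start + timedelta(hours=h) for h in range(20)]
# CRIT for 5 hours, one OK sample, then CRIT for the remaining 14 samples
states = ["CRIT"] * 5 + ["OK"] + ["CRIT"] * 14
print(first_notification(zip(hourly, states), timedelta(hours=12)))
```

In this run the first five CRIT hours never trigger anything, because the OK sample at 05:00 resets the timer. The second problem period starts at 06:00, so the notification only becomes due at 18:00.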

fig_wright · June 25, 2025, 9:31am

Thanks for that info! I’ve got rid of the data source error now. It’s really not clear in the instructions that ESXi VMs should have no API or agent set.

I’ve set a CPU notification, but am not getting any emails now. I can see the rule is being hit because in the event dashboard I can see “Spooled mail to local mail transmission agent”. After a while I finally added our mail relay host to the Docker setup (amazingly, this isn’t specified in the Docker setup guide, “Installation as a Docker container”) with the option
-e MAIL_RELAY_HOST='mailrelay.mydomain.com' (obvs putting our domain in) but I’m still not getting the email. Any thoughts?

It seems strange to me that this isn’t an internal setting from the interface, but whatever. Is there somewhere I can test the email? All the tests I can find seem to be testing external smtp services, which isn’t my issue.

Edit2: Ignore the above quote, the email just took a long time to arrive and ended up in spam

fig_wright · June 25, 2025, 10:47am

Okay, I’m trying to set up a CPU rule. I have both Linux and Windows VMs. Rules that look relevant are:
CPU utilization for appliances
CPU utilization
CPU utilization on Linux/Unix

Which one is applicable for Windows? I’m surprised that the CPU use rule is OS specific??

I can see “ESX CPU” is detected as a service, but it’s not obvious if I’m meant to do something with that, and there are no metrics. Is there somewhere I can see the instantaneous current CPU values?

Edit: I found the metrics - you have to go to “Service Search” and apply a filter with nothing in to see them all. I still don’t see anything to do with network usage on the ESXi servers though…

fig_wright · June 26, 2025, 10:55am

Unfortunately I am hitting a roadblock here, despite many hours of searching and reading. Although I found this excellent guide for setup ( https://www.youtube.com/watch?v=RHJpDpK2ACE ) there are few guides on how to set up actual monitoring and notification.

In the detected services of an ESXi server I am expecting to see e.g. “CPU utilization”, “Memory”, and several network interfaces, but they are not present. I see filesystems, but excluding filesystems and VMs I only see 6 services: Check_MK, Check_MK Discovery, ESX Snapshots Summary, HostSystem, Object count, System Time. All examples show many more than this! Why might this be?

When I look at the dashboard at Monitor => Applications => vSphere VMs, the CPU utilization and ESX Memory columns are blank. Why is this? Is it because those services don’t exist in my list above?

On the VMs I get the following list of services: Check_MK, Check_MK Discovery, ESX CPU, ESX Datastores, ESX Guest Tools, ESX Heartbeat, ESX Hostsystem, ESX Memory, ESX Mounted Devices, ESX Name, ESX Snapshots, Object count, VM. This seems a decent list, and both ESX CPU and ESX Memory contain performance data I can hover over in the service screen. I’d like to display them better though.

I followed the guide here (Building a dashboard for vSphere monitoring in Checkmk) to create a new view with “CPU” and “Memory” of the ESXi host, but those extra added columns are also always blank. I did the same for the VMs view - also blank. Looking in the VM service list I see they are called “ESX CPU” and “ESX Memory”, but setting the columns to those names is blank in the VMs view as well. Why?

I am beginning to think that checkmk is just not detecting the expected data. The servers are Dell PowerEdge R640 and 630 running ESXi 7.0.3 so they should be good. The guide says “dashboarding is also part of the Checkmk Raw Edition (CRE), but some features like certain dashboard elements are only included in the CEE” - are performance details only in the CEE? Surely not?

I am suspecting that this is why my notifications are not working - because the underlying metrics don’t actually exist.

Unfortunately I am running out of time to get checkmk running now, having spent 3 days on this. I will shortly have to abandon and move to another monitoring software…

AutoJunkie · July 3, 2025, 5:17pm

Checkmk is not something you can learn, let alone optimize in 3 days. I’ve been using the raw edition for about 2 years to monitor 800 hosts (windows/linux/juniper/apc) with over 24k associated services and I’m still learning things (like how to build mobile-friendly views).

To answer your #1, yes, you will need to install the Checkmk Agent (and any necessary plugins) in order for you to see detailed info INSIDE a VM. Both ESX and VM guest OS monitoring have to be set up for a full picture of the entire stack. Look at it this way: are you able to see what processes the VM is running from vCenter? Vice versa, can you see how many snapshots have been taken from within the guest OS?

Here is an example of one of my VMs, with the metrics you are interested in highlighted. With the Checkmk agent installed, I can see network interface, CPU utilization, and processor queue (and a whole lot more - with certain plugins):

Opening the CPU utilization service reveals details that you may be able to set thresholds for, and the tracking of metrics/trends over time will be useful for troubleshooting. You cannot see this level of detail from vCenter (e.g. ESX CPU):

From there, you can set up one of the CPU service monitoring rules to set thresholds for warning and critical events. These can be scoped by folder, tag, specific hosts, etc.:

As for the ESX hosts themselves, did you set up your VM agent rule for hosts (not vCenter) similar to this:

If so then your ESX host page should look like this:

Which lets me build a view (not dashboard) like this for my ESX hosts:

fig_wright · July 8, 2025, 3:20pm

Thank you very much for this detailed reply! There’s a lot there, but particularly this section:

If I go to [Setup > Hosts > Host monitoring rules > Hosts to be monitored] I don’t see a rule anything like that; instead I see a default rule with conditions: “Host tag: Criticality is not Do not monitor this host”. I didn’t add that rule. How do I add a rule like the one you have? Do you have a link to a how-to?

You have a lot more detected services in your ESXi server, which I was expecting to see but do not. There isn’t an agent to install on the ESXi server is there? Or is there an ESXi plugin?

AutoJunkie · July 10, 2025, 3:07pm

You have to go to Setup > Agents > VM, Cloud, Container. From there, you should see a VMware ESX via vSphere ruleset. Click on that and create two rules, one for the ESX Hosts and one for vCenters:

The two guides you linked in your original post are the best how-to resources, but you’ll have to adjust those to your environment and edition (raw vs. enterprise).

No, there is no separate agent or plugin for the ESX servers. Checkmk just queries the native API provided by VMware.

fig_wright · July 23, 2025, 2:23pm

I’ve had some more time to have another go at this. Thanks for your efforts! I think what you’ve done above is have different rules for ESXi servers and the vCenter server, but we aren’t using vCenter, so this probably isn’t relevant to me.

I have managed to create a view that has the ESX CPU demand graph for every VM in it. This is really good and means I haven’t wasted my time. But the lack of any network related services being discovered via ESX is a real blow and means we’ll likely move to another solution long-term. In the meantime I’ll have to install the agent on a number of VMs to make up for it; using the agent does at least find some network interfaces (and lots of other stuff I’m not much interested in).

Registering agents was a hassle - in the end I found I had to extract the password from the Checkmk Docker container, at /omd/sites/cmk/var/check_mk/web/automation/automation.secret
This isn’t mentioned anywhere in the guide here: Monitoring Linux - The new agent for Linux in detail
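In case it helps anyone else, this is roughly how the secret can be read from outside the container. It is a sketch only: the container name "monitoring" is a placeholder for whatever the container was started as, the site name "cmk" is taken from the path above, and the helper names are made up.

```python
import shlex
import subprocess
from typing import List


def read_secret_command(container: str, site: str = "cmk") -> List[str]:
    """Build the `docker exec` call that prints the automation secret."""
    path = f"/omd/sites/{site}/var/check_mk/web/automation/automation.secret"
    return ["docker", "exec", container, "cat", path]


def read_secret(container: str, site: str = "cmk") -> str:
    """Run the command and return the secret (needs Docker and a running container)."""
    result = subprocess.run(
        read_secret_command(container, site),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# "monitoring" is a hypothetical container name
print(shlex.join(read_secret_command("monitoring")))
```

The print line only shows the command that would be run; read_secret() actually executes it and needs access to the Docker daemon.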

Now I am trying to set up notifications again. For the VM with the agent on I have successfully got network related email alerts - hooray! However, I am perplexed that there isn’t a monitoring rule for ESX CPU. There are rules for “ESX GPU Utilization”, “ESX VM Memory usage”, “ESX Vsphere host system CPU utilization” (the last no use to me as I don’t have vCenter). If there is a rule for ESX GPU why isn’t there a rule for “ESX CPU”?

fig_wright · July 24, 2025, 10:03am

Also, while investigating monitoring rules, there appear to be some duplicates. E.g. what is the difference between “CPU Utilization” and “CPU utilization on Linux/Unix” or “CPU utilization for simple devices”? The former doesn’t appear to be OS specific, so what’s the point of the latter two?

What’s the difference between “Network interfaces and switch ports” and “Network IO”, as both allow the monitoring of bytes through interfaces with the same settings?

And how can I tell which monitoring rule is being triggered in these duplicate cases? I have set all of the above rules (as I didn’t know which CPU/Network rules would be the ones I wanted) but now I cannot tell which ones are being triggered! The alert emails don’t indicate which network rule, or which CPU rule caused it to be sent. The emails only tell me which interface is being alerted for, e.g. “Service: Interface 5”.

Finally, I’m getting some alerts on service “Number of threads”. I didn’t set that up, and when I go to that monitoring rule it says “There are no rules defined in this set.” What’s going on there?

fig_wright · July 25, 2025, 9:38am

Sigh. I’ve run into another problem: I created a new “Number of threads” service rule to try to override what appears to be a default rule on this metric that you cannot change (there’s no help on how to do this, it seems). I immediately got an error:

Warnings:check_mk: ERROR: Duplicate service name (auto check) ‘Number of threads’ for host ‘hostname’! - 1st occurrence: check plug-in / item: bluecat_threads / None - 2nd occurrence: check plug-in / item: cpu_threads / None

So I deleted the new rule which appeared to clash…but the error now won’t go away. In addition, I now get a warning on every single VM:

[piggyback] Successfully processed from source ‘hostname’, Missing monitoring data for plugins, bluecat_threads WARN

I have no idea what “bluecat” is, and I don’t have any BlueCat hardware. Any ideas on how I can make all this nonsense go away? (other than nuking the Checkmk server and starting again)

andreas-doehler · July 25, 2025, 10:09am

It depends on the data that is available for the check.
The simplest example is a single value like “54”, which usually means 54% CPU utilization. Such a device usually uses the “CPU utilization” rule, as this rule only provides a simple set of levels.
The rule “CPU utilization for simple devices” provides many more options, such as IO wait, multiple cores and averaging.

But in the end it is not relevant for you as user.

If you open the menu icon next to a service and choose “Parameters for this service”, you will get to the rule set this check is actually using.

This has nothing to do with whether you use vCenter or not. As you have configured the special agent for your ESX host, you will see the CPU utilization on this host, and as it is the CPU utilization of an ESX host it will use the rule “ESX Vsphere host system CPU utilization”. The “vSphere” in the name has nothing to do with vCenter.

Here again, as I said above: if you go to the service for which you want to define a rule, it is much easier to get to the correct rule with “Parameters for this service”.

If there is nothing defined the check probably uses default/internal levels.
Inside the warning or critical message you should see why it is in this status. In your case too many threads or something like this. Now you can define a rule to your liking. Threads is a nice example where it is not possible to have a good default rule.

For such things it would be good to know what exactly was configured.
I think you defined an “Enforced services” rule and not a “Service monitoring rule”.

PS: it would be also good to have different threads for different problems

fig_wright · July 25, 2025, 3:00pm

Thank you for your help! Yes, somehow I had created an “Enforced service” on threads while trying to override the default. After deleting it the error is gone and all my VM piggyback warnings are gone.

I think I understand what you are saying on CPU measurements. I can see that on the VM with agent, I have both “CPU utilization” and “ESX CPU”; the former says the check is “CPU utilization on Linux/Unix” and the latter says “This check is not configurable via Setup”. On the VM without agent I only see the latter “ESX CPU”. So, none of the CPU checks claim to be using “ESX Vsphere host system CPU utilization”, and consequently I cannot set alarms for them (I have set limits in “ESX Vsphere host system CPU utilization” but they are never triggered). Am I maybe missing a service?

andreas-doehler · July 25, 2025, 3:16pm

This you only see on ESX host objects, not on VMs.
The “ESX CPU” service you see on VMs is really only for information. It is not a real check.
The problem there is you only see the used MHz/GHz value but no real usage in percent. The performance counter from vSphere side only reports this value but no maximum or anything like this.

fig_wright · July 28, 2025, 1:56pm

Hmm. I don’t see any sign of that service on the ESX hosts. Seems likely this is another service that isn’t being detected, for whatever reason. I know what you mean, about the ESX CPU not being ideal, but if I could set limits against it I would - as the GHz value is fairly well related to (VM allocated #cores) x (2.5GHz) which I can obtain easily to estimate the alert limits.
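As a rough illustration of that estimate: the 2.5GHz per core and the 95% level are just my numbers from this thread, the function names are made up, and as far as I can tell nothing like this can actually be applied to the ESX CPU service.

```python
def esx_cpu_limit_mhz(vcpus: int, core_ghz: float = 2.5, fraction: float = 0.95) -> float:
    """Estimate the MHz reading that corresponds to `fraction` of a VM's
    nominal capacity, taken as (allocated cores) x (clock per core)."""
    return vcpus * core_ghz * 1000 * fraction


def usage_percent(used_mhz: float, vcpus: int, core_ghz: float = 2.5) -> float:
    """Convert a used-MHz reading into an approximate percentage of capacity."""
    return 100 * used_mhz / (vcpus * core_ghz * 1000)


# A VM with 4 allocated cores: nominal capacity is 4 x 2.5 GHz = 10000 MHz
print(round(esx_cpu_limit_mhz(4)))      # MHz value to alert on
print(round(usage_percent(5000, 4)))    # percent used at a 5000 MHz reading
```

For a 4-core VM this gives an alert level of about 9500 MHz, and a reading of 5000 MHz corresponds to roughly 50% of nominal capacity.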

Nevertheless, I have now obtained the minimum of info that I was looking for from the software, so thank you all for your help on this.