Basic Checkmk monitoring with ESXi and VMs

Hi all, new user here.

I am trying to use Checkmk to do very basic system monitoring of a handful of ESXi servers with a couple of dozen VMs on. All I want to monitor is average CPU and network usage, so I can set alerts for unusually high levels - e.g. CPU >95% or >10GB network usage in a period of 12 hours.
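For reference, a volume limit such as "10GB in 12 hours" can be turned into the average rate it implies, which is handy if the monitoring tool only accepts rate-based thresholds. This is only a rough sketch: it assumes decimal gigabytes, and the function name is made up for illustration.

```python
# Convert a "volume per period" limit into the average rate it implies.
GB = 10**9  # assumption: decimal gigabytes; use 2**30 if you mean GiB


def average_rate(total_bytes: float, period_hours: float) -> tuple[float, float]:
    """Return (bytes per second, megabits per second) for a volume spread over a period."""
    seconds = period_hours * 3600
    bytes_per_s = total_bytes / seconds
    mbit_per_s = bytes_per_s * 8 / 10**6
    return bytes_per_s, mbit_per_s


bytes_per_s, mbit_per_s = average_rate(10 * GB, 12)
print(f"{bytes_per_s:,.0f} B/s, about {mbit_per_s:.2f} Mbit/s")
```

So 10GB over 12 hours averages out to a fairly low sustained rate, and a short burst could exceed that rate without the 12-hour total coming anywhere near 10GB.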

I’m finding the web interface quite difficult to understand and more complicated than I was expecting. I have found the following two guides:

Using them I have added an ESXi server, created a connection rule, and connected successfully. I now see in the services: my VMs on the server, and some other stuff like filesystems and “Object count”. The two guides above then go into details on “piggy-backing” all the VMs, which it isn’t clear that I need if all I want to measure is VM CPU and network use…

Can someone please tell me:

  1. Do I need to do all the piggy-backing and set up the list of VMs just to monitor CPU and network use of VMs?
  2. Can anyone point to a guide that will show me how to set up CPU and network monitoring and simple alerts based on them? Do I need to install checkmk plugins?

Create a host in Checkmk with the same name as one of the VMs you want to monitor, then run the service discovery and see whether the data appears. To get more precise data, you might need to install the Checkmk agent in the VMs that you want to monitor.


Actually, I’ve just realised that Dynamic host management is not available in Checkmk Raw, only in the Enterprise version. It’s not a big deal to create the VMs manually, as I only have a dozen. If I manually create a host with the name of a VM, Checkmk does detect it and shows e.g. the following services being monitored:

ESX CPU
ESX Datastores
ESX Guest Tools
ESX Heartbeat
ESX Hostsystem
ESX Memory
ESX Mounted Devices
ESX Name
ESX Snapshots

I don’t see network of any kind in there, but there is CPU. How do I create an alert if CPU is over 95% for 12 hours?


Also, I notice all the VMs seem to have the following error:

[API/agent]: host configuration requires a datasource but none configure

I can’t see anywhere to add this “datasource” ?

For the data source error you need to set the host to “no agent” inside the host configuration.
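If you ever have more than a handful of VMs, the same "no agent" hosts could probably be created through the REST API instead of the GUI. The sketch below only builds and prints the request bodies and sends nothing. The endpoint path and the `tag_agent` attribute are written from memory and should be treated as assumptions to check against the API documentation built into your site. The URL and VM names are placeholders.

```python
import json

# Hypothetical values: replace with your own server, site and VM names.
CHECKMK_URL = "https://monitoring.example.com/mysite/check_mk/api/1.0"
VM_NAMES = ["vm-web01", "vm-db01"]


def host_payload(name: str, folder: str = "/") -> dict:
    """Build the JSON body for creating one host without an agent/API data source."""
    return {
        "folder": folder,
        "host_name": name,
        # Assumption: this attribute corresponds to the "no agent" setting in the GUI.
        "attributes": {"tag_agent": "no-agent"},
    }


# Assumption: host creation endpoint of the Checkmk REST API.
endpoint = f"{CHECKMK_URL}/domain-types/host_config/collections/all"
for vm in VM_NAMES:
    print("POST", endpoint)
    print(json.dumps(host_payload(vm), indent=2))
```

The host name has to match the VM name reported by the ESXi host, otherwise the VM data will not be assigned to it.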

Network statistics are not available for the individual VMs, only for the ESX hosts.

For the CPU notification, first create a rule that makes the service go critical at 95%, then create a notification delay rule for the time you want. That way a notification is only sent if the service stays in WARN/CRIT for the time selected in the rule.
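To illustrate the idea of threshold plus delay, here is a small stand-alone model of the logic. It is not how Checkmk implements it internally, just a sketch of "notify only if the value stays above the threshold for the whole delay". The function name and sample data are made up.

```python
from datetime import datetime, timedelta


def should_notify(samples, threshold=95.0, delay=timedelta(hours=12)):
    """samples: list of (timestamp, cpu_percent), oldest first.

    Return True once the value has stayed at or above the threshold
    continuously for at least `delay`.
    """
    breach_start = None
    for ts, cpu in samples:
        if cpu >= threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= delay:
                return True
        else:
            breach_start = None  # recovery resets the timer
    return False


start = datetime(2024, 1, 1)
sustained = [(start + timedelta(hours=h), 97.0) for h in range(13)]
spike = [(start + timedelta(hours=h), 40.0 if h == 6 else 97.0) for h in range(13)]
print(should_notify(sustained))  # True: above 95% for the full 12 hours
print(should_notify(spike))      # False: the dip at hour 6 resets the timer
```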


Thanks for that info! I’ve got rid of the data source error now. It’s really not clear in the instructions that ESXi VMs should have no API or agent set.

I’ve set a CPU notification, but am not getting any emails now. I can see the rule is being hit because in the event dashboard I can see “Spooled mail to local mail transmission agent”. After a while I finally added our mail relay host to the Docker setup (amazingly, this isn’t specified in the Docker setup guide, “Installation as a Docker container”) with the option
-e MAIL_RELAY_HOST='mailrelay.mydomain.com' (obviously putting our domain in), but I’m still not getting the email. Any thoughts?

It seems strange to me that this isn’t an internal setting from the interface, but whatever. Is there somewhere I can test the email? All the tests I can find seem to be testing external smtp services, which isn’t my issue.
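One way to check the relay on its own, independent of Checkmk, would be to send a message straight to it with Python's standard `smtplib`. This is a minimal sketch that assumes the relay accepts unauthenticated mail on port 25 from your machine. The addresses are placeholders, and the actual send is left commented out so nothing goes out by accident.

```python
import smtplib
from email.message import EmailMessage


def build_test_mail(sender: str, recipient: str) -> EmailMessage:
    """Create a plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Checkmk relay test"
    msg.set_content("Test message sent directly through the mail relay.")
    return msg


def send_via_relay(msg: EmailMessage, relay_host: str, port: int = 25) -> None:
    """Hand the message to the relay over plain SMTP (assumes no auth is required)."""
    with smtplib.SMTP(relay_host, port, timeout=10) as smtp:
        smtp.send_message(msg)


mail = build_test_mail("checkmk@mydomain.com", "me@mydomain.com")
print(mail["Subject"], "->", mail["To"])
# Uncomment to actually send (placeholder host name):
# send_via_relay(mail, "mailrelay.mydomain.com")
```

This only tests the relay, not the path from Checkmk to the relay, so a successful send here narrows the problem down rather than proving the notification setup works.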

Edit2: Ignore the above quote, the email just took a long time to arrive and ended up in spam :smiley:


Okay, I’m trying to set up a CPU rule. I have both Linux and Windows VMs. Rules that look relevant are:
CPU utilization for appliances
CPU utilization
CPU utilization on Linux/Unix

Which one is applicable for Windows? I’m surprised that the CPU use rule is OS specific??

I can see “ESX CPU” is detected as a service, but it’s not obvious if I’m meant to do something with that, and there are no metrics. Is there somewhere I can see the instantaneous current CPU values?

Edit: I found the metrics - you have to go to “Service Search” and apply a filter with nothing in to see them all. I still don’t see anything to do with network usage on the ESXi servers though…
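For anyone who wants the raw numbers outside the GUI: services expose their metrics as a performance data string in the usual Nagios-style `label=value[unit];warn;crit;min;max` format, and that string can be fetched with a Livestatus query and parsed. The sketch below assumes simple labels without quotes or spaces. The sample string is invented and is not real output from an ESX CPU service, and the query is shown only as text and is not executed.

```python
import re

# Assumption: a Livestatus query like this returns the perf data of the ESX CPU services.
LIVESTATUS_QUERY = (
    "GET services\n"
    "Columns: host_name description perf_data\n"
    "Filter: description = ESX CPU\n"
)

# label=value followed by an optional unit; thresholds after ';' are ignored.
PERF_RE = re.compile(r"(?P<label>[^=\s]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;\s]*)")


def parse_perf_data(perf: str) -> dict:
    """Map each metric label to a (value, unit) tuple."""
    metrics = {}
    for item in perf.split():
        match = PERF_RE.match(item)
        if match:
            metrics[match.group("label")] = (float(match.group("value")), match.group("unit"))
    return metrics


sample = "util=37.5%;80;95;0;100 load=1.2"  # invented example string
print(parse_perf_data(sample))
```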


Unfortunately I am hitting a roadblock here, despite many hours of searching and reading. Although I found this excellent guide for setup ( https://www.youtube.com/watch?v=RHJpDpK2ACE ) there are few guides on how to set up actual monitoring and notification.

In the detected services of an ESXi server I am expecting to see e.g. “CPU utilization”, “Memory”, and several network interfaces, but they are not present. I see filesystems, but excluding filesystems and VMs I only see 6 services: Check_MK, Check_MK Discovery, ESX Snapshots Summary, HostSystem, Object count, System Time. All examples show many more than this! Why might this be?

When I look at the dashboard at Monitor => Applications => vSphere VMs, the CPU utilization and ESX Memory columns are blank. Why is this? Is it because those services don’t exist in my list above?

On the VMs I get the following list of services: Check_MK, Check_MK Discovery, ESX CPU, ESX Datastores, ESX Guest Tools, ESX Heartbeat, ESX Hostsystem, ESX Memory, ESX Mounted Devices, ESX Name, ESX Snapshots, Object count, VM . This seems a decent list, and both ESX CPU and ESX Memory contain performance data I can hover over in the service screen. I’d like to display them better though.

I followed the guide here ( Building a dashboard for vSphere monitoring in Checkmk ) to create a new view with “CPU” and “Memory” columns for the ESXi host, but those added columns are always blank. I did the same for the VMs view, and they are blank there too. Looking in the VM service list I see the services are called “ESX CPU” and “ESX Memory”, but using those names leaves the columns in the VMs view blank as well :frowning: Why?

I am beginning to think that Checkmk is just not detecting the expected data. The servers are Dell PowerEdge R640 and R630 running ESXi 7.0.3, so they should be good. The guide says “dashboarding is also part of the Checkmk Raw Edition (CRE), but some features like certain dashboard elements are only included in the CEE” - are performance details only in the CEE? Surely not?

I am suspecting that this is why my notifications are not working - because the underlying metrics don’t actually exist.

Unfortunately I am running out of time to get checkmk running now, having spent 3 days on this. I will shortly have to abandon and move to another monitoring software… :frowning:

Checkmk is not something you can learn, let alone optimize in 3 days. I’ve been using the raw edition for about 2 years to monitor 800 hosts (windows/linux/juniper/apc) with over 24k associated services and I’m still learning things (like how to build mobile-friendly views).

To answer your #1: yes, you will need to install the Checkmk agent (and any necessary plugins) in order to see detailed info INSIDE a VM. Both ESX and VM guest OS monitoring have to be set up for a full picture of the entire stack. Look at it this way: are you able to see what processes the VM is running from vCenter? Vice versa, can you see how many snapshots have been taken from within the guest OS?

Here is an example of one of my VMs, with the metrics you are interested in highlighted. With the Checkmk agent installed, I can see network interfaces, CPU utilization, and processor queue (and a whole lot more with certain plugins):

Opening the CPU utilization service reveals details that you can set thresholds for, and tracking the metrics and trends over time will be useful for troubleshooting. You cannot see this level of detail from vCenter (e.g. ESX CPU):

From there, you can set up one of the CPU service monitoring rules to define thresholds for warning and critical events. These can be scoped by folder, tag, specific hosts, etc.:

As for the ESX hosts themselves, did you set up your VM agent rule for hosts (not vCenter) similar to this:

If so then your ESX host page should look like this:

Which lets me build a view (not dashboard) like this for my ESX hosts:
