Monitor used vCPUs per ESXi host

UT2019 · February 12, 2020, 9:01am

Hi,

I am monitoring several ESXi hosts and the corresponding virtual machines on them. One thing I cannot find is a way to monitor the number of currently used vCPUs per host/core. It would be very useful to detect when too many vCPUs/machines are running per host.

Is this information already monitored and I just cannot find it, or is it not possible to monitor (yet)?

Best regards,
UT2019

martin.schwarz · February 12, 2020, 9:33am

I’m afraid that kind of check is not yet implemented, but I think it’s a great idea!
The required basic information (number of host CPUs, number of vCPUs in each VM running on that host) should already be queried, so it should be possible to build a check from that.
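A minimal sketch of how such a ratio could be derived from those two pieces of information. This is not Checkmk code: the `VM` class, the function name and the sample values are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VM:
    # Hypothetical container for the per-VM data mentioned above.
    name: str
    vcpus: int
    powered_on: bool


def vcpus_per_core(host_cores: int, vms: Sequence[VM]) -> float:
    """Return the vCPUs allocated to powered-on VMs per physical core."""
    if host_cores <= 0:
        raise ValueError("host_cores must be positive")
    # Only running VMs compete for the physical cores.
    allocated = sum(vm.vcpus for vm in vms if vm.powered_on)
    return allocated / host_cores


# Made-up sample data: two running VMs, one powered off, 8 physical cores.
vms = [VM("web01", 8, True), VM("db01", 16, True), VM("old01", 4, False)]
ratio = vcpus_per_core(8, vms)
print(f"{ratio:.1f} vCPUs per physical core")
```

Which ratio counts as "too many" depends on the workload, so any thresholds would have to be configurable.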

mace · February 12, 2020, 10:21am

I don’t see a benefit here.
If the ESX host can’t handle the load, you will be alerted via the CPU ready and Co-Stop values of esx_vsphere_counters.cpu

UT2019 · February 12, 2020, 10:52am

But it might be that the total number of vCPUs is higher than the number of cores and at the same time the CPU utilization is still okay.

I don’t know the esx_vsphere_counters.cpu “CPU ready” and “Co-Stop” values yet. Where can I find them?

martin.schwarz · February 12, 2020, 1:16pm

see this post and Werk #10627
→ Version 1.7.0i1 (Not yet released)

mace · February 12, 2020, 1:17pm

Nowadays it’s pretty normal to overprovision CPU and memory, so it makes more sense to look at metrics that tell us whether VMs are waiting for resources.

GarthH · April 11, 2021, 9:49pm

Sorry for the thread resurrection, but the thread is not closed so…

@martin.schwarz @mace the werk #10627 that has been linked doesn’t exist. Was this implemented elsewhere? I can’t find any reference to it.

andreas-doehler · April 12, 2021, 5:13am

The werk was reverted here but without any real reason.

Reason: Issue can not be resolved at the moment.
The VMWare API does not provide the required values.

It is possible that you cannot get these counters via the vCenter, but there is no further information.

martin.schwarz · April 12, 2021, 7:26am

Pity! I would love to have co-stop in Checkmk.

aeckstein · April 12, 2021, 7:42am

IMHO, for real troubleshooting of ESXi CPU issues, the ready and co-stop values are of fundamental importance, and it’s sad that they are not included in Checkmk.
Without those values you will always need to look at esxtop…

andreas-doehler · April 12, 2021, 8:10am

If I have a little time, I will check on one of my smaller ESX systems whether there is a difference between vCenter and host for this performance counter.

andreas-doehler · April 12, 2021, 1:20pm

Made a short test in one of my ESX environments.

I added the following counters to the list of “REQUESTED_COUNTERS_KEYS”:

cpu.ready
cpu.usage
cpu.usagemhz
cpu.demand
cpu.costop

The tests against both the ESX host and the vCenter ran without problems.
All the counters returned values, and “cpu.usage” gave results for every single CPU core.
I don’t know what the problem was with the reverted werk.
We can only ask @jonas.kluger, as he wrote the check.

DominiqueArpin · April 15, 2021, 7:41pm

Where did you put those values? In the GUI or directly in the plugin?

Regards,

andreas-doehler · April 15, 2021, 8:34pm

This modification is made inside the special agent, but it only makes the agent fetch these counters.
At the moment there is no check that uses them.

bitwiz · April 20, 2021, 9:22am

I’ve reapplied the changes for werk #10627 to my Checkmk 1.6 instance and have been able to get the desired values just fine. The commit in question:

github.com/tribe29/checkmk
10627 esx_vsphere_counters.cpu: Add new CPU usage check for ESX host systems
committed 08:47AM - 05 Dec 19 UTC, +224 -0

You can now monitor the CPU ready and Co-Stop values for ESX host systems. The …values are the percentage of time in a check interval that vCPU(s) are ready but are no physical CPU is free and assigned to them. VMWare calls this check interval 'realtime' which by default is equal to 20 seconds. The recommended VMWare levels are 5% for warning and 10% for critical. Change-Id: I8e68382c1a0e1989dca1da7f8208683b75ecc83e
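For illustration, the levels quoted in the commit message (5% warning, 10% critical) could be applied to a ready or co-stop percentage roughly as follows. This is a sketch with a hypothetical function name, not the code from the werk; the return values follow the usual 0 = OK, 1 = WARN, 2 = CRIT convention.

```python
def ready_state(percent: float, warn: float = 5.0, crit: float = 10.0) -> int:
    """Map a CPU ready (or co-stop) percentage to a monitoring state."""
    # Defaults are the levels named in the commit message; they are assumed
    # to be overridable per host in a real check.
    if percent >= crit:
        return 2  # CRIT
    if percent >= warn:
        return 1  # WARN
    return 0  # OK


for value in (2.5, 7.0, 12.0):
    print(value, ready_state(value))
```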

HOWEVER, I know why the werk has likely been reverted: the values reported are far too high, because they are summation values: a value like 85321.13 means that within the last 20 second interval, the sum of “CPU ready time” for all VMs running on this ESXi host at this point in time (!) is 85 seconds, cumulatively.

The calculation in the Checkmk Python check accounts for the 20 second timeframe (and therefore correctly divides the millisecond value by 200), but it ALSO needs to divide by the number of VMs running on this host at that time!

In other words: if you’ve got 20 VMs running on an ESXi host and Checkmk reports (with the original werk calculation) a CPU READY value of 80%, then the real value is 4%
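The arithmetic described in this post can be sketched as follows. The function names are hypothetical and this is not the werk's actual code; it assumes the counter is a summation in milliseconds over the 20 second realtime interval, as described above.

```python
def host_ready_percent(summation_ms: float, interval_s: float = 20.0) -> float:
    """Uncorrected value: summed ready time as a percentage of the interval.

    With the default 20 s interval this is the same as summation_ms / 200.
    """
    return summation_ms / (interval_s * 1000.0) * 100.0


def per_vm_ready_percent(
    summation_ms: float, running_vms: int, interval_s: float = 20.0
) -> float:
    """Corrected value as proposed in this post: average ready % per running VM."""
    if running_vms <= 0:
        return 0.0  # nothing is running, so nothing can be waiting
    return host_ready_percent(summation_ms, interval_s) / running_vms


# 16000 ms of accumulated ready time across 20 running VMs
print(host_ready_percent(16000.0))        # the uncorrected "80%" case
print(per_vm_ready_percent(16000.0, 20))  # the corrected "4%" case
```

The uncorrected function also shows why results above 100% are possible: the sum over all VMs can exceed the length of the interval.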

Unfortunately, I was unable to ascertain whether “number of active VMs on the host” is a variable easily accessible from within the calculation or not, therefore I have been unable to extend the check in a way to take the “running VM count” into account for the check.

Is this possible to add and therefore “fix” the check?

(EDIT: this also explains, by the way, why percentages > 100 % were possible as check results - WATO doesn’t allow adding percentages higher than 101.0, therefore I was wondering how/why Checkmk would report percentages of 321% for my test host)

andreas-doehler · April 20, 2021, 9:32am

Nice find. It should be possible to solve this problem if the check uses the data from another check.
The number of running machines is available in the data from the ESX host or vCenter.

I think this will take a little more work.

GarthH · April 23, 2021, 12:21am

Hi all, as a shiny new enterprise customer we have logged a ticket for this.
We would really like to monitor these counters.

The ticket is SUP-6145, for reference.

DominiqueArpin · April 28, 2021, 1:14pm

Let us know if you get more info on your ticket.

moritz · May 6, 2021, 1:04pm

Hi everybody! I’m about to comment on said ticket.
Dividing by the number of “poweredOn” VMs should not be an obstacle.
However: say I have 20 VMs running, and then turn 18 of them off just before the vCenter is queried.
That would mean I’d get the accumulated value of 20 VMs, but then divide it by 2. Will the resulting false positives be a problem that must be addressed? Or can we neglect that?

bitwiz · May 6, 2021, 1:14pm

Yes indeed: if a lot of VMs are powered on at once, we get “false negatives” (the value is divided by a larger number of VMs than is warranted); if a lot of VMs are powered off at once, we get “false positives” (as in your example).
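To put numbers on this edge case, here is a small sketch with made-up values (each VM is assumed to accumulate 800 ms of ready time in the interval; the function name is hypothetical):

```python
def ready_percent(summation_ms: float, vm_count: int, interval_s: float = 20.0) -> float:
    """Summed ready time as a percentage of the interval, averaged over vm_count VMs."""
    return summation_ms / (interval_s * 1000.0) * 100.0 / vm_count


READY_MS_PER_VM = 800.0  # assumed sample value, not a measurement

# The counter was accumulated while 20 VMs were running.
summation_ms = 20 * READY_MS_PER_VM

true_value = ready_percent(summation_ms, 20)  # divided by the correct VM count
after_power_off = ready_percent(summation_ms, 2)  # 18 VMs powered off before the query
after_power_on = ready_percent(summation_ms, 40)  # 20 more VMs powered on before the query

print(f"true: {true_value:.1f}%")
print(f"false positive: {after_power_off:.1f}%")
print(f"false negative: {after_power_on:.1f}%")
```

In general the error factor is the VM count during the interval divided by the VM count at query time, so 20 / 2 gives a tenfold overestimate in the example above.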

For our specific environment this would not be an issue, since we don’t page anybody if such services turn red. Nevertheless, the reported value would still be wrong, and by a large factor in your example.

I do question, however, how often this really happens. If I’ve got just 3 or 4 VMs on a host, where shutting down 2 at once would erroneously double the reported values, there is pretty much no risk of CPU oversubscription in the first place (which is what this check is supposed to monitor and guard against).

On the other hand, we’re running 50-100 guests on a single ESXi and unless something major broke down we’d never shut down half of them at once (and if so we’ve got bigger issues to worry about than a single occurrence of a check result that displays double the expected value).

I’d welcome others’ thoughts on this; for our environment this edge case would not be relevant.