Monitor used vCPUs per ESXi host

Hi,

I am monitoring several ESXi hosts and the virtual machines running on them. One thing I have not been able to find is a way to monitor the number of currently used vCPUs per host/core. It would be very useful for detecting when too many vCPUs/machines are running on a host.

Is this information already monitored and I just cannot find it, or is it not possible to monitor (yet)?

Best regards,
UT2019

I’m afraid that kind of check is not yet implemented, but I think it’s a great idea!
The required basic information (number of host CPUs, number of vCPUs in each VM running on that host) should already be queried, so it should be possible to build a check from that.
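
A minimal sketch of what such a check could compute, assuming the number of host cores and the vCPU count of each running VM are already available as plain numbers. The function name, the data layout and the thresholds are hypothetical and not part of Checkmk:

```python
from typing import Dict, Tuple


def vcpu_ratio_state(
    host_cores: int,
    vm_vcpus: Dict[str, int],
    warn: float = 3.0,  # hypothetical threshold: vCPUs per physical core
    crit: float = 5.0,  # hypothetical threshold: vCPUs per physical core
) -> Tuple[int, str]:
    """Return (state, summary) with state 0=OK, 1=WARN, 2=CRIT."""
    if host_cores <= 0:
        raise ValueError("host_cores must be positive")

    # Sum the vCPUs of all VMs currently running on this host.
    total_vcpus = sum(vm_vcpus.values())
    ratio = total_vcpus / host_cores

    if ratio >= crit:
        state = 2
    elif ratio >= warn:
        state = 1
    else:
        state = 0
    return state, f"{total_vcpus} vCPUs on {host_cores} cores (ratio {ratio:.2f}:1)"
```

Called as `vcpu_ratio_state(16, {"vm1": 8, "vm2": 8})`, this would report OK with a ratio of 1.00:1. Sensible thresholds depend heavily on the workload, so the defaults here are placeholders only.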

I don’t see a benefit here.
If the ESX host can’t handle the load, you will be alerted via the CPU ready and Co-Stop values of esx_vsphere_counters.cpu.

But it might be that the total number of vCPUs is higher than the number of cores and at the same time the CPU utilization is still okay.

I’m not familiar with the “CPU ready” and “Co-Stop” values of esx_vsphere_counters.cpu yet. Where can I find them?

see this post and Werk #10627
→ Version 1.7.0i1 (Not yet released)

Nowadays it’s pretty normal to overprovision CPU and memory, so it makes more sense to look at metrics that tell us whether VMs are waiting for resources.

Sorry for the thread resurrection, but the thread is not closed, so…

@martin.schwarz @mace the werk #10627 that was linked doesn’t exist. Was this implemented elsewhere? I can’t find any reference to it.

The werk was reverted here but without any real reason.

Reason: Issue can not be resolved at the moment.
The VMWare API does not provide the required values.

It is possible that you cannot get these counters via the vCenter, but there is no further information.

Pity! I would love to have co-stop in Checkmk.

IMHO, for real troubleshooting of ESXi CPU issues, the ready and co-stop values are of fundamental importance, and it’s sad that they are not included in Checkmk.
Without those values you will always need to look at esxtop…

I will check on one of my smaller ESX systems whether there is a difference between vCenter and host for this performance counter, if I have a little bit of time :slight_smile:

Made a short test in one of my ESX environments.

Added the following counters to the list of “REQUESTED_COUNTERS_KEYS”

cpu.ready
cpu.usage
cpu.usagemhz
cpu.demand
cpu.costop

Testing against both the ESX host and the vCenter worked without problems.
All the counters returned values, and “cpu.usage” gave results for every single CPU core.
I don’t know what the problem was with the reverted werk.
We can only ask @jonas.kluger, as he wrote the check :slight_smile:

Where did you put those values? In the GUI or directly in the plugin?

Regards,

This modification is made inside the special agent, but it only fetches these counters.
There is no check using these counters at the moment.

I’ve reapplied the changes for werk #10627 to my Checkmk 1.6 instance and have been able to get the desired values just fine.

HOWEVER, I know why the werk has likely been reverted: the values reported are far too high, because they are summation values: a value like 85321.13 means that within the last 20 second interval, the sum of “CPU ready time” for all VMs running on this ESXi host at this point in time (!) is 85 seconds, cumulatively.

The calculation in the Checkmk python check accounts for the 20 second timeframe (therefore divides by 200 correctly), but ALSO needs to divide by the number of then-running VMs on this host, too!

In other words: if you’ve got 20 VMs running on an ESXi host and Checkmk reports (with the original werk calculation) a CPU READY value of 80%, then the real value is 4%
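
The arithmetic described above can be sketched as follows. This assumes the counter is a summation in milliseconds over a 20 second sampling interval, which is what the division by 200 implies. The function and parameter names are made up for illustration and are not taken from the Checkmk check:

```python
def cpu_ready_percent(
    summation_ms: float,
    running_vms: int,
    interval_s: float = 20.0,  # assumed sampling interval of the counter
) -> float:
    """Average CPU ready percentage per running VM on one host."""
    if running_vms <= 0:
        raise ValueError("running_vms must be positive")

    # Host-wide value: summed ready time as a share of the interval.
    # With a 20 s interval this is the same as summation_ms / 200.
    host_percent = summation_ms / (interval_s * 1000.0) * 100.0

    # Divide by the number of running VMs to get the per-VM average.
    return host_percent / running_vms
```

With a hypothetical summation of 16000 ms, the host-wide figure is 80 %, and `cpu_ready_percent(16000.0, 20)` yields the 4 % from the example above.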

Unfortunately, I was unable to ascertain whether “number of active VMs on the host” is a variable easily accessible from within the calculation or not, therefore I have been unable to extend the check in a way to take the “running VM count” into account for the check.

Is this possible to add and therefore “fix” the check?

(EDIT: this also explains, by the way, why percentages > 100 % were possible as check results - WATO doesn’t allow adding percentages higher than 101.0, therefore I was wondering how/why Checkmk would report percentages of 321% for my test host)

Nice findings. It should be possible to solve this problem if the check uses the data from another check.
The number of running machines is available inside the data from the ESX or vCenter.

I think this is a little bit more work to do.

Hi all, as a shiny new enterprise customer we have logged a ticket for this.
We would really like to monitor these counters

Ticket is SUP-6145 for reference

Let us know if you get more info on your ticket.

Hi everybody! I’m about to comment on said ticket :slight_smile:
Dividing by the number of “poweredOn” VMs should not be an obstacle.
However: Say I have 20 VMs running, and then turn 18 of them off, just before the vCenter is queried.
That would mean I’d get the accumulated value of 20 VMs, but then divide it by 2. Will the resulting false positives be a problem that must be addressed? Or can we neglect that?

Yes indeed, if a lot of VMs are powered on at once we get “false negatives” (value is divided by larger amount of VMs than is warranted), if a lot of VMs are powered off at once we get “false positives”. (as in your example)
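
To put numbers on both directions, here is a small worked example. The values are invented for illustration, and the helper name is hypothetical:

```python
def per_vm_percent(summed_percent: float, vm_count: int) -> float:
    """Divide the host-wide summed percentage by the VM count seen at query time."""
    return summed_percent / vm_count


# Power-off case: 80 % was accumulated while 20 VMs were running,
# but only 2 VMs are still "poweredOn" when the count is read.
true_value = per_vm_percent(80.0, 20)      # divided by the correct count
false_positive = per_vm_percent(80.0, 2)   # divided by the stale, smaller count

# Power-on case: 8 % was accumulated while 2 VMs were running,
# but 20 VMs are "poweredOn" when the count is read.
false_negative = per_vm_percent(8.0, 20)   # divided by the larger count

print(true_value, false_positive, false_negative)
```

In the power-off case the reported value is ten times too high, and in the power-on case ten times too low, for that one check interval.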

For our specific environment this would not be an issue; we don’t page anybody if such services turn red. Nevertheless, the reported value would still be wrong, and by a large factor in your example.

I do question however how often this happens really. If I’ve got just 3 or 4 VMs on a host, where shutting down 2 at once would erroneously double the reported values, there is pretty much no risk for CPU oversubscription in the first place (what this check is supposed to monitor and to guard against).

On the other hand, we’re running 50-100 guests on a single ESXi and unless something major broke down we’d never shut down half of them at once (and if so we’ve got bigger issues to worry about than a single occurrence of a check result that displays double the expected value).

I’d welcome others’ thoughts on this; for our environment this edge case would not be relevant.