I am monitoring several ESXi hosts and the corresponding virtual machines on them. One thing I have not been able to find is a way to monitor the number of currently used vCPUs per host/core. It would be very useful to detect when too many vCPUs/machines are running per host.
Is this information already monitored and I just cannot find it, or is it not possible to monitor (yet)?
I’m afraid that kind of check is not yet implemented, but I think it’s a great idea!
The required basic information (number of host CPUs, number of vCPUs in each VM running on that host) should already be queried, so it should be possible to build a check from that.
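A minimal sketch of how such a check could be derived from those two pieces of information. The function names, the input shape (a list of vCPU counts for the powered-on VMs) and the thresholds are all hypothetical and not part of any existing Checkmk plugin:

```python
def vcpu_per_core_ratio(host_cores, vm_vcpus):
    """Return allocated vCPUs per physical core.

    host_cores: number of physical CPU cores of the ESXi host
    vm_vcpus:   vCPU counts of the VMs currently powered on (assumed input)
    """
    if host_cores <= 0:
        raise ValueError("host_cores must be positive")
    return sum(vm_vcpus) / host_cores


def ratio_state(ratio, warn=3.0, crit=5.0):
    """Map the ratio to a monitoring state (0=OK, 1=WARN, 2=CRIT).

    The default thresholds are placeholders, not recommendations.
    """
    if ratio >= crit:
        return 2
    if ratio >= warn:
        return 1
    return 0


# Example: a 16-core host running six VMs with 32 vCPUs in total
ratio = vcpu_per_core_ratio(16, [4, 4, 8, 2, 2, 12])
print(ratio, ratio_state(ratio))
```

What counts as "too many" depends on the workload, so the levels would have to be configurable per host.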
IMHO, for real troubleshooting of ESXi CPU problems the ready and costop values are of fundamental importance, and it’s sad that they are not included in Checkmk.
Without those values you will always need to look at esxtop…
If I have a little bit of time, I will check on one of my smaller ESX systems whether there is a difference between vCenter and host for this performance counter.
Testing against both the ESX host and the vCenter worked without problems.
All the counters returned values and the “cpu.usage” gave results for every single CPU core.
I don’t know what the problem was with the reverted werk.
We can only ask @jonas.kluger, as he wrote the check.
to my Checkmk 1.6 instance and have been able to get the desired values just fine.
HOWEVER, I know why the werk has likely been reverted: the values reported are far too high, because they are summation values in milliseconds: a value like 85321.13 means that within the last 20-second interval, the sum of “CPU ready time” for all VMs running on this ESXi host at this point in time (!) is 85 seconds, cumulatively.
The calculation in the Checkmk Python check accounts for the 20-second timeframe (and therefore correctly divides by 200), but it ALSO needs to divide by the number of VMs running on this host at that time!
In other words: if you’ve got 20 VMs running on an ESXi host and Checkmk reports (with the original werk calculation) a CPU READY value of 80%, then the real value is 4%.
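The arithmetic described above can be sketched as follows. This is not the code from the werk; the function names are made up for illustration, and it assumes a 20-second sample interval and a host-wide summation value in milliseconds:

```python
INTERVAL_SECONDS = 20  # assumed length of one sample interval


def host_ready_percent(summation_ms, interval_s=INTERVAL_SECONDS):
    """Host-wide sum as a percentage of the interval.

    For a 20 s interval this is the same as dividing by 200.
    """
    return summation_ms * 100.0 / (interval_s * 1000.0)


def per_vm_ready_percent(summation_ms, running_vms, interval_s=INTERVAL_SECONDS):
    """Average CPU ready percentage per running VM."""
    if running_vms <= 0:
        return 0.0  # nothing running, nothing can be waiting for a CPU
    return host_ready_percent(summation_ms, interval_s) / running_vms


# 20 VMs accumulate 16000 ms of ready time within one 20 s interval
print(host_ready_percent(16000.0))        # 80.0 (uncorrected value)
print(per_vm_ready_percent(16000.0, 20))  # 4.0  (divided by the VM count)
```

Note that this yields an average over all running VMs; a single VM with a very high ready time can still hide behind many idle ones.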
Unfortunately, I was unable to ascertain whether the “number of active VMs on the host” is easily accessible from within the calculation, so I have not been able to extend the check to take the “running VM count” into account.
Is this possible to add and therefore “fix” the check?
(EDIT: this also explains, by the way, why percentages > 100 % were possible as check results - WATO doesn’t allow adding percentages higher than 101.0, therefore I was wondering how/why Checkmk would report percentages of 321% for my test host)
Nice findings. It should be possible to solve this problem if the check uses the data from another check.
The number of running machines is available inside the data from the ESX or vCenter.
Hi everybody! I’m about to comment on said ticket.
Dividing by the number of “poweredOn” VMs should not be an obstacle.
However: Say I have 20 VMs running, and then turn 18 of them off, just before the vCenter is queried.
That would mean I’d get the accumulated value of 20 VMs, but then divide it by 2. Will the resulting false positives be a problem that must be addressed? Or can we neglect that?
Yes indeed, if a lot of VMs are powered on at once we get “false negatives” (the value is divided by a larger number of VMs than is warranted), and if a lot of VMs are powered off at once we get “false positives” (as in your example).
For our specific environment this would not be an issue, since we don’t page anybody if such services turn red, but the value reported would nevertheless still be wrong, and by a factor of ten in your example.
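To put numbers on both cases, here is a small sketch. It assumes a 20-second interval, a summation value in milliseconds, and that the VM count is read at query time rather than averaged over the interval; the function name is hypothetical:

```python
def reported_ready_percent(summation_ms, vms_at_query, interval_s=20):
    """Ready percentage as reported when dividing by the VM count
    seen at query time (which may differ from the count during the interval)."""
    return summation_ms * 100.0 / (interval_s * 1000.0) / vms_at_query


summation_ms = 16000.0  # accumulated by 20 VMs during the interval

correct = reported_ready_percent(summation_ms, 20)         # count unchanged
after_power_off = reported_ready_percent(summation_ms, 2)  # 18 VMs powered off
after_power_on = reported_ready_percent(summation_ms, 40)  # 20 VMs powered on

print(correct, after_power_off, after_power_on)
```

With these numbers the power-off case reports ten times the correct value and the power-on case half of it, and only for the one interval in which the count changed.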
I do question, however, how often this really happens. If I’ve got just 3 or 4 VMs on a host, where shutting down 2 at once would erroneously double the reported values, there is pretty much no risk of CPU oversubscription in the first place (which is what this check is supposed to monitor and guard against).
On the other hand, we’re running 50-100 guests on a single ESXi and unless something major broke down we’d never shut down half of them at once (and if so we’ve got bigger issues to worry about than a single occurrence of a check result that displays double the expected value).
I’d welcome others’ thoughts on this; for our environment this edge case would not be relevant.