Monitor used vCPUs per ESXi host

bitwiz · April 20, 2021, 9:22am

I’ve reapplied the changes for werk #10627

10627 esx_vsphere_counters.cpu: Add new CPU usage check for ESX host systems

committed 08:47AM - 05 Dec 19 UTC

You can now monitor the CPU ready and Co-Stop values for ESX host systems. The …values are the percentage of time in a check interval that vCPU(s) are ready but no physical CPU is free and assigned to them. VMWare calls this check interval 'realtime' which by default is equal to 20 seconds. The recommended VMWare levels are 5% for warning and 10% for critical. Change-Id: I8e68382c1a0e1989dca1da7f8208683b75ecc83e

to my Checkmk 1.6 instance and have been able to get the desired values just fine.

HOWEVER, I think I know why the werk has likely been reverted: the reported values are far too high because they are summation values. A value like 85321.13 (milliseconds) means that within the last 20-second interval, the sum of “CPU ready time” across all VMs running on this ESXi host at that point in time (!) is about 85 seconds, cumulatively.

The calculation in the Checkmk Python check accounts for the 20-second interval (and therefore correctly divides by 200), but it ALSO needs to divide by the number of VMs running on the host at that time!

In other words: if you’ve got 20 VMs running on an ESXi host and Checkmk reports (with the original werk calculation) a CPU READY value of 80%, then the real value is 4%.
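
To make the arithmetic concrete, here is a minimal Python sketch of the correction I have in mind. The function names and the `running_vms` parameter are my own placeholders, not anything from the actual check plugin, and I’m assuming the summation value is in milliseconds and the interval is the default 20 seconds:

```python
# Hypothetical helper names; this is NOT the code from the werk.
REALTIME_INTERVAL_S = 20  # default 'realtime' interval mentioned in the werk text


def ready_percent(summation_ms, interval_s=REALTIME_INTERVAL_S):
    """Host-wide summation (assumed milliseconds) as a percentage of one interval.

    For a 20 s interval this is the same as dividing by 200.
    """
    return summation_ms / (interval_s * 1000.0) * 100.0


def ready_percent_per_vm(summation_ms, running_vms, interval_s=REALTIME_INTERVAL_S):
    """Additionally divide by the number of VMs running on the host."""
    if running_vms <= 0:
        raise ValueError("running_vms must be a positive number")
    return ready_percent(summation_ms, interval_s) / running_vms


if __name__ == "__main__":
    raw = 16000.0  # made-up summation value that yields 80% host-wide
    print(ready_percent(raw))             # what the original calculation reports
    print(ready_percent_per_vm(raw, 20))  # corrected for 20 running VMs
```

With 20 running VMs, the 80% from the original calculation comes out as 4% per VM, matching the example above. Where `running_vms` would come from inside the check is exactly the part I haven’t figured out.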

Unfortunately, I was unable to ascertain whether the “number of active VMs on the host” is easily accessible as a variable from within the calculation, so I have not been able to extend the check to take the “running VM count” into account.

Is it possible to add this and thereby “fix” the check?

(EDIT: by the way, this also explains why percentages > 100% were possible as check results. WATO doesn’t allow entering percentages higher than 101.0, so I was wondering how/why Checkmk would report percentages of 321% for my test host.)