Proxmox LXC CPU Monitoring goes over 100%

CheckMK version: Cloud Edition 2.3.0p20

Hello,
I’ve been trying to monitor the stats of my Proxmox containers, but I’ve been experiencing some anomalies regarding the ingestion of CPU metrics:

As you can see, the CPU utilization goes over 100% and it hits peaks of 417%.

As far as I know, the CPU utilization is not a metric given by the piggyback of the pve agent (It only returns memory and disk usage)

So I just installed a normal agent inside the LXC and got the metric from there.

From what I can see, the CPU utilization is correctly reported by Proxmox:
image

nor I see any anomalies from htop inside the container:

Are these some kind of limitations related to monitoring an LXC, or is there something wrong with my setup?

Thanks,
Nicole

From the top of my mind I think something similar can happen for Docker containers, where the way the check works creates this kind of data. The one thing I am not sure about anymore is, whether this actually makes sense and if there is a good explanation, or not. :grimacing:

Hi Robin,
so, I did some testing, here’s the results:

First, I’ve downloaded the agent output installed on the LXC, so that I can get some insight on what’s going on.

This is the section that I’ve found in the output:

<<<lxc_container_cpu_cgroupv2>>>
uptime 60573.54 60573.54
num_cpus 8
usage_usec 179648197122
user_usec 154607125365
system_usec 25041071757
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0

The number of cpus is correct, and then the part that comes after that is basically the output of /sys/fs/cgroup/cpu.stat
image

So, it seems like the agent is reporting things correctly.

Next, I’ve made a simple script that converts the usage_usec which represents the total CPU time (in microseconds) consumed by tasks in the cgroup, into a readable percentage.
In order to do that, I simply take 2 samples in a time span (1 second in this case), calculate the delta of the usage and the time, and then use the following formula:

cpu_percentage = (delta_usage / 1000000) / (delta_time * num_cpus) * 100

That’s what it looks like as a script:

#!/bin/bash

# Get the number of CPUs
num_cpus=$(nproc)

# Read the initial usage
usage1=$(cat /sys/fs/cgroup/cpu.stat | grep 'usage_usec' | awk '{print $2}')
time1=$(date +%s)

# Wait 1 second
sleep 1

# Read the second usage
usage2=$(cat /sys/fs/cgroup/cpu.stat | grep 'usage_usec' | awk '{print $2}')
time2=$(date +%s)

# Calculate deltas
delta_usage=$((usage2 - usage1))
delta_time=$((time2 - time1))

# Calculate CPU usage percentage
cpu_usage=$(awk "BEGIN {print ($delta_usage / 1000000) / ($delta_time * $num_cpus) * 100}")

echo "CPU Usage: $cpu_usage%"

Let’s test this out:


It seems like there’s nothing strange with the output of the cpu.stat, the percentage is correctly reported and it matches what I see in real time in Proxmox.

So… let’s dig a little deeper:
this is the CPU usage on Proxmox:

and this is the CPU usage on CheckMK:

There’s a clear pattern that suggests the metrics are somewhat correct, but there is some sort of transformation being applied to it.

I’ve ingested the metrics in Grafana and did some random tests.

Original metric:

Same metric, divided by 8:

Would you look at that, that’s the correct CPU usage :smile:

So what I discovered is that CheckMK is getting the metrics right, but it’s multiplying the actual CPU usage by the amount of cores assigned to the container

Why is CheckMK doing this calculation only to LXC containers? That’s what I’m curious about :melting_face:

This more and more seems to be related to the issue I had in mind.

Thanks for the extensive research! :pray:

One quick question: Did this behavior change after an update, or was it always like this?

To the best of my knowledge, it has been this way since I began monitoring Proxmox. However, I haven’t been tracking it for many versions—only since version 2.2