Monitor KVM VM resources?

Setting up checkmk for the first time and one of the major things we’re wanting is to monitor all the VM resources our KVM machines are utilizing. CPU/Memory/disk/network so we can help manage and find those VMs that are using all the resources.

The goal is to have a dashboard with the host servers and then dive into the VMs on their own dashboards, so if something’s using 300% of its allocated CPU cores we can fix it. This matters especially for networking: if a bot is spamming, we’d like to see that VM’s abnormal usage instead of just the whole host’s.

You can install the agent on each of the VMs and also on the KVM hosts to see where the resources are being consumed. This won’t give you a “KVM” native view of whether the VMs are under- or oversubscribed, but it will give you visibility into where the high consumers are.

What KVM platform are you using? Oracle OLVM or other?

SolusVM, as it’s a hosted situation. I don’t want to install anything inside every VM; I was hoping to see it just from the host. I don’t need to see specifics inside each VM, just generalized data so I know where to dig in.

I have an agent plugin that can be run on hosts using libvirt to run qemu-based VMs. It reports one service per VM with its current state, CPU usage, memory usage & assigned memory. Basically it’s a fancy wrapper around virsh list, virsh dominfo $domain & ps auxw | grep qemu. It offers two metrics: CPU utilization & memory usage. There are rules to configure limits for both for alerting purposes. Functionality’s certainly not extensive, but you’d see CPU hogging quite clearly.
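For readers wondering what wrapping `virsh dominfo` can look like, here is a minimal Python sketch. It is not the plugin's actual code; the function names (`parse_dominfo`, `kib`, `vm_summary`) are made up for illustration, and it assumes the usual "Key: value" layout that `virsh dominfo` prints.

```python
import subprocess


def parse_dominfo(text):
    """Turn `virsh dominfo` "Key: value" lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and anything without a colon
            info[key.strip()] = value.strip()
    return info


def kib(value):
    """Convert a value such as '2097152 KiB' to an integer number of KiB."""
    return int(value.split()[0])


def vm_summary(domain):
    """Hypothetical helper: query one domain on the local libvirt host."""
    out = subprocess.run(
        ["virsh", "dominfo", domain],
        capture_output=True, text=True, check=True,
    ).stdout
    info = parse_dominfo(out)
    return {
        "state": info.get("State"),
        "vcpus": int(info["CPU(s)"]),
        "max_mem_kib": kib(info["Max memory"]),
        "used_mem_kib": kib(info["Used memory"]),
    }
```

Keeping the parser separate from the `virsh` call means it can be tried against saved command output without a libvirt host at hand.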

If this is of interest to you, I’ll see that I package it for the Exchange.

Beside the approach from @mbunkus you can also have a look at the Solus API.

It should be possible to get per-VM data such as CPU/IO/network usage from there.
The problem is that, as far as I know, there is no existing Checkmk integration for this API.

I was looking for a built-in solution, just like what @mbunkus describes, that pulls the data using virsh.

One main concern is the network: hunting botnets, abusers and the like. In case a service gets infected, we want to see the increase without having to put throttles everywhere. If I can figure this out at the host level, I can monitor all of them and then deep dive without digging through network logs or anything else.

I’m new to Checkmk, but the goal is to use it as a baseline to compare against, so if there’s a random huge uptick we get alerted to an issue, as opposed to a VM that we know has always had high usage.

Alright, give me a couple of days to get the repo set up & documentation written. I’ll shout when it’s available publicly.

That would be incredible!!! You’re my hero!!

Done. The plugin is called “libVirt VM status and resource usage” and has been submitted to the Exchange. Approval usually takes a few days.

In the meantime you can download it from the repository. I also highly suggest you read its documentation for usage instructions.

Not wanting to install anything inside every vm

I just want to say: This will be the best monitoring you can get, installing the agent.
It is lightweight, efficient, fast and has no dependencies. It is virtually a shell script.
Check it out, if you really want to understand how and why a VM is behaving as it is, the agent is the way to go.

Yeah, but we’re adding, removing and changing VMs all the time, and some are only installed for minutes, so it doesn’t really make much sense. Also, there are thousands of them spread across dozens of hosts, and there will be tens of thousands; they’re more like Docker containers. We just want some easy way to keep an eye out in case one goes crazy, as many are exposed to the WAN and so can get attacked.

Most are designed to just get nuked and rebuilt with new logins if anything happens so not a huge deal but still trying to keep it safe. We’re not trying to host botnets or help spammers.

You could still install the agent during deployment of the VM and automatically add the host to Checkmk. Just saying. :slight_smile:

I’ve just pushed a new release to Codeberg (will be submitted to the Exchange, too, of course), with the following change:

Reworked CPU utilization to use data from ps (= the proc file system) instead of running top for 1s. This comes with a different metric name. For a service’s existing RRDs, the effect is that the old metrics will still be visible but no longer updated, until either the old metrics are removed from the RRDs manually or the RRDs are deleted entirely once.

This yields much better precision as the prior method only used a 1s window with top in which the current CPU usage was measured, whereas the new method uses the kernel’s own accounting information for the calculation (which is the same method as CheckMK’s own built-in calculations, e.g. when monitoring specific processes).
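To illustrate the general idea of using the kernel's accounting rather than a short sampling window, here is a rough Python sketch. It is an assumption-laden illustration, not the plugin's code: it reads cumulative CPU ticks from `/proc/<pid>/stat` (fields 14 and 15, `utime` and `stime`) and derives utilization from the difference between two check runs. All function names are hypothetical.

```python
import os


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The process name sits in parentheses and may contain spaces,
    # so split after the last ')' before counting fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]) + int(rest[12])


def cpu_percent(prev_ticks, curr_ticks, elapsed_seconds, ticks_per_second):
    """CPU utilization between two samples; 100 means one fully busy core."""
    cpu_seconds = (curr_ticks - prev_ticks) / ticks_per_second
    return cpu_seconds / elapsed_seconds * 100.0


def read_cpu_ticks(pid):
    """Hypothetical helper: read the current tick count for a qemu process."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return cpu_ticks_from_stat(stat_file.read())


# Clock ticks per second on this system (commonly 100 on Linux).
TICKS_PER_SECOND = os.sysconf("SC_CLK_TCK")
```

Because the counters are cumulative, the result is an average over the whole interval between two checks, so short spikes that a 1s snapshot might miss or exaggerate are accounted for.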

If you’re already running the old plugin, I suggest you remove the RRDs of the VM … services once after installing the plugin, re-baking agents & updating the deployed agent on the VM host.

Alternatively wait a day or two until I have more actual experience with running the modified version. This is not that well tested yet.

Oh, I might also add more services for network interfaces with metrics for read/write throughput & error rates as those are easy enough to retrieve with virsh domifstat. That would result in one service per interface configured for each VM, e.g. VM appserver1 interface vnet1.
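As a rough illustration of how throughput and error rates can be derived from `virsh domifstat`, here is a Python sketch. It is not taken from the plugin; it assumes the command's usual three-column output (`<interface> <counter> <value>`, e.g. `vnet1 rx_bytes 1000`), and the function names are invented for this example.

```python
def parse_domifstat(text):
    """Parse `virsh domifstat` output into {counter_name: value}."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # "<interface> <counter> <value>"
            _interface, name, value = parts
            counters[name] = int(value)
    return counters


def per_second_rates(prev, curr, elapsed_seconds):
    """Turn two cumulative counter samples into per-second rates.

    Counters that went backwards (e.g. after a VM restart) are skipped
    instead of producing a negative rate.
    """
    rates = {}
    for name, value in curr.items():
        if name in prev and value >= prev[name]:
            rates[name] = (value - prev[name]) / elapsed_seconds
    return rates
```

The counters are cumulative since the interface came up, so a check has to remember the previous sample and the time between samples to turn them into rates.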

Aaaaand here’s the release with support for network interfaces.

Block statistics might also be useful & easy enough to retrieve with domblkinfo + domblkstat. But I won’t have time to implement that for at least a week if not more.

Edit 2025-01-06: …that release was broken, I forgot to include a file. Silly me.

Just always get the latest from the repo, please :grin: