System packet loss when Checkmk is running

CMK version: 2.1.0p27
OS version: Ubuntu 22.04 LTS

Hello everyone,

we are currently setting up a new Checkmk instance and see up to 20% packet loss on the system whenever Checkmk is running. We first tried to upgrade our old 1.6 sites to 2.1 but ran into a few problems, including packet loss. That’s why we decided to do a clean install, but we see the same packet loss there.

Our old Checkmk 1.6 site on an old Debian 7 server is in the same VLAN and works perfectly fine.
Our new VM has 8 GB RAM and 6 vCPUs with 6 checkers.

As soon as we stop the site we don’t have any packet loss anymore.

The packet loss happens when pinging both public IP addresses (Cloudflare) and private IP addresses in the same network.

We currently have 116 hosts and 161 services configured.
All other sites are disabled.

Did anyone have a similar issue?

Please let me know what additional information is needed.

Thanks in advance!

Edit: If we disable host checks and service checks it works again.
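
To put numbers on the comparison described above (site running vs. site stopped, public vs. private target), a small script along these lines might help. It is only a sketch: it assumes the iputils `ping` that ships with Ubuntu and parses its summary line, the helper names are made up for illustration, and `192.0.2.10` is a placeholder that has to be replaced with a real host in the local network.

```python
import re
import subprocess

# Matches the summary line printed by iputils ping, e.g.
# "50 packets transmitted, 40 received, 20% packet loss, time 49071ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output: str) -> float:
    """Return the loss percentage from the ping summary line."""
    match = LOSS_RE.search(ping_output)
    if match is None:
        raise ValueError("no packet loss summary found in ping output")
    return float(match.group(1))


def measure_loss(target: str, count: int = 50) -> float:
    """Ping `target` `count` times and return the reported loss in percent."""
    result = subprocess.run(
        ["ping", "-q", "-c", str(count), target],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero when packets are lost
    )
    return parse_packet_loss(result.stdout)


if __name__ == "__main__":
    # 1.1.1.1 is a Cloudflare address; 192.0.2.10 is a placeholder.
    for target in ("1.1.1.1", "192.0.2.10"):
        print(f"{target}: {measure_loss(target)}% loss")
```

Running it once with the site started and once with it stopped gives two comparable sets of numbers.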

Hello @adrianzech,
the number of hosts should not matter for packet loss. If you had too many hosts that are “unhandled” or unreachable, only the performance should get worse if you don’t have enough handlers for them.

Did you check your interface and network? Checkmk Smart Ping is pretty fast compared to normal ping. I would guess it’s more of a network/routing problem if the other Checkmk instances in the same network don’t have this problem.
Probably the new IP address is causing a problem in the network.

I haven’t heard of this problem in connection with Checkmk, only with hardware-related causes.

Hello @Kruzgoth,

thank you for your response!
A network problem was also my first thought, but if we have the Checkmk site running and disable host and service checks, all packet loss disappears. That’s why I think it’s probably a Checkmk problem.
I tried changing the ping bursts from 6 packets to 1, but with no luck.

Well, if you disable host and service checks, there is no more traffic :wink:
That explains why the packet loss goes away.

There’s not much traffic when the checks are running. We’re not even hitting 200kb/s but have 40% packet loss in Ubuntu itself.

You did mention you are using a VM. Which hypervisor and which kernel are you running? I remember similar problems under Xen where the grant table settings were simply insufficient.

We are running Nutanix AHV and kernel 5.15.0-71-generic. Grant tables in Xen are for memory sharing, correct? We have a few more Ubuntu 22.04 VMs with kernel 5.15.0-71-generic running on the same hypervisor with no problems.

Yes, they are for memory sharing, especially for forwarding disk and network IO between the controlling domain and the unprivileged domains.

The host checks via Smart Ping might consume more grants than expected. Try to get into a state where you are seeing constant packet loss, then check the kernel ring buffer for messages containing “grant”.

Thanks for the tip, but we couldn’t find anything about grant tables. I might try contacting Nutanix support to see if they know something.

Was this a kernel setting in Xen that you had to change to fix the grants problem?

Linux KVM in particular has packet loss issues with regard to virtual networking. AFAIK, this is an ongoing issue. Also, sometimes you don’t notice these things until you put on some extra load (however, in KVM you will probably see “some” regardless).

Sorry, the string to grep for is “gnt”. But some kernels might not report these issues. Changing grant table settings is usually done by setting boot parameters for the hypervisor. Please ask Nutanix support how their default settings deal with many small packets.

Edit: I am unsure which hypervisor your solution uses and which network virtualization. We might go back and check what’s in /sys/hypervisor…
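
The two checks above (searching the kernel ring buffer for “gnt”/“grant” and looking into /sys/hypervisor) could be combined roughly like this. Treat it as a sketch under assumptions: the function names are invented for illustration, reading `dmesg` may need root on Ubuntu, and `/sys/hypervisor/type` is, as far as I know, mainly populated on Xen guests, so a missing file does not prove anything by itself.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List, Optional


def grant_messages(lines: Iterable[str], needles=("gnt", "grant")) -> List[str]:
    """Keep kernel log lines that mention any of the needles (case-insensitive).

    Both needles are needed: "gnt" does not match "grant" and vice versa.
    Unrelated lines containing e.g. "granted" will also show up.
    """
    return [line for line in lines if any(n in line.lower() for n in needles)]


def hypervisor_type(sysfs_root: str = "/sys/hypervisor") -> Optional[str]:
    """Return the content of <sysfs_root>/type, or None if it cannot be read."""
    try:
        return (Path(sysfs_root) / "type").read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    log = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    ).stdout
    for line in grant_messages(log.splitlines()):
        print(line)
    print("hypervisor type:", hypervisor_type() or "not reported")
```

It is best run while the packet loss is actually occurring, so that any related kernel messages are still in the ring buffer.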

I checked again for “gnt” but still found nothing. It might really be a KVM problem, as Nutanix AHV is based on KVM.

I’ve opened a case with Nutanix and will update this post as soon as I know more.

Thank you!

You might want to check whether virtio is used for network interface emulation. As this person experienced with rtl8139 (unlikely in your case, since it only supports 100 Mbit/s), it might also happen with e1000.
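
One way to see which driver backs each interface is to follow the `device/driver` symlink in sysfs, which is the same information `ethtool -i` reports as the driver. A minimal sketch, assuming the usual Linux sysfs layout; `nic_driver` is just an illustrative helper name:

```python
import os
from typing import Optional


def nic_driver(interface: str, sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the kernel driver name bound to `interface`, or None.

    The name is the last path component of the device/driver symlink,
    e.g. "virtio_net" for a virtio NIC or "e1000" for an emulated Intel NIC.
    Purely virtual devices such as "lo" have no such link.
    """
    link = os.path.join(sysfs_net, interface, "device", "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        return None


if __name__ == "__main__":
    for name in sorted(os.listdir("/sys/class/net")):
        print(name, nic_driver(name) or "no driver link (virtual device)")
```

If this prints `e1000` or `8139cp`/`8139too` rather than `virtio_net` for the main interface, the NIC is being emulated instead of paravirtualized.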
