Raspberry Pi 4 freezes for 1 hour after change commits

**CMK version:** 2.1.0p13 raw edition (compiled from chrisss404 on github)
**OS version:** Raspbian GNU/Linux 11 (bullseye)

Hello,

I’m experimenting with how far I can push a Raspberry Pi 4 as a distributed monitoring slave site in terms of services and hosts.

SETUP:
I manage to run 100 hosts and 3000 services on a Raspberry Pi 4 Compute Module with active cooling, overclocked to 1.8 GHz, with 2 GB RAM and a SanDisk Extreme Pro 32 GB card.

Half of the Checks are SNMP and the other half Checkmk Agents.

Some of the performance tweaks I made: only 2 simultaneous checks are allowed in nagios.cfg, and only 8 Apache processes. The check intervals are 15 minutes for SNMP and 10 minutes for agents. Software inventory runs once per day and discovery only manually. “Delay precompiling of host checks” is also on.
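For anyone wanting to verify the concurrency limit, here is a minimal Python sketch that reads a directive out of nagios.cfg-style text. I am assuming the limit above is set through the Nagios Core directive `max_concurrent_checks`; the sample text is a made-up excerpt, not my real config, and `read_directive` is just a helper name for this example.

```python
def read_directive(cfg_text, name):
    """Return the last value assigned to `name` in key=value config text, or None."""
    value = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()
    return value


# Hypothetical excerpt, for illustration only.
sample = """
# limit parallel checks on the small box
max_concurrent_checks=2
check_result_reaper_frequency=10
"""

print(read_directive(sample, "max_concurrent_checks"))
```

To use it on a real site you would pass in the contents of the site's nagios.cfg instead of `sample`.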

With all those tweaks, the 15-minute load average of the RPi 4 is around 0.8.

ERROR:
This is where things get weird. When I commit a change, the slave site SOMETIMES goes “dead” for anywhere from 10 minutes up to 3 hours.
During that time, the whole OS of the Raspberry Pi freezes and only ping works. Sometimes performance data is still collected by the Checkmk agent on the site itself, and it shows a load average of around 40 for that period. I can’t see any higher RAM utilization than normal. The site only uses about 1 GB of RAM and there is always about 700 MB free. The site also isn’t actively swapping.
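On Linux the load average also counts tasks stuck in uninterruptible sleep (usually waiting on disk), so a load of 40 does not by itself mean the CPU is busy. A rough Python sketch for telling the two apart by comparing two `/proc/stat` samples is below. The function names are made up for this example and the 5 second interval is an arbitrary choice.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two samples (0.0 to 1.0)."""
    deltas = [b - a for a, b in zip(before, after)]
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already included in user/nice, so only sum the first 8.
    total = sum(deltas[:8])
    return deltas[4] / total if total else 0.0


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    return tuple(float(x) for x in text.split()[:3])


def snapshot(interval=5.0):
    """Sample the live system; returns (load averages, iowait share)."""
    with open("/proc/stat") as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_line(f.read())
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    return load, iowait_share(before, after)
```

Calling `snapshot()` in a loop and logging the result to a file would show, after the freeze is over, whether the load was mostly I/O wait or real CPU work.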

At some point, eventually, the site comes back and the commit was successful and everything works.

QUESTION:
Does anybody know what a commit does that totally crushes the CPU of the Raspberry Pi 4? It only happens sometimes. The first one or two commits after a fresh reboot are fine most of the time. I would really like to be able to tweak the commit process if possible.

The flash storage card will not be able to cope with the IO that is generated by checkmk.

Apart from this error situation, you will see rapid wear of that card. Checkmk is constantly writing to it.

Thank you for your answer.

So the commit generates a lot of I/O which the SD card cannot cope with. That makes sense, since the OS doesn’t have any resources left and then freezes for a while. I wonder which part of the commit generates most of the I/O.

I’m not that concerned about the wear of the SD card. The SanDisk Extreme and many other SD cards seem to have had wear leveling like SSDs for a few years now. They can be overwritten daily for many years.
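As a back-of-the-envelope check, the lifetime can be estimated as capacity times rated write cycles divided by the amount written per day. A small Python sketch, assuming ideal wear leveling: only the 32 GB capacity comes from my setup, while the cycle count, daily write volume and write amplification below are placeholder values I made up, not measurements or vendor specs.

```python
def card_lifetime_years(capacity_gb, pe_cycles, daily_writes_gb, write_amplification=1.0):
    """Estimate flash lifetime in years, assuming ideal wear leveling."""
    if daily_writes_gb <= 0 or write_amplification <= 0:
        raise ValueError("daily_writes_gb and write_amplification must be positive")
    # Total data the flash can absorb before every cell hits its cycle limit.
    total_writable_gb = capacity_gb * pe_cycles
    # Physical writes per day are host writes times the amplification factor.
    days = total_writable_gb / (daily_writes_gb * write_amplification)
    return days / 365.0


# Placeholder inputs, for illustration only.
estimate = card_lifetime_years(32, pe_cycles=3000, daily_writes_gb=20, write_amplification=4.0)
print(f"{estimate:.1f} years")
```

Plugging in the real daily write volume (for example from the disk I/O graphs of the site) would give a more honest number than my gut feeling.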

I have successfully reduced the freezing time.

Most of the I/O seems to be caused by having rules in the main folder. I had a ruleset of 300 rules, half of them in the main folder and matched by host tags.

By putting the rules in folders as close to the matching host as possible, the performance was improved by a factor of 10.

The commit time and freeze was reduced from around 1-2 hours to just 5-10 minutes.
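My guess at why this helps, as a toy Python model: a rule in the main folder is a candidate for every host, while a rule in a subfolder is only a candidate for the hosts in that folder. I do not know how Checkmk evaluates rules internally, so this only counts (rule, host) pairs under that assumption. The 100 hosts and 300 rules are from my setup; the split into 10 folders of 10 hosts each is hypothetical.

```python
def evaluations(hosts_per_folder, rules_in_main, rules_per_folder):
    """Count (rule, host) pairs that have to be considered in the toy model."""
    total_hosts = sum(hosts_per_folder.values())
    # Every main-folder rule is a candidate for every host.
    main_cost = rules_in_main * total_hosts
    # A folder rule is only a candidate for the hosts inside that folder.
    folder_cost = sum(
        rules_per_folder.get(folder, 0) * n_hosts
        for folder, n_hosts in hosts_per_folder.items()
    )
    return main_cost + folder_cost


# Hypothetical layout: 10 folders with 10 hosts each.
folders = {f"folder{i}": 10 for i in range(10)}

# Before: 150 rules in the main folder, 150 spread evenly over the folders.
before = evaluations(folders, 150, {f: 15 for f in folders})
# After: all 300 rules spread evenly over the folders.
after = evaluations(folders, 0, {f: 30 for f in folders})

print(before, after)
```

The toy numbers do not reproduce my factor of 10 exactly, but they point in the same direction: the fewer hosts a rule can possibly apply to, the less work there is per commit.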


Total noob here WRT Checkmk but I have a lot of experience with Raspberry Pis including the Compute module (CM4.)

I highly recommend you run off an SSD rather than an SD card. Performance for anything that involves disk I/O will benefit a lot. In addition, an SSD will far outlive an SD card in this environment.

One of the things I really like about the CM4 is that it exposes a PCIe lane and you can get a PCIe 1x/NVME adapter for the official I/O board or use something like the “Ether Board” which provides a small board with NVME slot (limited to 2230 form factor.)

Incidentally, I found this because I was looking to see if I can use Checkmk to monitor a handful of Raspberry Pis as well as some of my other stuff.

HTH


Thank you for that suggestion. I will equip the compute module with an NVMe SSD and move /omd to a partition on that drive to see if that brings even more improvements.

With my fix above it still runs off the SD card, and I have about 1 hour of downtime per week when the site does I/O-heavy things.

I have tested the site on a CM4 Lite 4 GB, inside a Waveshare CM4-IO-BASE-B with an NVMe SSD attached directly to the PCIe x1 lane.

I’ve tested it for around a month, and it sadly still displays the same behavior. Sometimes it just freezes for an hour or two. I have noticed that it mostly happens when you open up the main Checkmk Website, or commit changes. Then the distributed monitoring site goes dead for a while.

I don’t think it’s an I/O problem. The NVME SSD has random 4k I/O Read 28k and Write 28k.

My suspicion is the rrd database files. Probably when the retention of the data happens, the whole site freezes for a while. The CPU of the Raspberry Pi probably isn’t fast enough for that workload once there are more than around 2000 services.
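To get a feeling for how much RRD data the site is dealing with, something like the following Python sketch could count the files and their total size. The function name is made up for this example, and the path in the usage note is my assumption about where a raw edition site keeps its RRDs, so adjust it to your site.

```python
import os


def rrd_summary(root):
    """Return (number of .rrd files, total size in bytes) below `root`."""
    count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".rrd"):
                count += 1
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return count, total_bytes
```

For example, `rrd_summary("/omd/sites/mysite/var/pnp4nagios/perfdata")`, where `mysite` is a placeholder site name and the directory is an assumed location. This only shows the volume of data, not whether RRD updates are actually what causes the freeze.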