BUG: Mk_docker and Proxmox agent broken on cgroupv2 environments (container level): "Parsing of section docker_container_mem failed"

Hi folks!
On my CMK Raw installation (2.0.0p7), I’m monitoring multiple Docker nodes. Two of them run Debian 11 with kernel 5.10, Docker 20.10.7, Python 3.9, and python3-docker 4.1.0. On these two, the agent plugin delivers more data and different keys in the JSON dict than on the older Docker nodes.

This leads to the following issue at container level on the newer machines: "Parsing of section docker_container_mem failed".

The mk_docker agent plugin on Debian 11 produces the following output:

<<<docker_container_mem:sep(0):cached(1626459022,90)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "4.1.0", "ApiVersion": "1.41"}
{"usage": 164048896, "stats": {"active_anon": 0, "active_file": 10813440, "anon": 142491648, "anon_thp": 0, "file": 17977344, "file_dirty": 0, "file_mapped": 15273984, "file_writeback": 0, "inactive_anon": 144404480, "inactive_file": 7299072, "kernel_stack": 393216, "pgactivate": 3168, "pgdeactivate": 0, "pgfault": 58740, "pglazyfree": 0, "pglazyfreed": 0, "pgmajfault": 5643, "pgrefill": 0, "pgscan": 0, "pgsteal": 0, "shmem": 0, "slab": 790440, "slab_reclaimable": 525872, "slab_unreclaimable": 264568, "sock": 0, "thp_collapse_alloc": 0, "thp_fault_alloc": 0, "unevictable": 0, "workingset_activate": 0, "workingset_nodereclaim": 0, "workingset_refault": 0}, "limit": 16786518016}

Example of agent output from a correctly working host (Debian 10):

<<<docker_container_mem:sep(0):cached(1626459061,90)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "3.4.1", "ApiVersion": "1.41"}
{"usage": 230289408, "max_usage": 232685568, "stats": {"active_anon": 185704448, "active_file": 8626176, "cache": 15503360, "dirty": 0, "hierarchical_memory_limit": 9223372036854771712, "hierarchical_memsw_limit": 0, "inactive_anon": 0, "inactive_file": 6766592, "mapped_file": 12435456, "pgfault": 934560, "pgmajfault": 297, "pgpgin": 864765, "pgpgout": 815662, "rss": 185839616, "rss_huge": 0, "total_active_anon": 185704448, "total_active_file": 8626176, "total_cache": 15503360, "total_dirty": 0, "total_inactive_anon": 0, "total_inactive_file": 6766592, "total_mapped_file": 12435456, "total_pgfault": 934560, "total_pgmajfault": 297, "total_pgpgin": 864765, "total_pgpgout": 815662, "total_rss": 185839616, "total_rss_huge": 0, "total_unevictable": 0, "total_writeback": 0, "unevictable": 0, "writeback": 0}, "limit": 16821669888}

I already tested multiple versions of the Docker Python API (3.7.3, 4.1.0 and 5.0.0), but this does not affect the output of the mk_docker plugin.
Do you have any ideas on how to fix this issue?

Same problem for me on 2.0.0p8cee with Debian 10:

<<<docker_container_mem:sep(0)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "5.0.0", "ApiVersion": "1.41"}
{"usage": 51224576, "stats": {"active_anon": 7942144, "active_file": 8359936, "anon": 25845760, "anon_thp": 0, "file": 19324928, "file_dirty": 135168, "file_mapped": 1486848, "file_writeback": 270336, "inactive_anon": 18014208, "inactive_file": 11030528, "kernel_stack": 98304, "pgactivate": 2508, "pgdeactivate": 492161, "pgfault": 102137376, "pglazyfree": 16971141, "pglazyfreed": 462, "pgmajfault": 2508, "pgrefill": 514496, "pgscan": 2408497, "pgsteal": 2402854, "shmem": 0, "slab": 5476352, "slab_reclaimable": 3026944, "slab_unreclaimable": 2449408, "sock": 20480, "thp_collapse_alloc": 0, "thp_fault_alloc": 0, "unevictable": 0, "workingset_activate": 5181, "workingset_nodereclaim": 0, "workingset_refault": 75867}, "limit": 8341946368}

Maybe it's related to the dockerd version itself?

docker -v
Docker version 20.10.2, build 2291f61

I got the same problems after upgrading my Docker hosts from Debian 10 to 11. I tried the latest Checkmk 2.0.0p11 and the latest agent with the latest plugin. No change.

This is probably related to Debian switching to cgroupv2 with version 11.
There will be a fix for version 2.1: Werk #12310
A workaround could be to downgrade to cgroupv1 on the host running docker.
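
To illustrate the difference, here is a minimal sketch that guesses the cgroup version from the key names in the "stats" dict of a docker_container_mem line. The helper name guess_cgroup_version is hypothetical and not part of mk_docker or Checkmk, and the key sets are only inferred from the two agent outputs quoted in this thread (shortened here), so treat it as a diagnostic aid rather than as how the check plugin works.

```python
import json


def guess_cgroup_version(section_line: str) -> int:
    """Hypothetical helper: guess the cgroup version from the memory stats keys.

    Assumption based on the outputs in this thread: cgroupv1 hosts report
    "rss" and "cache", cgroupv2 hosts report "anon" and "file" instead.
    """
    stats = json.loads(section_line).get("stats", {})
    if "rss" in stats and "cache" in stats:
        return 1
    if "anon" in stats and "file" in stats:
        return 2
    raise ValueError("unrecognized memory stats layout")


# Shortened samples taken from the agent output quoted above.
debian11_line = (
    '{"usage": 164048896, "stats": {"anon": 142491648, "file": 17977344, '
    '"inactive_file": 7299072}, "limit": 16786518016}'
)
debian10_line = (
    '{"usage": 230289408, "max_usage": 232685568, "stats": {"cache": 15503360, '
    '"rss": 185839616, "inactive_file": 6766592}, "limit": 16821669888}'
)

print(guess_cgroup_version(debian11_line))
print(guess_cgroup_version(debian10_line))
```

Any parser that expects keys such as "rss" or "cache" cannot find them in the Debian 11 output, which would be consistent with the parsing failure reported here.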

I can confirm that after configuring my Debian 11 system to use the legacy cgroup hierarchy (cgroupv1), the checks work again. The containers still run as expected, so I will leave it at that, wait for version 2.1, and change it back then.
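
For anyone who wants to check which hierarchy a host is currently using, a small sketch: on a unified cgroupv2 mount, the file cgroup.controllers exists at the root of /sys/fs/cgroup. The function name cgroup_mode and its return strings are made up for this example, and it does not distinguish pure cgroupv1 from a hybrid setup.

```python
from pathlib import Path


def cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Hypothetical helper: report whether root looks like a cgroupv2 mount.

    Returns "v2" if cgroup.controllers exists directly under root,
    otherwise "not-v2" (legacy cgroupv1, hybrid, or no cgroup mount at all).
    """
    if (Path(root) / "cgroup.controllers").is_file():
        return "v2"
    return "not-v2"


print(cgroup_mode())
```

Running it on the Docker host before and after the switch should show whether the change actually took effect.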

I must say that this problem is a bit wider than just the Docker plugin no longer working. Proxmox in its current version also uses cgroupv2 and needs this fix. I cannot understand why this change is classified as a new feature. This is a classic bugfix for the Docker and Proxmox agents, and it should be included in a normal patch release.

I changed the title. As written above, this should be included in a normal patch release.
@LaMi

Switching to the old cgroup hierarchy is not an option with current Proxmox setups.

We have already scheduled the backport of the change to 2.0. The process will take a few days, but it should be done in one of the next patch releases.

Thanks @LaMi, it was only unclear because the workaround is not possible in production environments :wink:
and because it was not mentioned that the fix will also be included in 2.0, given that it was already made in March.

All good. Thanks for the signal.

The backport is planned for the current sprint. Cheers
