BUG: Mk_docker and Proxmox agent broken on cgroupv2 environments (container level): "Parsing of section docker_container_mem failed"

Hi folks!
On my CMK Raw installation (2.0.0p7) I’m monitoring multiple Docker nodes. Two of them run Debian 11 with kernel 5.10, Docker 20.10.7, Python 3.9 and python3-docker 4.1.0. On these hosts the agent plugin delivers more data, and different keys in the JSON dict, than on the older Docker nodes.

This leads to the following issue at container level for containers running on the newer machines: "Parsing of section docker_container_mem failed".


The mk_docker agent plugin on Debian 11 produces the following output:

<<<docker_container_mem:sep(0):cached(1626459022,90)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "4.1.0", "ApiVersion": "1.41"}
{"usage": 164048896, "stats": {"active_anon": 0, "active_file": 10813440, "anon": 142491648, "anon_thp": 0, "file": 17977344, "file_dirty": 0, "file_mapped": 15273984, "file_writeback": 0, "inactive_anon": 144404480, "inactive_file": 7299072, "kernel_stack": 393216, "pgactivate": 3168, "pgdeactivate": 0, "pgfault": 58740, "pglazyfree": 0, "pglazyfreed": 0, "pgmajfault": 5643, "pgrefill": 0, "pgscan": 0, "pgsteal": 0, "shmem": 0, "slab": 790440, "slab_reclaimable": 525872, "slab_unreclaimable": 264568, "sock": 0, "thp_collapse_alloc": 0, "thp_fault_alloc": 0, "unevictable": 0, "workingset_activate": 0, "workingset_nodereclaim": 0, "workingset_refault": 0}, "limit": 16786518016}

Example agent output from a correctly working host (Debian 10):

<<<docker_container_mem:sep(0):cached(1626459061,90)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "3.4.1", "ApiVersion": "1.41"}
{"usage": 230289408, "max_usage": 232685568, "stats": {"active_anon": 185704448, "active_file": 8626176, "cache": 15503360, "dirty": 0, "hierarchical_memory_limit": 9223372036854771712, "hierarchical_memsw_limit": 0, "inactive_anon": 0, "inactive_file": 6766592, "mapped_file": 12435456, "pgfault": 934560, "pgmajfault": 297, "pgpgin": 864765, "pgpgout": 815662, "rss": 185839616, "rss_huge": 0, "total_active_anon": 185704448, "total_active_file": 8626176, "total_cache": 15503360, "total_dirty": 0, "total_inactive_anon": 0, "total_inactive_file": 6766592, "total_mapped_file": 12435456, "total_pgfault": 934560, "total_pgmajfault": 297, "total_pgpgin": 864765, "total_pgpgout": 815662, "total_rss": 185839616, "total_rss_huge": 0, "total_unevictable": 0, "total_writeback": 0, "unevictable": 0, "writeback": 0}, "limit": 16821669888}

I already tested multiple versions of the Docker Python API (3.7.3, 4.1.0 and 5.0.0), but this does not change the output of the mk_docker plugin.
Do you have any ideas how to fix this issue?
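
For anyone comparing the two outputs above: the Debian 11 sample lacks keys such as "cache", "rss" and "max_usage" and instead carries "anon" and "file". The following is only a rough sketch of how the two layouts could be told apart and handled side by side. It is not the code used by the Checkmk check; the key sets are an assumption based solely on the two samples in this post, and treating "file" as the counterpart of "cache" is likewise an assumption.

```python
import json

# Assumption: these keys distinguish the two layouts. They are taken only
# from the two sample outputs in this thread, not from the Checkmk sources.
V1_ONLY_KEYS = {"cache", "rss", "hierarchical_memory_limit"}
V2_ONLY_KEYS = {"anon", "file"}


def detect_layout(stats: dict) -> str:
    """Guess the memory stats layout ('v1', 'v2' or 'unknown') from its keys."""
    keys = set(stats)
    if V1_ONLY_KEYS & keys:
        return "v1"
    if V2_ONLY_KEYS & keys:
        return "v2"
    return "unknown"


def usage_without_page_cache(raw_line: str) -> int:
    """Return 'usage' minus the page cache counter of the detected layout."""
    data = json.loads(raw_line)
    stats = data["stats"]
    layout = detect_layout(stats)
    if layout == "v1":
        cache = stats["cache"]
    elif layout == "v2":
        # Assumption: "file" plays the role that "cache" plays in the old layout.
        cache = stats["file"]
    else:
        raise ValueError("unrecognised memory stats layout")
    return data["usage"] - cache


# Shortened versions of the two JSON lines from this post.
debian11 = '{"usage": 164048896, "stats": {"anon": 142491648, "file": 17977344}, "limit": 16786518016}'
debian10 = '{"usage": 230289408, "max_usage": 232685568, "stats": {"cache": 15503360, "rss": 185839616}, "limit": 16821669888}'

print(usage_without_page_cache(debian11))  # 146071552
print(usage_without_page_cache(debian10))  # 214786048
```

A parser that only looks for the old keys would fail on the Debian 11 line with a missing key, which would match the symptom described here.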

Same problem for me on 2.0.0p8cee with Debian 10:

<<<docker_container_mem:sep(0)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "5.0.0", "ApiVersion": "1.41"}
{"usage": 51224576, "stats": {"active_anon": 7942144, "active_file": 8359936, "anon": 25845760, "anon_thp": 0, "file": 19324928, "file_dirty": 135168, "file_mapped": 1486848, "file_writeback": 270336, "inactive_anon": 18014208, "inactive_file": 11030528, "kernel_stack": 98304, "pgactivate": 2508, "pgdeactivate": 492161, "pgfault": 102137376, "pglazyfree": 16971141, "pglazyfreed": 462, "pgmajfault": 2508, "pgrefill": 514496, "pgscan": 2408497, "pgsteal": 2402854, "shmem": 0, "slab": 5476352, "slab_reclaimable": 3026944, "slab_unreclaimable": 2449408, "sock": 20480, "thp_collapse_alloc": 0, "thp_fault_alloc": 0, "unevictable": 0, "workingset_activate": 5181, "workingset_nodereclaim": 0, "workingset_refault": 75867}, "limit": 8341946368}

Maybe it’s related to the dockerd version itself?

docker -v
Docker version 20.10.2, build 2291f61

Got the same problems after upgrading my Docker hosts from Debian 10 to 11. Tried the latest Checkmk 2.0.0p11 and the latest agent with the latest plugin. No change.

This is probably related to Debian switching to cgroupv2 with version 11.
There will be a fix in version 2.1: Werk #12310
A workaround could be to switch back to cgroupv1 on the host running Docker.
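
To check which hierarchy a host is actually using, a small sketch like the following may help. It relies on the fact that the root of a unified cgroupv2 mount contains a `cgroup.controllers` file; the function name and the default path argument are just illustrative choices and not part of any Checkmk tooling.

```python
from pathlib import Path


def cgroup_version(root: Path = Path("/sys/fs/cgroup")) -> int:
    """Return 2 if `root` looks like a unified cgroupv2 mount, else 1."""
    # On a unified (v2) hierarchy the root directory contains the file
    # cgroup.controllers; on a legacy or hybrid setup it does not.
    return 2 if (root / "cgroup.controllers").is_file() else 1


if __name__ == "__main__":
    print(f"This host appears to use cgroupv{cgroup_version()}")
```

Run it on the Docker host itself (not inside a container) to see whether the host is affected.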


I can confirm that after configuring my Debian 11 system to use the legacy cgroup hierarchy (cgroupv1), the checks are working again. The containers still run as expected, so I will leave it at that, wait for version 2.1 and change it back then.
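
The post above does not say how the switch was done. One common way on systemd-based systems is the kernel command-line option `systemd.unified_cgroup_hierarchy=0`; whether that is what was used here is an assumption. Under that assumption, a minimal sketch to verify that the running kernel was booted with the legacy hierarchy requested could look like this (the helper name is made up for illustration):

```python
from pathlib import Path

PARAM = "systemd.unified_cgroup_hierarchy"
# Values treated as "false" here; an assumption about accepted boolean spellings.
FALSY = {"0", "false", "no", "off"}


def legacy_hierarchy_requested(cmdline: str) -> bool:
    """Return True if the kernel command line disables the unified hierarchy."""
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name == PARAM and sep and value.lower() in FALSY:
            return True
    return False


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.is_file():
        print(legacy_hierarchy_requested(proc_cmdline.read_text()))
```

This only inspects the boot parameters. It does not change anything on the system.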

I must say that this problem is a little wider than just the Docker plugin no longer working. The current Proxmox version also uses cgroupv2 and needs this fix. I cannot understand why this change is classified as a new feature. It is a classic bugfix for the Docker and Proxmox agents and should be included in a normal patch release.


I changed the title. As written above, this should be included in a normal patch release.
@LaMi

Switching back to the old cgroup hierarchy is not an option with current Proxmox setups.

We have already scheduled the backport of the change to 2.0. The process will take some days, but it should be done within one of the next patch releases.


Thanks @LaMi, it was only unclear because the workaround is not possible in production environments :wink:
and it was not mentioned that the fix will also be included in 2.0, given that it was already made in March.

All good. Thanks for the signal.

The backport is planned for the current sprint. Cheers


Hey,

Any news about this?
There have been two new patch releases, but still no sign of this fix.

Thanks for your hard work!

Best regards

It’s a bit more complicated than we thought to solve this while ensuring that it doesn’t break anything (this is always our issue: we can’t just fix it like that, because we need to test every change thoroughly). Our admin is testing it at the moment and has discovered some new issues, which we now need to fix.


This is exactly the point where it would be good to know what the problem in your testing environment is.
I tested this “workaround” on some Proxmox hosts and did not have any problems there.
At the moment it “looks like” no one is doing anything. I think you know what I mean :wink:


I can assure you that we have easily put 20+ hours into fixing this. We are reworking the entire LXC and Docker checks to the new Check-API as part of this fix. So we are not just doing a workaround, but a proper long-term fix, which also helps us maintain this plug-in more easily in the future.

We are almost done with it.
Is anyone interested in testing this? It works in our environment, but it would be good if some users could try it out as well.


Hey,

As I am using Proxmox and Docker, I would test it to see if it works in my environment.

Best regards

Live now with Checkmk 2.0.0p19 with Werk 12307
