BUG: Mk_docker and Proxmox agent broken on cgroupv2 environments (container level): "Parsing of section docker_container_mem failed"

thorsten.spille · July 16, 2021, 8:00pm

Hi folks!
On my CMK RAW installation (2.0.0p7) I’m monitoring multiple Docker nodes. Two of them run Debian 11 with kernel 5.10, Docker 20.10.7, Python 3.9 and python3-docker 4.1.0. On these two nodes the agent plugin delivers more data, and different keys in the JSON dict, than on the older Docker nodes.

This leads to the following issue at container level on the newer machines: "Parsing of section docker_container_mem failed"

The mk_docker agent plugin on Debian 11 produces the following output:

<<<docker_container_mem:sep(0):cached(1626459022,90)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "4.1.0", "ApiVersion": "1.41"}
{"usage": 164048896, "stats": {"active_anon": 0, "active_file": 10813440, "anon": 142491648, "anon_thp": 0, "file": 17977344, "file_dirty": 0, "file_mapped": 15273984, "file_writeback": 0, "inactive_anon": 144404480, "inactive_file": 7299072, "kernel_stack": 393216, "pgactivate": 3168, "pgdeactivate": 0, "pgfault": 58740, "pglazyfree": 0, "pglazyfreed": 0, "pgmajfault": 5643, "pgrefill": 0, "pgscan": 0, "pgsteal": 0, "shmem": 0, "slab": 790440, "slab_reclaimable": 525872, "slab_unreclaimable": 264568, "sock": 0, "thp_collapse_alloc": 0, "thp_fault_alloc": 0, "unevictable": 0, "workingset_activate": 0, "workingset_nodereclaim": 0, "workingset_refault": 0}, "limit": 16786518016}

Example agent output from a correctly working host (Debian 10):

<<<docker_container_mem:sep(0):cached(1626459061,90)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "3.4.1", "ApiVersion": "1.41"}
{"usage": 230289408, "max_usage": 232685568, "stats": {"active_anon": 185704448, "active_file": 8626176, "cache": 15503360, "dirty": 0, "hierarchical_memory_limit": 9223372036854771712, "hierarchical_memsw_limit": 0, "inactive_anon": 0, "inactive_file": 6766592, "mapped_file": 12435456, "pgfault": 934560, "pgmajfault": 297, "pgpgin": 864765, "pgpgout": 815662, "rss": 185839616, "rss_huge": 0, "total_active_anon": 185704448, "total_active_file": 8626176, "total_cache": 15503360, "total_dirty": 0, "total_inactive_anon": 0, "total_inactive_file": 6766592, "total_mapped_file": 12435456, "total_pgfault": 934560, "total_pgmajfault": 297, "total_pgpgin": 864765, "total_pgpgout": 815662, "total_rss": 185839616, "total_rss_huge": 0, "total_unevictable": 0, "total_writeback": 0, "unevictable": 0, "writeback": 0}, "limit": 16821669888}
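To make the difference between the two payloads above concrete, here is a minimal Python sketch. It only compares key names; the key sets and the function names `guess_cgroup_layout` and `missing_v1_keys` are assumptions made for illustration and are not taken from the Checkmk parser. The sample dicts are shortened copies of the two agent lines quoted above.

```python
import json

# Stats keys that appear only in the Debian 10 sample (illustrative subset).
V1_ONLY_STATS = {"cache", "rss", "total_cache", "total_inactive_file"}
# Stats keys that appear only in the Debian 11 sample (illustrative subset).
V2_ONLY_STATS = {"anon", "file", "file_mapped", "slab"}


def guess_cgroup_layout(payload: dict) -> str:
    """Guess which layout a docker_container_mem payload uses, by key names."""
    stats = set(payload.get("stats", {}))
    if V1_ONLY_STATS & stats:
        return "v1"
    if V2_ONLY_STATS & stats:
        return "v2"
    return "unknown"


def missing_v1_keys(payload: dict) -> list:
    """List the keys a hypothetical v1-only parser would not find."""
    stats = set(payload.get("stats", {}))
    missing = sorted(V1_ONLY_STATS - stats)
    if "max_usage" not in payload:
        missing.append("max_usage")
    return missing


# Shortened copies of the two agent lines quoted above.
debian11 = json.loads(
    '{"usage": 164048896, "stats": {"anon": 142491648, "file": 17977344, '
    '"inactive_file": 7299072}, "limit": 16786518016}'
)
debian10 = json.loads(
    '{"usage": 230289408, "max_usage": 232685568, "stats": {"cache": 15503360, '
    '"rss": 185839616, "total_cache": 15503360, "total_inactive_file": 6766592, '
    '"inactive_file": 6766592}, "limit": 16821669888}'
)

for name, payload in (("debian11", debian11), ("debian10", debian10)):
    print(name, guess_cgroup_layout(payload), missing_v1_keys(payload))
```

A parser that indexes the stats dict directly by names such as `cache` or `rss` would raise a `KeyError` on the Debian 11 payload, which would be consistent with the parsing failure reported here.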

I have already tested multiple versions of the Docker Python API (3.7.3, 4.1.0 and 5.0.0), but this does not affect the output of the mk_docker plugin.
Do you have any ideas how to fix this issue?

CyberLine · July 21, 2021, 6:52am

Same problem for me on 2.0.0p8cee with Debian 10:

<<<docker_container_mem:sep(0)>>>
@docker_version_info{"PluginVersion": "0.1", "DockerPyVersion": "5.0.0", "ApiVersion": "1.41"}
{"usage": 51224576, "stats": {"active_anon": 7942144, "active_file": 8359936, "anon": 25845760, "anon_thp": 0, "file": 19324928, "file_dirty": 135168, "file_mapped": 1486848, "file_writeback": 270336, "inactive_anon": 18014208, "inactive_file": 11030528, "kernel_stack": 98304, "pgactivate": 2508, "pgdeactivate": 492161, "pgfault": 102137376, "pglazyfree": 16971141, "pglazyfreed": 462, "pgmajfault": 2508, "pgrefill": 514496, "pgscan": 2408497, "pgsteal": 2402854, "shmem": 0, "slab": 5476352, "slab_reclaimable": 3026944, "slab_unreclaimable": 2449408, "sock": 20480, "thp_collapse_alloc": 0, "thp_fault_alloc": 0, "unevictable": 0, "workingset_activate": 5181, "workingset_nodereclaim": 0, "workingset_refault": 75867}, "limit": 8341946368}

Maybe it’s related to the dockerd version itself?

docker -v
Docker version 20.10.2, build 2291f61

drBeam · September 29, 2021, 1:13pm

Got the same problems after upgrading my Docker hosts from Debian 10 to 11. Tried the latest Checkmk 2.0.0p11 and the latest agent with the latest plugin. No change.

BenediktSeidl · September 30, 2021, 5:54am

This is probably related to Debian switching to cgroupv2 with version 11.
There will be a fix for version 2.1: Werk #12310
A workaround could be to downgrade to cgroupv1 on the host running docker.
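To check which hierarchy a host is actually using, something like the following Python sketch may help. It relies on the fact that a cgroupv2 (unified) mount exposes a `cgroup.controllers` file at its root, while a cgroupv1 setup has one directory per controller such as `memory`. The mount point `/sys/fs/cgroup` and the function name `cgroup_mode` are assumptions; adjust them if your system mounts cgroups elsewhere.

```python
from pathlib import Path


def cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Return "v2", "v1" or "unknown" for the cgroup mount at root."""
    base = Path(root)
    # The unified (v2) hierarchy has cgroup.controllers in its root directory.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    # Legacy and hybrid setups mount each v1 controller in its own directory.
    if (base / "memory").is_dir():
        return "v1"
    return "unknown"


if __name__ == "__main__":
    print("cgroup mode:", cgroup_mode())
```

On a Debian 11 default install this would be expected to report `v2`, and `v1` after switching back to the legacy hierarchy.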

lkoenig · October 3, 2021, 10:51am

I can confirm that after configuring my Debian 11 system to use the legacy cgroup hierarchy (cgroupv1), the checks work again. The containers still run as expected, so I will leave it at that, wait for version 2.1, and change it back then.
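The post does not say how the switch was made. One common way on systemd-based systems is the kernel command line option `systemd.unified_cgroup_hierarchy`, so the sketch below assumes that method and only checks whether a given command line asks for the legacy hierarchy. The function name `legacy_cgroup_requested` is hypothetical.

```python
# Spellings treated as "false" here; an assumption based on systemd's
# usual boolean parsing.
FALSE_VALUES = {"0", "no", "n", "false", "f", "off"}


def legacy_cgroup_requested(cmdline: str) -> bool:
    """True if the kernel command line disables the unified cgroup hierarchy."""
    requested = False
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "systemd.unified_cgroup_hierarchy":
            # If the option is given more than once, the last one is used.
            requested = value.lower() in FALSE_VALUES
    return requested


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            print("legacy cgroups requested:", legacy_cgroup_requested(fh.read()))
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
```

This only reports what was requested at boot; whether Docker then really runs on cgroupv1 still has to be verified on the host.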

andreas-doehler · October 3, 2021, 11:29am

I must say that this problem is a little wider than just the Docker plugin no longer working. The current version of Proxmox also uses cgroupv2 and needs this fix. I cannot understand why this change is classified as a new feature. This is a classic bugfix for the Docker and Proxmox agents and should be included in a normal patch release.

andreas-doehler · November 1, 2021, 4:23pm

I changed the title. As written above, this should be included in a normal patch release.
@LaMi

Switching back to the old cgroup hierarchy is not an option with current Proxmox setups.

LaMi · November 2, 2021, 7:00am

We have already scheduled the backport of the change to 2.0. The process will take some days, but it should be done within one of the next patch releases.

andreas-doehler · November 2, 2021, 8:02am

Thanks @LaMi. It was just unclear, as the workaround is not possible in production environments
and it had not been mentioned that the fix, which was already made in March, would also be included in 2.0.

LaMi · November 2, 2021, 8:45am

All good. Thanks for the signal.

martin.hirschvogel · November 16, 2021, 11:13am

Backport is planned for the current sprint. Cheers

ampfinger · December 9, 2021, 8:24am

Hey,

Any news about this?
There have been two new patch releases, but still no sign of this.

Thanks for your hard work!

Best regards

martin.hirschvogel · December 9, 2021, 9:24am

It’s a bit more complicated than thought to solve this while ensuring that it doesn’t break anything (always our issue, you can’t just fix it like that, because we need to test every change a lot). Our admin is testing it at the moment and has discovered some new issues, which we now need to fix.

andreas-doehler · December 15, 2021, 8:45pm

This is exactly the point where it would be good to know what the problem in your testing environment is.
I tested this “workaround” with some Proxmox hosts and did not have any problems there.
At the moment it “looks like” no one is doing anything. I think you know what I mean.

martin.hirschvogel · December 16, 2021, 6:12am

I can assure you that we have easily put 20+ hours into fixing this. We are reworking the entire lxc and docker checks to the new Check-API as part of this fix. So we are not just doing a workaround, but a proper long-term fix, which will also help us maintain this plug-in more easily in the future.

martin.hirschvogel · January 10, 2022, 1:29pm

We are almost done with it.
Anyone interested in testing this? It works in our environment, but it would be good if we could get some users to try it out as well.

ampfinger · January 10, 2022, 2:26pm

Hey,

As I am using Proxmox and Docker, I would test it to see if it’s working in my environment.

Best regards

martin.hirschvogel · January 28, 2022, 5:34pm

Live now with Checkmk 2.0.0p19 with Werk 12307

system · January 28, 2023, 5:34pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed. Contact an admin if you think this should be re-opened.