hi there, sorry for the late response; had some vacation time.
will do the updates asap.
i have only one checkmk-server.
what confuses me is that the non-availability of 1 of my 9 pve-nodes somehow cripples the monitoring of the remaining 8 (timeout).
the error "Request failed. (Too many active connections)" was more or less the only thing i found on the remaining, non-functional nodes.
it may be that the disk failure (we mitigated the issue by swapping sata cables and bays) locks the node in some kind of weird state, but i don't understand how this brings down the monitoring of the whole pve-cluster.
as soon as our node 9 (ryzen9) came back online properly, monitoring for all the cluster nodes became functional again as well.
as stated above, when we shut it down properly for maintenance, the monitoring of the remaining nodes behaved as expected.