PVE - weird behaviour breaks NMS+Cluster-Monitoring

intelliIT · April 1, 2025, 6:35am

hi there, sorry for the late response; had some vacation time.

will do the updates asap.
i have only one checkmk-server.

so the thing confusing me is that the unavailability of 1 of my 9 pve-nodes somehow cripples the monitoring of the remaining 8 (they time out).
the error "Request failed. (Too many active connections)" was more or less the only thing i found on the remaining nodes whose monitoring was failing.
it may be that the disk failure (we mitigated the issue by swapping sata-cables and bays) locks the node in some kind of weird state, but i don't understand how this brings down the monitoring of the whole pve-cluster.
as soon as our node 9 (ryzen9) came back online properly, monitoring for all the cluster-nodes became functional again as well.

as stated above, when we shut it down properly for maintenance, the monitoring of the remaining nodes behaved as expected.
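
in case it helps to narrow this down: a minimal sketch of how one could check, per node, whether the pve api port answers, refuses, or just hangs. this is only an illustration, not something checkmk does itself. the hostnames pve1..pve9 and the 3 second timeout are placeholders i made up; 8006 is the default pve api port. a node that shows up as "timeout" rather than "refused/unreachable" would be the half-dead kind, as opposed to a cleanly shut down one.

```python
import socket
import time


def probe(host, port=8006, timeout=3.0):
    """Try one TCP connect and return (state, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection raises socket.timeout if nothing answers in time
        with socket.create_connection((host, port), timeout=timeout):
            state = "open"
    except socket.timeout:
        # no answer at all: the node (or its network stack) is hanging
        state = "timeout"
    except OSError:
        # connection refused, no route, DNS failure, etc.
        state = "refused/unreachable"
    return state, time.monotonic() - start


if __name__ == "__main__":
    # hypothetical hostnames, replace with the real node names
    nodes = [f"pve{i}" for i in range(1, 10)]
    for node in nodes:
        state, seconds = probe(node)
        print(f"{node}: {state} after {seconds:.1f}s")
```

running this once while the node is in the broken state and once after a clean shutdown should show whether the two cases really look different from the monitoring server's side.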