Web Interface slow when Livestatus TLS Encrypt. enabled in distributed monitoring

sphinxnh · December 22, 2021, 8:52pm

Hi,

Thanks for supporting us in this matter. In my specific test case, your proposed change fixes the slow web interface when using TLS-encrypted connections to other sites.

In my naive understanding, changing a timeout value should make no difference unless the timeout is actually hit. A timeout of 1s seems reasonable in a WAN environment, but it should never trigger in our test scenario: two Checkmk instances running on a single piece of hardware in the same virtual network, with an actual network latency of < 1ms. Those VMs are isolated; we do not have network latency issues or connection problems. If this hypothesis holds, the timeout value is not the cause of the problem, and changing it does not fix the underlying issue.
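One way to test this hypothesis is to measure how long opening a connection to the other site actually takes and compare it with the 1s timeout. The following is only a minimal sketch: `time_connect` is a hypothetical helper, it measures the plain TCP connect only (not the TLS handshake), and the host and port in the comment are assumptions that have to be adjusted to the actual setup.

```python
import socket
import time


def time_connect(host: str, port: int, timeout: float = 1.0) -> float:
    """Return the seconds needed to open a TCP connection.

    Raises an OSError (for example a timeout error) if the connection
    cannot be established within `timeout` seconds.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


# Example call (assumed address of the remote site and assumed Livestatus TCP port):
# print(f"{time_connect('192.0.2.10', 6557):.4f}s")
```

If the measured time stays far below the timeout, the timeout itself should never be hit for the connect step, and the slowness has to come from somewhere else.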

I happened to come across the "Analyze configuration" feature of Checkmk and saw that all sites (even the local site) are marked as critical in terms of Livestatus usage:

CRIT: The current livestatus usage is 100.00% (!!), 20 of 20 connections used (!!), You have a connection overflow rate of 0.00/s (!!)
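The same numbers can presumably be read directly from Livestatus, which makes it easier to watch them change over time. This is a rough sketch, not a verified recipe: the column names in `COLUMNS` are my assumption of what the `status` table offers, the socket path `tmp/run/live` below the site directory is the usual OMD location but may differ, and `build_query`, `parse_response` and `query_livestatus` are hypothetical helper names.

```python
import json
import socket

# Assumed column names of the Livestatus "status" table; verify them on your site.
COLUMNS = [
    "livestatus_active_connections",
    "livestatus_threads",
    "livestatus_usage",
    "livestatus_overflows_rate",
]


def build_query(columns):
    """Build a Livestatus query for the status table with JSON output."""
    return "GET status\nColumns: " + " ".join(columns) + "\nOutputFormat: json\n\n"


def parse_response(raw, columns):
    """Map the single result row of the status table to the column names."""
    rows = json.loads(raw)
    return dict(zip(columns, rows[0]))


def query_livestatus(socket_path, columns=COLUMNS):
    """Send the query over the site's local Unix socket and return a dict."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(build_query(columns).encode())
        sock.shutdown(socket.SHUT_WR)  # signal the end of the request
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return parse_response(b"".join(chunks).decode(), columns)


# Example call as site user (assumed site name "mysite"):
# print(query_livestatus("/omd/sites/mysite/tmp/run/live"))
```

Note that each such query occupies one Livestatus connection itself while it runs.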

I restarted all sites several times, and because the environment is isolated, I am the only user of the system.

Several questions arose:

Is it intended behavior for this many connections to be in use? If so, why is this check red?
Is this symptom related to the slow web interface?
Why are so many connections in use?
What are the connections used for?
What actually causes so many open(?) connections?
How can I reduce the number of open connections?

Best,
nh