Web interface slow when Livestatus TLS encryption is enabled in distributed monitoring

CMK version: Raw 2.0.0p17 (stable)
OS version: Ubuntu 20.04.3 LTS

Hi guys,

I searched the forum but could not find a similar post. Please let me know if this problem is already known or solved, or if there is an existing post on this matter.

We are currently evaluating check_mk 2.0 (Raw Edition, current stable version) in a common distributed monitoring environment with 1 master and 1 slave site. For this, we set up two virtual machines with identical specs (4 cores, 8 GB RAM, 32 GB root fs on SSD-backed storage, 1 vNIC in the same layer 2 network) running the current Ubuntu LTS 20.04.3 on a KVM hypervisor and installed check_mk Raw Edition 2.0.0p17. We configured our “slave” site and made sure LIVESTATUS_TCP_TLS is enabled there. As soon as we added the site to our master via the distributed monitoring setup (login & trust CA certificate), any action in the master's web GUI took 2-5 times longer than with an unencrypted site connection.

In more detail:

  • reloading index.py usually takes 2-3s on average; with TLS enabled, the reload takes >8s
  • we also see this when loading graphs, which takes about 7-12s
  • drilling and sliding within the graphs is as unresponsive as the rest of the web interface

There might be an issue with my configuration, so I would be glad if someone could provide further information on how I can debug this issue.

Thanks and best regards,
nh

Are you using the livestatus proxy? If not, you should enable it both for the remote and local site.

Nevertheless, the encryption should not have that kind of impact on your infrastructure, especially when the sites are this close to each other network-wise.

Are there any hints performance-wise regarding your checkmk servers?

Hi Robin,

thanks for your response! The instances are running on the same hypervisor, so there is minimal network latency involved here. We will install a CEE in order to test our scenario with the livestatus proxy.

Would you elaborate on performance-wise hints? We do not observe any CPU, Memory or Disk I/O Bottleneck - utilization peaks to 10-25% max. The VMs are in idle state.

Best regards,
nh

The CFE would have been another side note later on. But the CRE does not perform worse than the enterprise editions as far as the web interface is concerned. It is the core that makes the difference. How many hosts and services are you monitoring currently?

I meant just what you said with ‘performance-wise’. Sometimes you have hardware resources slowing your whole application.

Hi Robin,

we currently monitor 29 hosts with about 1300 services on the slave site and 3 hosts with about 100 services on the master site.

Yeah, IMHO I would not expect any improvements using CEE, but I might be wrong :wink:

Best regards,
nh

Alright, those numbers do not pose any problem even to the nagios core.

Well, there is a bunch of awesome functionality in the enterprise editions, but the generic monitoring is basically the same. That’s why I would like to understand your issue better, but it is probably not worth the time at this point.

Sorry, but once I updated all 3 distributed nodes to 2.0.0p17 (Raw Edition), I observed a slowdown of the web interface (search, edit, navigation in general).
So I disabled encryption, and now the web interface is as fast as before (I was using 2.0.0p12).

Thanks for your help and support. Eventually I will have a look at how we can trace the code execution in order to find the function causing the performance issues.
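One possible starting point for such tracing is Python's built-in `cProfile` module. The sketch below is only an illustration: `profile_call` is a hypothetical helper, not part of Checkmk, and it assumes you can call the suspect code path (for example a function that runs a Livestatus query) directly from a Python shell.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    # Sort by cumulative time so the most expensive call chains come first.
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()
```

Calling `result, report = profile_call(some_function)` and printing `report` lists the functions with the highest cumulative time, which should help narrow down where a slow request spends its time.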

If you reach out to our sales people for your evaluation we might be able to get a consultant to take a look at your setup, I am quite sure this is not a typical behavior.

Hi,
we are running checkmk 2.0p17 cre with an encrypted livestatus connection, too. The gui is really slow. After we disabled tls encryption for all livestatus queries everything is working fast and responsive again.

Previously, we had a 1.6 CRE up and running in a distributed environment with TLS enabled for livestatus queries and saw no slowdown of the web UI.

bye
David

Hi,
If you have livestatus connection problems when using TLS and no liveproxyd in between, you can set the poll timeout to a lower value in the following file.

--- a/lib/python3/livestatus.py
+++ b/lib/python3/livestatus.py
@@ -531,7 +531,7 @@ class SingleSiteConnection(Helpers):
         receive_start = time.time()
 
         while size > 0:
-            readylist = self._socket_poller.poll(1000)
+            readylist = self._socket_poller.poll(10)
             if readylist or (isinstance(self.socket, ssl.SSLSocket) and self.socket.pending()):
                 packet = self.socket.recv(size)
                 if not packet:

SSL sockets + select/poll => Tons of fun…
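
To illustrate the general pitfall: `poll()` only sees the OS-level socket, while a TLS socket can already hold decrypted bytes in its own buffer. A receive loop can therefore check `pending()` before it waits. The following is a minimal sketch under that assumption, not the Checkmk implementation. `recv_exact`, `poll_timeout_ms` and `deadline_s` are names made up for this example, and `select.poll` is not available on Windows.

```python
import select
import socket
import ssl
import time


def recv_exact(sock: socket.socket, size, poll_timeout_ms=1000, deadline_s=10.0):
    """Read exactly `size` bytes from a plain or TLS-wrapped socket."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    give_up_at = time.monotonic() + deadline_s
    chunks = []
    while size > 0:
        if time.monotonic() > give_up_at:
            raise TimeoutError("no complete response before the deadline")
        # Decrypted bytes may already be buffered inside the SSL object,
        # where poll() on the OS-level descriptor cannot see them.
        buffered = isinstance(sock, ssl.SSLSocket) and sock.pending() > 0
        if not buffered and not poller.poll(poll_timeout_ms):
            continue  # nothing readable yet; re-check the deadline and wait again
        packet = sock.recv(size)
        if not packet:
            raise ConnectionError("socket closed before all data arrived")
        chunks.append(packet)
        size -= len(packet)
    return b"".join(chunks)
```

With the check in this order, the poll timeout is only ever waited on when neither the TLS buffer nor the OS socket has data, so its value no longer affects the normal case.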

Regards
Andreas

Hi,

thanks for supporting us in this matter. In my specific test case your proposed change solves the issue of a slow web interface when using TLS encrypted connections to other sites.

In my naive understanding of a timeout value, changing it should make no difference unless the timeout is actually hit. Specifying a timeout of 1s seems reasonable in a WAN environment, but it should never trigger in our test scenario: two checkmk instances running on a single piece of hardware in the same virtual network, with an actual network latency of < 1ms. Those VMs are isolated, and we do not have network latency issues or connection problems. Following this hypothesis, changing the timeout value does not solve the underlying issue, and the timeout is not the cause of it either.
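
One way the timeout could be hit even without any network latency is sketched below. It is a simulation, not a measurement of Checkmk. `BufferedReader` is a hypothetical stand-in for a TLS socket that pulls a whole record into an in-process buffer on the first `recv()`, and the 16-byte header followed by a body is an assumption about the response layout.

```python
import select
import socket
import time


class BufferedReader:
    """Stand-in for a TLS socket: the first recv() drains the OS socket,
    later reads are served from an in-process buffer."""

    def __init__(self, sock: socket.socket):
        self.sock = sock
        self._buf = b""

    def fileno(self):
        return self.sock.fileno()

    def pending(self):
        return len(self._buf)

    def recv(self, size):
        if not self._buf:
            self._buf = self.sock.recv(65536)
        data, self._buf = self._buf[:size], self._buf[size:]
        return data


def timed_read(reader, sizes, timeout_ms, check_pending_first):
    """Read the given chunk sizes in order and return the elapsed seconds."""
    poller = select.poll()
    poller.register(reader.fileno(), select.POLLIN)
    start = time.monotonic()
    for size in sizes:
        # Polling first waits on the OS socket even if data is already buffered.
        if not (check_pending_first and reader.pending()):
            poller.poll(timeout_ms)
        reader.recv(size)
    return time.monotonic() - start
```

If a peer sends header and body in one go, reading `[16, body_length]` with `check_pending_first=False` waits out the full `timeout_ms` on the second chunk, because the body already sits in the buffer and the OS socket is empty. With `check_pending_first=True` the same read returns almost immediately. That would be consistent with a lower poll timeout hiding the symptom without fixing the cause.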


I just randomly ran into the analyze configuration feature of checkmk and saw that all sites (even the local site) are marked as critical in terms of Livestatus usage:

CRIT: The current livestatus usage is 100.00% (!!), 20 of 20 connections used (!!), You have a connection overflow rate of 0.00/s (!!)

I restarted all sites several times and due to the nature of the isolated environment I am the only user of the system.

Several questions arose:

  • Is this many used connections the intended behavior? If so, why is this check red?
  • Is this symptom related to this issue?
  • Why are there so many used connections?
  • What are the connections used for?
  • What is the actual cause for so many open(?) connections?
  • How may I decrease the number of open connections?

Best,
nh

This is normal for the “Raw” edition, because the feature that is checked there does not exist in it. If you test with the Free edition, or if you have an Enterprise edition somewhere, you will see actual values there.

Thanks Andreas for clarifying this. My observation is a dead-end then.

Hi,

Bug or feature? This should not be the normal behavior for users of the Raw version, I hope :wink:
@robin.gierse: I had the same issue. I think this must be troubleshot by the cmk devs.

cheers
Dennis

Hi Dennis,
I’ll implement better SSL socket handling in the next release.
The following link pretty much explains the general problem with these ssl sockets.

Regards
Andreas

Hi Andreas,

the next release you mentioned is not by chance 2.0.0p18?

bye
David

Such a big new implementation is most likely meant for version 2.1 :slight_smile: