Slow Web-UI after Upgrade to 2.0

righter · September 2, 2021, 11:29am

Hi

We’ve been using the latest 1.6 release of the Enterprise edition with 260 hosts and 4700 services.
With 1.6 everything was fast and smooth.

Since we upgraded to 2.0 the Web-UI is sometimes really slow.
I cannot see any errors in the log, and CPU and RAM usage are not particularly high.
So the UI is 3-5 times slower than before.

Any idea how to dig deeper into this problem?

thanks

KartoffelSalat · September 2, 2021, 1:30pm

I see similar behavior with our installation. The slowness is not constant, but it occurs most of the time, and when it does, it is significant.

Any ideas on which system parameters could be fine-tuned?

andreas-doehler · September 2, 2021, 7:47pm

There are several possible causes.
For a slow UI, the first thing to look at is the livestatus connection to the core.
How is this set up in your system?
If it is the Enterprise edition, it should use the livestatus proxy on the local connection as well.
In that case you can take a look at the livestatus status file. If all channels are in use all the time, you will get a slow UI.
Having all channels in use can have different causes.

KartoffelSalat · September 3, 2021, 11:51am

In my case I have an instance acting as a hub for four instances that do the actual checks and probes. The livestatus proxy daemon is set up for all instances.

Can you point me to any sources that describe which parameters I could adjust and how to analyse the situation? I didn’t even find anything about a livestatus status file via Google.

Thanks in advance.

andreas-doehler · September 3, 2021, 12:51pm

The important file is the “liveproxyd.state” inside “~/var/log/”.
There you can see whether all channels are in use or some are free.

In a site without any active connections, the content looks like this:

Current state:
[cmk]
  State:                   ready
  State dump time:         2021-09-03 12:50:04 (0:00:00)
  Last reset:              2021-09-03 12:49:49 (0:00:15)
  Site's last reload:      2021-09-03 12:43:46 (0:06:18)
  Last failed connect:     Never
  Last failed error:       None
  Cached responses:        1
  Channels:
       9 - ready             -  client: none - since: 2021-09-03 12:49:49 (0:00:15)
      10 - ready             -  client: none - since: 2021-09-03 12:49:49 (0:00:14)
      11 - ready             -  client: none - since: 2021-09-03 12:49:54 (0:00:10)
      12 - ready             -  client: none - since: 2021-09-03 12:49:59 (0:00:05)
      13 - ready             -  client: none - since: 2021-09-03 12:49:49 (0:00:15)
  Clients:

You can change the number of channels if there are too few.
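
To check the state file repeatedly while the UI is slow, a small script can count the channel states per site. The following is only a minimal sketch: it assumes the channel lines always look like the ones in the dump above, and it treats every state other than "ready" as "in use", which is an assumption and not taken from the liveproxyd documentation. The function names are made up for this example.

```python
import re
from collections import Counter

# Matches channel lines as they appear in the dump above, for example:
#   "     9 - ready             -  client: none - since: 2021-09-03 12:49:49 (0:00:15)"
CHANNEL_RE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+-\s+client:")
# Matches site headers such as "[cmk]".
SITE_RE = re.compile(r"^\[(.+)\]\s*$")


def count_channel_states(state_text):
    """Return {site_id: Counter({channel_state: count})} for a state dump."""
    result = {}
    site = None
    for line in state_text.splitlines():
        site_match = SITE_RE.match(line)
        if site_match:
            site = site_match.group(1)
            result.setdefault(site, Counter())
            continue
        channel_match = CHANNEL_RE.match(line)
        if channel_match and site is not None:
            result[site][channel_match.group(2)] += 1
    return result


def sites_without_free_channel(state_text):
    """List sites that have channels, none of which is 'ready' (assumed to mean free)."""
    return [
        site
        for site, states in count_channel_states(state_text).items()
        if states and states["ready"] == 0
    ]
```

As site user, the text could be read from var/log/liveproxyd.state in the site directory and passed to sites_without_free_channel(). A state file that contains only the "Current state:" header yields an empty result.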

KartoffelSalat · September 3, 2021, 3:23pm

Thanks a lot for this information.

Unfortunately, everything seems to be fine on this side, but I’ll take a look the next time I experience any slowness.

Update: I just had a reproducibly slow UI again when accessing the notification rules. I then disabled the remote-site access via liveproxy, and suddenly all performance issues were gone.

But I wonder what could be tuned in that liveproxy, since it always displayed all channels as ready.

@righter maybe you could check this too

righter · September 6, 2021, 6:44am

Hi

Strange, my var/log/liveproxyd.state seems to be empty:

OMD[STAR]:~$ cat var/log/liveproxyd.state 
----------------------------------------------
Current state:

@KartoffelSalat
Where did you disable the remote-site access? Does that have an impact on anything else?

KartoffelSalat · September 6, 2021, 7:59am

@righter In the Distributed Monitoring section, when you edit a connection, the Status connection section has the option Use Livestatus Proxy Daemon, where you can select Connect directly.

sebkir · September 8, 2021, 3:39pm

Hello @righter and @KartoffelSalat,

do any of you possibly use the dashlets “Host statistics” or “Service statistics”? Although you stated that you do not see an “increase in CPU”*, perhaps you could still check whether you are using them. If so, please try removing and re-adding them and report back. There will be a fix for this specific issue in p10: Fixed steadily rising CPU due to misconfiguration when cloning builtin dashboards

* It would be interesting to know what exactly you mean: load or utilization? Additionally, I would like to point to a very interesting article in our official guide about monitoring CPU utilization, which also applies to this scenario. It might be the case that a single process is causing a lot of trouble and blocking an entire core without you noticing, because the 7 - 127 other cores have nothing to do and hence the (average) CPU utilization is very low. Check out at least Chapter 1 of the following article. It was an eye-opener for me a couple of years ago: Best practices, tips & tricks
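
To illustrate the point about averages hiding one saturated core, here is a minimal sketch for Linux that compares two snapshots of /proc/stat and reports the busy percentage per core. The function names are made up for this example, and idle plus iowait is counted as "not busy", which is a common convention but still a choice.

```python
def parse_proc_stat(text):
    """Return {core: (idle_ticks, total_ticks)} for every 'cpuN' line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr" or "ctxt".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Columns: user nice system idle iowait irq softirq steal
        # (guest time is already included in user, so it is not added again).
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        cores[fields[0]] = (idle, sum(ticks))
    return cores


def per_core_utilization(before, after):
    """Percent busy per core between two /proc/stat snapshots."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    usage = {}
    for core, (idle_2, total_2) in second.items():
        idle_1, total_1 = first.get(core, (0, 0))
        elapsed = total_2 - total_1
        if elapsed > 0:
            usage[core] = 100.0 * (elapsed - (idle_2 - idle_1)) / elapsed
    return usage
```

In practice, the two snapshots would be the contents of /proc/stat read a few seconds apart. A result in which one core stays close to 100 while the others are near 0 is exactly the case that an averaged utilization graph does not show.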

righter · September 9, 2021, 12:33pm

Hi

It’s strange: two days after the upgrade, I can’t see any performance problems anymore.
Now it runs smoothly…

I’ve checked the load, and I haven’t seen any single core at 100% all the time.

thanks anyway

KartoffelSalat · September 13, 2021, 12:31pm

In my case I see neither significant load nor high CPU utilisation, and no hanging processes.

The phenomenon when using the liveProxyDaemon is that occasionally, when you open a page (e.g. the services of a host), the page loads forever until it has been reloaded once or twice.

I wish I could provide some useful information as a clue to the cause …

I doubt the stats dashlets are involved in that, since we normally don’t use any of the dashboards.

To me, as an ordinary person, it feels like some sort of broken pipe in the UI.

KartoffelSalat · September 17, 2021, 6:30am

In my case, after trying a lot of things to narrow down the problem, it turned out that the cause of the slowness was occasional packet loss on the route to the central instance.
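
For anyone who wants to check for the same cause, here is a minimal sketch that opens repeated TCP connections to a remote site and records how long each connect takes. It does not measure packet loss directly: the idea is that a lost SYN usually shows up as a connect time of about a second or more (a retransmission) or as a timeout. The host, the port, the threshold and the function names are all assumptions for this example; use whatever address and port your status connection is configured with.

```python
import socket
import time


def probe_tcp_connect(host, port, attempts=20, timeout=3.0, pause=0.5):
    """Return one entry per attempt: connect time in seconds, or None on failure."""
    results = []
    for _ in range(attempts):
        started = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - started)
        except OSError:  # covers timeouts, refused connections and routing errors
            results.append(None)
        time.sleep(pause)
    return results


def summarize(results, slow_threshold=0.9):
    """Count failed and suspiciously slow connects (threshold is an assumption)."""
    ok = [seconds for seconds in results if seconds is not None]
    return {
        "attempts": len(results),
        "failed": len(results) - len(ok),
        "slow": sum(1 for seconds in ok if seconds >= slow_threshold),
        "max_seconds": max(ok) if ok else None,
    }
```

Calling summarize(probe_tcp_connect("remote-site.example", 6557)) from the central instance would give a rough picture; both arguments are placeholders. A handful of failed or slow attempts among otherwise fast connects would be a reason to look at the network path more closely with proper tools.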
