Slow folder trees with hosts since upgrade to 2.5.0p5

CMK version: Community 2.5.0p5
OS version: Ubuntu 24.04 LTS

Error message: Slow loading of Hosts in a folder in distributed environment

Setup: Distributed setup with 1 admin host and 7 remote sites

Issue:
I just did a two-stage upgrade: 2.4.0p24 → 2.4.0p31 → 2.5.0p5.
Since 2.5, the content of any folder in the configuration loads very slowly. It takes 14 seconds to load a folder with 20 hosts, 17 seconds to load a folder with 32 hosts, etc.
This was very snappy on 2.4, but is now a frustration.

Browsing through folders is quick.
We only have hosts in the lowest folders; a folder containing only folders loads normally.

Expected situation:
Snappy folder trees with hosts.

I’ve been digging into it with the help of Claude, and this is what we have found.

Setup folder view in 2.5.0 issues one livestatus query per host, fanned out to all sites (N+1, GUI blocks for tens of seconds)

CMK version: 2.5.0p5 (Raw / Community Edition)
OS version: Ubuntu 24.04
Edition / setup: Distributed monitoring, 1 central site + 7 remote sites, no liveproxyd (CRE)

Summary

Since upgrading from 2.4.0 to 2.5.0, opening any Setup → host folder that contains hosts is very slow on the central site. The slowness scales with the number of hosts in the folder, and it occurs even for folders whose hosts are all Monitored on site = central, so it is not a remote-site reachability problem in the usual sense.

Request profiling shows the Setup folder view issues one livestatus query per host during render, and each of those queries is broadcast to all connected sites instead of being scoped to the single site that monitors the host. The GUI then blocks serially in select() waiting for every site to answer, per host.

Observed timings

Two folders, document load time of wato.py?...&mode=folder:

Hosts in folder    Page load
~16                ~12.0 s
~105               ~48.0 s

Fitting page_load = fixed + per_host * n through these two points gives roughly 0.4 s per host plus a fixed cost of about 5.5 s. Everything else in the browser waterfall is served from disk/memory cache in the tens of ms; only the mode=folder document and ajax_sidebar_get_sites_and_changes.py are slow.
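For anyone who wants to check the arithmetic, here is a minimal sketch of the two-point fit. The host counts and load times are the approximate values from the table above; the function name is just illustrative.

```python
def fit_two_points(n1, t1, n2, t2):
    """Solve t = fixed + per_host * n through two (hosts, seconds) points."""
    per_host = (t2 - t1) / (n2 - n1)
    fixed = t1 - per_host * n1
    return fixed, per_host


# Approximate measurements from the two folders above.
fixed, per_host = fit_two_points(16, 12.0, 105, 48.0)
print(f"per host: {per_host:.2f} s, fixed: {fixed:.2f} s")
# prints "per host: 0.40 s, fixed: 5.53 s"
```

With only two data points this is a rough estimate, not a regression, but it is enough to show the per-host term dominates on larger folders.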

What was ruled out

  • DNS: All affected hosts have a static IPv4 address configured and IP address family = IPv4 only. getent hosts <name> resolves in 2–3 ms from the site shell. (I initially suspected werk #19061, but the profile shows no name resolution in the hot path.)

  • Connection churn / TLS: connect_to_site is called exactly 8 times (7 remotes + central), once each, totalling ~4.3 s. That accounts for most of the fixed cost; there is no per-host reconnect.

  • Process spawning / ping probe: negligible subprocess activity in the profile.

Profile evidence

Setup → Global settings → Profile requests, single load of the ~105-host folder. cProfile, sorted by cumulative time (trimmed):

1562067 function calls in 42.546 seconds
Ordered by: cumulative time

ncalls  tottime  cumtime  filename:lineno(function)
     1   0.001   41.339  cmk/gui/wato/pages/folders.py:1032(_show_hosts)
   106   0.013   41.292  cmk/gui/wato/pages/folders.py:1091(_show_host_row)
   106   0.010   40.957  cmk/gui/wato/pages/folders.py:1246(_show_host_actions)
   107   0.003   36.364  cmk/livestatus_client/_connection.py:1217(query)
   107   0.001   36.359  cmk/livestatus_client/_connection.py:1260(query_parallel)
   107   0.006   35.982  cmk/livestatus_client/_connection.py:1321(_retrieve_responses)
 50986   0.042   35.888  cmk/livestatus_client/queries.py:446(iterate)
  1962   0.006   35.321  cmk/livestatus_client/_connection.py:1521(is_socket_readable)
  1617  35.311   35.311  {built-in method select.select}

Sorted by internal time (tottime), about 83% of the 42.5 s wall clock (35.3 s) is socket wait in select():

1617  35.311  {built-in method select.select}

Filtered call counts for the livestatus path:

   query              107 calls   cumtime 36.4 s
   query_parallel     107 calls
   send_query         856 calls
   receive_data      1712 calls
   is_socket_readable 1962 calls  cumtime 35.3 s
   connect_to_site      8 calls   cumtime  4.34 s   (once per site)
   set_only_sites     213 calls
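In case someone wants to produce the same listings from their own profile, here is a minimal sketch using Python's standard pstats module. It assumes the profile file is a regular cProfile dump that pstats can load; the helper name and the file path in the usage note are my own and not part of Checkmk.

```python
import pstats


def top_functions(profile_path, sort_key="tottime", limit=10):
    """Return (function, ncalls, tottime, cumtime) rows from a cProfile dump."""
    stats = pstats.Stats(profile_path)
    # Index of the column to sort on within each row tuple built below.
    column = {"tottime": 2, "cumtime": 3}[sort_key]
    rows = [
        (f"{filename}:{lineno}({name})", ncalls, tottime, cumtime)
        for (filename, lineno, name), (_prim, ncalls, tottime, cumtime, _callers)
        in stats.stats.items()
    ]
    rows.sort(key=lambda row: row[column], reverse=True)
    return rows[:limit]
```

Calling top_functions("multisite.profile", "cumtime") from the directory that holds the profile should give a list comparable to the cumulative-time listing above, and the default "tottime" sort shows where the wall clock is actually spent.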

Root cause (as I read it)

_show_host_actions() in cmk/gui/wato/pages/folders.py runs a livestatus query per host (106 hosts → 107 queries). Each logical query results in ~8 send_query calls (856 / 107 ≈ 8 = the number of connected sites), i.e. the per-host status query is not restricted to the host’s owning site — it is broadcast to all 8 sites. The GUI then blocks in is_socket_readable / select() per host until all sites respond.

This produces an N×M pattern (hosts × sites) of blocking round-trips inside a single page render. On CRE there is no liveproxyd to amortise this, so the full per-site latency is paid for every host. It also explains why central-only folders are slow: those hosts’ status queries are still broadcast to all 7 remotes.
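The fan-out can be read straight off the call counts. A small sketch of that arithmetic, using only the numbers from the profile above:

```python
# Call counts and cumulative time taken from the profile listing.
counts = {"query": 107, "send_query": 856, "receive_data": 1712}
query_cumtime_s = 36.364

# How many sites each logical query is sent to.
sends_per_query = counts["send_query"] / counts["query"]
# How many receive_data calls are made per query sent.
receives_per_send = counts["receive_data"] / counts["send_query"]
# Average time spent in one logical livestatus query.
seconds_per_query = query_cumtime_s / counts["query"]

print(sends_per_query, receives_per_send, f"{seconds_per_query:.2f}")
# prints "8.0 2.0 0.34"
```

The ~0.34 s per query is in the same range as the ~0.4 s per host estimated from the page load times, which is consistent with the per-host livestatus query being the dominant cost.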

A correct folder render should either issue one batched query for the whole folder, or at minimum scope each host’s status query to the single site that monitors it (set_only_sites([owning_site])).
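To illustrate what "one batched query" could look like at the livestatus level, here is a rough sketch that builds a single LQL query for all hosts in a folder by OR-ing name filters. This is not how the Checkmk GUI code is structured; the function name is hypothetical, and the column list is an assumption because I do not know which columns _show_host_actions actually needs.

```python
def build_batched_host_query(host_names, columns=("name", "state")):
    """Build one LQL query covering all given hosts (illustrative only)."""
    names = list(host_names)
    if not names:
        raise ValueError("at least one host name is required")
    for name in names:
        # A newline inside a name would start a new LQL header line.
        if "\n" in name:
            raise ValueError(f"invalid host name: {name!r}")

    lines = ["GET hosts", "Columns: " + " ".join(columns)]
    lines += [f"Filter: name = {name}" for name in names]
    if len(names) > 1:
        # "Or: N" combines the previous N filters with a logical OR.
        lines.append(f"Or: {len(names)}")
    return "\n".join(lines) + "\n"


print(build_batched_host_query(["web01", "web02", "db01"]))
```

Even if such a query were still sent to all 8 sites, it would be one round of waiting per folder instead of one per host. Scoping it to the sites that actually monitor the hosts in the folder would reduce it further.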

Reproduction

  1. Distributed CRE, 1 central + several remotes.

  2. Open a Setup host folder with N hosts (all with static IPs, IPv4-only).

  3. Page load time grows ~linearly with N; profile shows one query per host in _show_host_actions, each fanned out to all sites.

Impact

Setup/WATO folder navigation becomes unusable on larger folders in distributed CRE setups after the 2.4 → 2.5 upgrade. Folders that loaded in ~1 s on 2.4.0 now take tens of seconds.

Questions for the devs

  • Is the per-host query in _show_host_actions intended, or is the missing per-host set_only_sites scoping a regression introduced in 2.5.0?

  • Can the folder view be changed to a single batched livestatus query per folder?

Happy to provide the full multisite.profile / multisite.cachegrind and the browser waterfalls.