I’ve been digging into this with the help of Claude, and this is what we have found.
Setup folder view in 2.5.0 issues one livestatus query per host, fanned out to all sites (N+1, GUI blocks for tens of seconds)
CMK version: 2.5.0p5 (Raw / Community Edition)
OS version: Ubuntu 24.04
Edition / setup: Distributed monitoring, 1 central site + 7 remote sites, no liveproxyd (CRE)
Summary
Since upgrading from 2.4.0 to 2.5.0, opening any Setup → host folder that contains hosts is very slow on the central site. The slowness scales with the number of hosts in the folder, and it occurs even for folders whose hosts are all Monitored on site = central, so it is not a remote-site reachability problem in the usual sense.
Request profiling shows the Setup folder view issues one livestatus query per host during render, and each of those queries is broadcast to all connected sites instead of being scoped to the single site that monitors the host. The GUI then blocks serially in select() waiting for every site to answer, per host.
Observed timings
Two folders, document load time of wato.py?...&mode=folder:
| Hosts in folder | Page load |
| --- | --- |
| ~16 | ~12.0 s |
| ~105 | ~48.0 s |
Solving fixed + per_host * n for these two measurements gives roughly 0.4 s per host plus a fixed cost of about 5.5 s. Everything else in the browser waterfall is served from disk/memory cache in the tens of ms; only the mode=folder document and ajax_sidebar_get_sites_and_changes.py are slow.
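For anyone who wants to check the arithmetic, here is a minimal sketch of the two-point fit. It only uses the two approximate measurements from the table; `fit_linear` is a name made up for this example, and with just two data points the result is an estimate, not a proper regression.

```python
def fit_linear(n1, t1, n2, t2):
    """Solve t = fixed + per_host * n from two (hosts, seconds) measurements."""
    per_host = (t2 - t1) / (n2 - n1)   # slope: seconds added per host
    fixed = t1 - per_host * n1         # intercept: cost independent of host count
    return fixed, per_host


# Approximate values from the table: ~16 hosts -> ~12.0 s, ~105 hosts -> ~48.0 s
fixed, per_host = fit_linear(16, 12.0, 105, 48.0)
print(f"per host: {per_host:.2f} s, fixed: {fixed:.2f} s")
# prints "per host: 0.40 s, fixed: 5.53 s"
```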
What was ruled out
- DNS: All affected hosts have a static IPv4 address configured and IP address family = IPv4 only. getent hosts <name> resolves in 2–3 ms from the site shell. (I initially suspected werk #19061, but the profile shows no name resolution in the hot path.)
- Connection churn / TLS: connect_to_site is called exactly 8 times (7 remotes + central), once each, totalling ~4.3 s. This is the fixed cost. No per-host reconnect.
- Process spawning / ping probe: negligible subprocess activity in the profile.
Profile evidence
Setup → Global settings → Profile requests, single load of the ~105-host folder. cProfile, sorted by cumulative time (trimmed):
1562067 function calls in 42.546 seconds
Ordered by: cumulative time
ncalls tottime cumtime filename:lineno(function)
1 0.001 41.339 cmk/gui/wato/pages/folders.py:1032(_show_hosts)
106 0.013 41.292 cmk/gui/wato/pages/folders.py:1091(_show_host_row)
106 0.010 40.957 cmk/gui/wato/pages/folders.py:1246(_show_host_actions)
107 0.003 36.364 cmk/livestatus_client/_connection.py:1217(query)
107 0.001 36.359 cmk/livestatus_client/_connection.py:1260(query_parallel)
107 0.006 35.982 cmk/livestatus_client/_connection.py:1321(_retrieve_responses)
50986 0.042 35.888 cmk/livestatus_client/queries.py:446(iterate)
1962 0.006 35.321 cmk/livestatus_client/_connection.py:1521(is_socket_readable)
1617 35.311 35.311 {built-in method select.select}
Sorted by internal time (tottime), the wall clock is almost entirely socket wait:
1617 35.311 {built-in method select.select}
Filtered call counts for the livestatus path:
query 107 calls cumtime 36.4 s
query_parallel 107 calls
send_query 856 calls
receive_data 1712 calls
is_socket_readable 1962 calls cumtime 35.3 s
connect_to_site 8 calls cumtime 4.34 s (once per site)
set_only_sites 213 calls
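If someone wants to reproduce the filtered view on their own site, a minimal sketch with Python's standard `pstats` module follows. It assumes the profile file written by the Profile requests setting is a regular cProfile dump; the function name `show_livestatus_hotspots` and the default pattern are my own choices for illustration, not anything from Checkmk.

```python
import pstats


def show_livestatus_hotspots(profile_path, pattern="livestatus_client", limit=15):
    """Print the entries matching `pattern`, sorted by cumulative time.

    `pattern` is a regular expression matched against the printed
    "filename:lineno(function)" column; `limit` caps the number of rows.
    """
    stats = pstats.Stats(profile_path)
    stats.sort_stats("cumulative").print_stats(pattern, limit)
    return stats
```

Calling `show_livestatus_hotspots("multisite.profile")` from the directory that holds the file should list the livestatus client functions; passing `pattern="select"` narrows it down to the socket wait.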
Root cause (as I read it)
_show_host_actions() in cmk/gui/wato/pages/folders.py runs a livestatus query per host (106 host rows → 107 queries). Each logical query results in exactly 8 send_query calls (856 / 107 = 8, the number of connected sites), i.e. the per-host status query is not restricted to the host’s owning site but is broadcast to all 8 sites. The GUI then blocks in is_socket_readable / select() per host until all sites respond.
This produces an N×M pattern (hosts × sites) of blocking round-trips inside a single page render. On CRE there is no liveproxyd to amortise this, so the full per-site latency is paid for every host. It also explains why central-only folders are slow: those hosts’ status queries are still broadcast to all 7 remotes.
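To make the hosts × sites arithmetic concrete, here is a small sketch that derives the fan-out from the profile counts and compares it with what scoped or batched querying would need. The "scoped" and "batched" figures are simple projections under the assumption that nothing else changes; they are not measurements.

```python
# Counts taken from the profile of the ~105-host folder.
logical_queries = 107   # calls to query()
observed_sends = 856    # calls to send_query()
connected_sites = 8     # 7 remotes + central

# Fan-out: how many sites each logical query was actually sent to.
fan_out = observed_sends / logical_queries
print(fan_out)  # prints 8.0

# Projected number of send_query calls under three strategies.
broadcast = logical_queries * connected_sites  # current behaviour
scoped = logical_queries * 1                   # each query limited to the owning site
batched = connected_sites                      # one query for the folder, sent once per site

print(broadcast, scoped, batched)  # prints 856 107 8
```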
A correct folder render should either issue one batched query for the whole folder, or at minimum scope each host’s status query to the single site that monitors it (set_only_sites([owning_site])).
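As an illustration of the batched variant, a single Livestatus query can select many hosts by listing one `Filter:` line per host and combining them with `Or:`. The sketch below only builds the query text; the helper name `build_batched_host_query` is hypothetical, and the `name` / `state` columns are placeholders because I do not know which columns the folder view actually needs.

```python
def build_batched_host_query(host_names, columns=("name", "state")):
    """Build one Livestatus query that covers every host in the list."""
    if not host_names:
        raise ValueError("need at least one host name")
    lines = ["GET hosts", "Columns: " + " ".join(columns)]
    # One filter per host ...
    lines += [f"Filter: name = {name}" for name in host_names]
    # ... combined with a logical OR over all of them.
    if len(host_names) > 1:
        lines.append(f"Or: {len(host_names)}")
    return "\n".join(lines) + "\n"


print(build_batched_host_query(["host-a", "host-b", "host-c"]))
```

With this shape, the number of round-trips per page render depends on the number of sites rather than on the number of hosts in the folder.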
Reproduction
- Distributed CRE, 1 central + several remotes.
- Open a Setup host folder with N hosts (all with static IPs, IPv4-only).
- Page load time grows ~linearly with N; profile shows one query per host in _show_host_actions, each fanned out to all sites.
Impact
Setup/WATO folder navigation becomes unusable on larger folders in distributed CRE setups after the 2.4 → 2.5 upgrade. Folders that loaded in ~1 s on 2.4.0 now take tens of seconds.
Questions for the devs
- Is the per-host query in _show_host_actions intended, or is the missing per-host set_only_sites scoping a regression introduced in 2.5.0?
- Can the folder view be changed to a single batched livestatus query per folder?
Happy to provide the full multisite.profile / multisite.cachegrind and the browser waterfalls.