Since our recent upgrade to a distributed replication setup using CME, we have been facing major performance issues when trying to edit BI rules or aggregations. The problem is much more severe for rules than for aggregations, though.
Technical environment details
- checkmk version: 1.6.0p20 (parent) and 1.6.0p17 (most child sites)
- OS version: nearly all sites are running in CME docker containers, very few are installed natively on RHEL 7.x
- Sites: roughly 70, some of them in remote locations accessible by site-to-site VPNs
- Hosts: roughly 7000
- Services: roughly 130’000
- BI rules (very rough count from the overview): 850
- BI aggregations (very rough count from the overview): 530
- Addendum: The browser is running as a Citrix published app
Since the issue is more severe for rules, I am only mentioning rules from now on.
Whenever someone tries to edit a BI rule, they have to wait a long time before they can actually make changes to it. The delay grows with the number of nodes the rule contains.
We first suspected a bottleneck on the server side, so we checked resource usage and the traffic going through the reverse proxy and the checkmk host. However, neither is at its resource limit (or anywhere near it).
Next, we fired up the developer console in Edge (Chromium) and started logging network calls as well as performance profiling. This showed that the page load itself completes in around 5 to 10 seconds for any rule we try to edit. However, once loading is done, the page freezes completely. Edge and Chrome start showing “This page isn’t responding”.
The performance profile, on the other hand, showed interesting results. It looks like the culprit is a looped call to jQuery’s t.fn.select2 functions. In total, these calls take around 150 (!) seconds to complete for a BI rule with 19 nodes, while all other JS functions together complete in roughly 2 seconds for the same rule.
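For anyone who wants to cross-check these numbers without a full performance profile, the cumulative cost of a repeatedly called function can be measured by wrapping it. The following is only a minimal sketch: `instrument` and `CallStats` are hypothetical names made up for illustration, not part of checkmk, jQuery or Select2, and the optional `now` parameter exists only so the clock can be replaced.

```typescript
// Hypothetical helper (not part of checkmk, jQuery or Select2):
// wraps a function and accumulates how often it was called and
// how much wall-clock time all calls took together.
interface CallStats {
  calls: number;
  totalMs: number;
}

function instrument<A extends unknown[], R>(
  fn: (...args: A) => R,
  stats: CallStats,
  now: () => number = () => Date.now(), // injectable clock, an assumption for testability
): (...args: A) => R {
  return function (this: unknown, ...args: A): R {
    const start = now();
    try {
      // Forward `this` so that plugin-style methods keep working.
      return fn.apply(this, args);
    } finally {
      // Count the call and add its duration, even if it threw.
      stats.calls += 1;
      stats.totalMs += now() - start;
    }
  };
}
```

In the browser console this could be applied along the lines of `jQuery.fn.select2 = instrument(jQuery.fn.select2, stats)` before the widgets are initialised, assuming the page exposes the plugin under that name. If the plugin stores properties on the function itself, those would need to be copied over to the wrapper as well.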
For BI rules that contain only a few nodes, e.g. 1, the loading time is a few seconds, maybe up to 10 (on top of the page loading time).
Currently, editing medium-sized and larger BI rules is nearly impossible.
Has anyone faced similar issues? Do you know of any solution or hints that might help improve the situation?
Thank you and kind regards
PS: I saved the Edge performance profile and took a video of the issue. However, since both contain customer information, I am unable to easily share them. If it helps, I will take the time to anonymize the data and provide it; just let me know.