Checkmk Enterprise 2.1.0
In our Central Monitoring System we historically used the permission options in the folder properties to assign hosts and services to contact groups.
We have been doing this without any issue in our Local Monitoring System for years, so we simply carried it over to the Central Monitoring System.
Over time the Central Monitoring System was populated with more and more hosts in different folders, which are organized by application and operating system. The more applications we added, the longer it took to activate changes, and finally some remote sites ran into timeouts while activating changes. After a bit of digging we found out that creating the core configuration took over 7 minutes (1 minute real), which was far too long.
This was the time to open a support ticket with Checkmk.
After doing some profiling together with Checkmk support we found out that the kernel spent a lot of time dealing with rules that assign contact groups. With some further digging we found over 40,000 rules assigning services to contact groups and a couple of thousand assigning hosts to contact groups. The Setup added these rules to the hosts.mk file of the current folder and of every folder below it whenever the folder permission options "Add these groups as contacts to all hosts in this folder" and "Add these groups as contacts to all hosts in all subfolders of this folder" were selected in combination with "Always add these groups as contacts to all services in all subfolders of this folder".
As we assign the contact groups at the top of the folder tree, the rules were added to thousands of hosts.mk files in the subfolders.
For the host_contactgroups assignment a list is luckily used, so we "only" get one rule per hosts.mk file:
Assume we have a folder named folder_a which contains the subfolders folder_a-a and folder_a-b but no hosts of its own. Both folder_a-a and folder_a-b contain hosts. We set the permission options in the properties of folder_a.
For the two folders with hosts.mk files we get two rules assigning the groups to the hosts:
host_contactgroups.insert(0, {'value': ['contactgroup_a', 'contactgroup_b'], 'condition': {'host_folder': '/wato/folder_a/folder_a-b/'}})
host_contactgroups.insert(0, {'value': ['contactgroup_a', 'contactgroup_b'], 'condition': {'host_folder': '/wato/folder_a/folder_a-a/'}})
The assignment of service_contactgroups is worse. Here we additionally get one rule per contact group and per hosts.mk file:
service_contactgroups.insert(0, {'value': 'contactgroup_a', 'condition': {'host_folder': '/wato/folder_a/folder_a-b/'}})
service_contactgroups.insert(0, {'value': 'contactgroup_b', 'condition': {'host_folder': '/wato/folder_a/folder_a-b/'}})
service_contactgroups.insert(0, {'value': 'contactgroup_a', 'condition': {'host_folder': '/wato/folder_a/folder_a-a/'}})
service_contactgroups.insert(0, {'value': 'contactgroup_b', 'condition': {'host_folder': '/wato/folder_a/folder_a-a/'}})
So, with ~40 contact groups and ~1000 folders that contain hosts and therefore have a hosts.mk file, we ended up with 40,000 rules just to assign the contact groups to the services.
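The arithmetic can be reproduced with a small sketch. It is only a model of the behaviour described above, not Checkmk code: the function name generated_rules and the folder and group names are made up for illustration, and the rule dictionaries simply mirror the shape of the lines shown above.

```python
from itertools import product


def generated_rules(folders_with_hosts, contact_groups):
    """Model the rules described in this post (hypothetical helper, not a Checkmk API).

    Hosts: one rule per hosts.mk file, carrying the whole list of groups.
    Services: one rule per hosts.mk file AND per contact group.
    """
    host_rules = [
        {"value": list(contact_groups), "condition": {"host_folder": folder}}
        for folder in folders_with_hosts
    ]
    service_rules = [
        {"value": group, "condition": {"host_folder": folder}}
        for folder, group in product(folders_with_hosts, contact_groups)
    ]
    return host_rules, service_rules


# The two-folder example from above.
folders = ["/wato/folder_a/folder_a-a/", "/wato/folder_a/folder_a-b/"]
groups = ["contactgroup_a", "contactgroup_b"]
host_rules, service_rules = generated_rules(folders, groups)
print(len(host_rules), len(service_rules))  # 2 4

# Roughly our situation: ~1000 folders with hosts, ~40 contact groups.
# Folder and group names here are invented placeholders.
big_folders = [f"/wato/folder_{i}/" for i in range(1000)]
big_groups = [f"contactgroup_{i}" for i in range(40)]
h, s = generated_rules(big_folders, big_groups)
print(len(h), len(s))  # 1000 40000
```

The host rules grow only with the number of hosts.mk files, while the service rules grow with the product of files and contact groups, which is why the service option dominates the rule count.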
A contact group is assigned to a service automatically when it is assigned to the host and no rule assigns a contact group to that specific service (see 6. Assigning services). Luckily this was the case for us: all users with access to a host should also have access to all of its services. With this in mind we removed the option "Always add these groups as contacts to all services in all subfolders of this folder" in the permissions of the folder properties. As these options are used rarely, and only on top-level folders, we were able to remove ~40,000 rules with a few mouse clicks.
This step alone brought the time for building the core configuration down to 24 seconds (0.18 sec real).
In the next step we will replace the option "Add these groups as contacts to all hosts in this folder" in the permissions of the folder properties with a rule "Assignment of hosts to contact groups", one folder at a time, to get rid of the remaining ~1000 rules. In the end we should be able to get from 40,000 rules down to fewer than 100.
Side note: "… in this folder" is a barefaced lie, because the rule is put into the hosts.mk file of the current folder AND of all subfolders. The option "Add these groups as contacts to all hosts in all subfolders of this folder" has no effect on the hosts.mk files. The option "Always add these groups as contacts to all services in all subfolders of this folder" only puts rules into the hosts.mk files if the option "Add these groups as contacts to all hosts in this folder" is also selected.
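To check what the Setup has actually written on your own site, something like the following sketch could count the contact group rules across all hosts.mk files. It assumes that each rule sits on its own line and starts with host_contactgroups.insert( or service_contactgroups.insert( as in the examples above, and that the folder tree lives under etc/check_mk/conf.d/wato in the site directory. Both are assumptions you should verify on your system; the function name is made up.

```python
from collections import Counter
from pathlib import Path

# Line prefixes taken from the rule examples in this post.
PREFIXES = ("host_contactgroups.insert(", "service_contactgroups.insert(")


def count_contactgroup_rules(wato_dir):
    """Count contact group rule lines in every hosts.mk below wato_dir.

    Hypothetical helper: it only does a plain text scan and does not
    evaluate the files, so multi-line rules would be counted once.
    """
    counts = Counter()
    for hosts_mk in Path(wato_dir).rglob("hosts.mk"):
        with hosts_mk.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:
                stripped = line.lstrip()
                for prefix in PREFIXES:
                    if stripped.startswith(prefix):
                        # Key is the variable name, e.g. "service_contactgroups".
                        counts[prefix.split(".")[0]] += 1
    return counts


# Example call as site user (path is an assumption, adjust as needed):
# print(count_contactgroup_rules("etc/check_mk/conf.d/wato"))
```

Running it before and after changing the folder permissions gives a quick view of how many rules each option is responsible for.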
I want to say a big thank you to my colleagues and to the brilliant Checkmk support for helping to find the root cause of this.
Conclusion: It's not a solution to just add more resources to the server hosting the monitoring system when it becomes slow. It's always worth looking at how the setup is done and striving for a lean and efficient rule set.