Services - Availability takes a very long time or times out and shows an error message

Bjorn.A · May 14, 2024, 9:57am

CMK version: RAW Edition 2.2.0p26
OS version: RHEL 9.4

Error message: Unhandled exception: Your request timed out after 110 seconds. This issue may be related to a local configuration problem or a request which works with a too large number of objects. But if you think this issue is a bug, please send a crash report.

Hi!

We have a site with around 380 hosts/service groups and 9000 service checks. We have given the Checkmk server 8 CPU cores and 16 GB RAM. Everything in the GUI responds pretty fast, and checks only become stale for a short time when we change a lot in the GUI.

The problem is that every time we try to view availability, for example for last month, it takes one or two minutes. Sometimes it works, and sometimes it shows no entries or times out with the message pasted above. We tried to view availability for services on both hosts and service groups with between 10 and 40 service checks. Sometimes we also have to run omd stop and then omd start to recover fully, since the site becomes more or less unresponsive after we try to view availability. This feature used to work when we only had around 100 hosts/service groups and 2500 service checks.

Is this a limitation of the Raw Edition, so that it can’t use all the CPU cores? In Nagios Core, which we left when we migrated to Checkmk, getting availability for last month only takes a few seconds. Is it really that CPU-intensive to get availability reports for last month for some hosts/service groups with 10-40 service checks?

We would be grateful to hear if there is a setting somewhere that we can change, or for any other tips to get this resolved.

Best Regards
Björn Ahlman

aeckstein · May 14, 2024, 5:54pm

Hi Björn,

I think the main problem will be disk latency and/or CPU usage during the creation of the availability views.

From the documentation:

When calculating availability, the complete history of the selected object must be reopened. How that works in detail can be learned further below. Especially in the Checkmk Raw Edition, the analysis can take some time, since its core has no cache for the required data and the text-based log data must be sequentially searched…

The Enterprise Edition has a built-in availability cache that is filled after the site is started, which makes this far more efficient and really fast.
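
To illustrate why a sequential scan of text-based history gets slower as the log grows, here is a minimal sketch in Python. It is not Checkmk's actual implementation. It assumes Nagios-style `SERVICE ALERT` lines of the form `[timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output`, and the function names `scan_transitions` and `availability` are made up for this example. Every line in the selected time range has to be read and matched, even when only one service is of interest.

```python
import re

# Assumed Nagios-style history line:
# [1715673600] SERVICE ALERT: host;service;STATE;HARD|SOFT;attempt;output
LINE_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);([^;]*);"
)


def scan_transitions(lines, host, service):
    """Yield (timestamp, state) for hard state changes of one service.

    All lines are inspected one by one; there is no index or cache,
    so the cost grows with the total size of the history.
    """
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        ts, line_host, line_service, state, state_type = match.groups()
        if line_host == host and line_service == service and state_type == "HARD":
            yield int(ts), state


def availability(transitions, start, end, initial_state="OK"):
    """Return the fraction of [start, end) spent in state OK.

    `transitions` must be sorted by timestamp. `initial_state` is the
    state assumed before the first transition (an assumption of this sketch).
    """
    ok_time = 0
    current_state = initial_state
    current_time = start
    for ts, state in transitions:
        if ts <= start:
            # Transition before the window only sets the starting state.
            current_state = state
            continue
        if ts >= end:
            break
        if current_state == "OK":
            ok_time += ts - current_time
        current_time = ts
        current_state = state
    if current_state == "OK":
        ok_time += end - current_time
    return ok_time / (end - start)
```

With a history containing one CRITICAL period from timestamp 100 to 150 inside a window of 0 to 200, `availability(scan_transitions(lines, "web01", "HTTP"), 0, 200)` gives 0.75. The host and service names here are placeholders.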

Bjorn.A · May 15, 2024, 3:26pm

Hi Andre,

Thanks for the information. We have increased the number of CPU cores to 12 (memory is 16 GB and only 7 GB is used). We have then done some more tests, and here are the results:

After a reboot of Checkmk we could get availability reports for all services on three different hosts that have around 40 services each. The reports took between 20 and 25 seconds to generate but they worked. We then tried to generate availability reports again for all services on these hosts and it worked one more time but took a bit longer, around 30 seconds. The third time it did not work anymore and “No entries.” is displayed. It seems that it only works a few times after a reboot and then the problems start again.

We also got one notice “OMD Monitoring performance” - “Site is currently not running” after one attempt to generate an availability report. The notice went back to OK after a few minutes.

Are there any special log files we could look in that could provide more information about this problem? Can we increase some kind of timeout value that might be stopping the availability reports from being created?

Best Regards
Björn Ahlman

aeckstein · May 15, 2024, 5:52pm

Hi,

I think that the problem will be related to disk I/O and disk latency.
What is the storage backend of the Checkmk server?

If monitoring keeps working during the problems, you can analyze the performance with the checks and metrics of the Checkmk server itself.
If not, you can use tools like sar, iotop and top to check latency and wait times during the creation of the reports.
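
If those tools are not at hand, the same kind of I/O wait figure can be approximated from `/proc/stat` on Linux. The following is a minimal sketch, not a replacement for sar or iotop. The function names are made up for this example, and it assumes the usual layout of the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
import time


def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line.

    Fields: user, nice, system, idle, iowait, irq, softirq, steal.
    The guest fields are skipped because they are already part of user/nice.
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in parts[1:9]]


def iowait_fraction(before_line, after_line):
    """Fraction of CPU time spent waiting for I/O between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is iowait


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


def measure(interval_seconds=5):
    """Sample /proc/stat twice and return the iowait fraction in between."""
    before = read_cpu_line()
    time.sleep(interval_seconds)
    after = read_cpu_line()
    return iowait_fraction(before, after)
```

Calling `measure()` while an availability report is being generated, and again while the site is idle, gives a rough comparison. A clearly higher value during report generation would point toward the disk rather than the CPU.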

Maybe it's a good reason to switch to the Enterprise Edition.

Bjorn.A · August 14, 2024, 3:28pm

Hi,

Thanks for the information. It’s probably related to disk latency, as you said.
Yes, we are planning to upgrade to the Enterprise Edition pretty soon, so that should solve the problem.

Best Regards
Björn Ahlman