Distributed Monitoring: No Website Checks if Master Instance is not available

CMK version: Checkmk Managed Services Edition 2.2.0p25
OS version: Debian 11.9

Hello everyone,

We are currently facing an issue with our Checkmk setup: the master instance is located in our data center, and we have Checkmk satellites running in the Azure cloud. The Azure instance performs the website checks, while the websites themselves are hosted in our data center.

During a recent storage issue in our data center, nearly all virtual machines were forced into read-only mode. Unfortunately, it seems that no website checks were executed during this outage, which has resulted in highly inaccurate SLA reports. I’ve attached an example report to illustrate the problem, along with a log file from one of our proxies that was unaffected by the read-only issue, as its storage remained intact.

In our setup, most website checks are performed using custom plugins. For instance, we use the following command to carry out the checks:
check-mk-custom!check_http --sni -t 15 -H $_HOSTURL$ -f $_HOSTFOLLOWREDIRECT$ -S -E

Our Azure site is connected to the data center via VPN, and the website checks are executed against the public IP addresses of the websites hosted in the data center.
However, at the exact moment when the master instance became unavailable due to the outage, it appears that all website checks ceased. This is puzzling, as the satellite instance in Azure should theoretically be capable of performing these checks independently. It raises the question of why these checks failed to execute once the master was offline.

I’m wondering if there is some kind of hidden dependency between the master instance and the satellite in Azure that could explain this behavior. Could it be that the master was responsible for scheduling the checks or processing their results, and if so, is there a way to configure the system to allow the checks to continue when the master is down?

Outage time: 10:29-10:49
Proxy_logfile.txt:
Src-IP: 50.50.50.50 => Customer CMK
100.100.100.100 => Our Azure CMK Site

I’d greatly appreciate any insights or recommendations on how to prevent this issue in the future and ensure that the website checks continue running smoothly even if the master instance becomes unavailable.

Thanks in advance for your help!

proxy_logfile.txt (4.9 KB)

Hi Marcel,

in a distributed environment, all satellites run their monitoring core without any dependency on the central instance.
So the check_http checks should run independently and without any problems if the central site is missing.
But there are other dependencies in infrastructure and services, and my guess would be that the general connectivity problems caused DNS or routing issues for the satellites in Azure, so that they were no longer able to resolve or reach the websites in your local data center. This might be the reason for the socket timeout. If the hostnames were previously resolved to internal IPs, that will not automatically change to the external ones when the tunnel to the data center goes down.

No, that can’t be the problem. We initially had the same assumption. However, we conducted several tests with an interrupted VPN connection, and all website checks continued to work without any issues.

I would also expect that DNS resolution problems or timeouts would be evident in the CMK events, but they are not.

Push
This is likely an incident and should be treated with priority.

This is not an official support forum; people help each other in their free time here.
If the problem has priority for you, open a support ticket with your partner or with Checkmk support.

Hi @mwester,

I fully agree with @aeckstein and this is simply not a support forum.

You said you have a site running on Azure that checks the hosts in your data center.
That means the hosts in Checkmk that carry the website check services are monitored by your satellite site. Can you please confirm this, maybe with a screenshot?
(Just go to the hosts and then to the host configuration)

As Andre already said, every Checkmk site runs on its own. The master uses the Livestatus connection to gather the data from the satellite sites when you browse the monitoring from the master.

Looking forward to your response.

Best Regards
Norm

Also, from my experience, there is no such problem. Every site acts on its own. The active checks either work or they don't, but they do not depend on other remote sites.
The screenshot and log from your first post don't show anything usable.
First I would look at the core log on all your sites. There you can see what happened at the time of your problem.
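
As a starting point for that, here is a minimal sketch that narrows a history log down to a time window. It assumes the lines begin with a Unix timestamp in square brackets, as in Nagios-style log files. The sample lines, the host names and the helper name `lines_in_window` are invented for illustration, and the actual log location depends on the site and the core in use.

```python
import re
from typing import Iterable, List

# Assumption: Nagios-style log lines start with "[<unix timestamp>] ".
TIMESTAMP = re.compile(r"^\[(\d+)\]\s")


def lines_in_window(lines: Iterable[str], start: int, end: int) -> List[str]:
    """Return the lines whose leading timestamp lies in [start, end]."""
    selected = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and start <= int(match.group(1)) <= end:
            selected.append(line.rstrip("\n"))
    return selected


# Invented sample data, not taken from the attached log.
sample = [
    "[1700000000] SERVICE ALERT: web01;HTTPS;CRITICAL;SOFT;1;Socket timeout\n",
    "[1700000600] SERVICE ALERT: web02;HTTPS;CRITICAL;HARD;1;Socket timeout\n",
    "[1700003000] SERVICE ALERT: web01;HTTPS;OK;HARD;1;HTTP OK\n",
]

for entry in lines_in_window(sample, 1700000000, 1700001200):
    print(entry)
```

For a real log, pass the open file object instead of `sample` and use the start and end of the outage as Unix timestamps.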

Thanks for your responses and ideas, and I apologize for my impatience. On our Azure instance, we have 558 hosts, each with two service checks. One is a certificate check using the following command:

check-mk-custom!check_http -N -t 15 -H $_HOSTURL$ -p $_HOSTPORT$ -f $_HOSTFOLLOWREDIRECT$ -S -C $_HOSTTLSEXPIREDAYS$ --sni

The other is a website check with the following command:

check-mk-custom!check_http --sni -t 15 -H $_HOSTURL$ -f $_HOSTFOLLOWREDIRECT$ -S -E

That amounts to 1,116 services, and each check is performed every minute.

Our downtime: 10:29-10:49
Hosts with the exact same downtime (Host Check Port 443): 2
Certificate service check: 6
Website service check: 13

Is there an explanation as to why the downtime was detected for only a few hosts/services and not for others?

A question regarding my first post: It is clear that no more checks were executed—this is visible in the proxy log file. So there should have been some indication in the events of a timeout or something else that points to an outage. The last check in the proxy log was at 10:29. However, the first socket timeout was only recorded in the events at 10:59 (also attached in the first post). How can this be explained?

Hey,
Just giving this a quick bump to see if anyone has any ideas or input

Without the information from the core history no one can say anything.

Hello everyone,

We believe we’ve found the cause. Our Azure Checkmk site only performs website and certificate checks using the Nagios plugin check_http.

In the event of a total infrastructure outage, the maximum number of concurrent active checks (currently 20) is reached. The maximum timeout per website check is 20 seconds. All other checks then become stale. If a check was still in an “OK” state before going stale, it is likely only executed again once our infrastructure is available. This could explain why the outage might not appear in the events.

Does this sound plausible to you?
During the outage, the “Active check helper usage” was also at 99.9%.
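
A rough back-of-the-envelope check of this explanation, using only the figures mentioned in this thread (558 hosts with two services each, a 1-minute interval, 20 concurrent active checks, a 20-second timeout, a 20-minute outage). It assumes, as a simplification, that during the outage every check runs into the full timeout and that the concurrency limit is the only bottleneck. The function and variable names are made up for illustration.

```python
def max_checks_per_minute(helpers: int, seconds_per_check: float) -> float:
    """Upper bound on checks completed per minute when every check
    occupies one helper for seconds_per_check seconds."""
    return helpers * 60.0 / seconds_per_check


# Figures taken from the posts in this thread.
hosts = 558
services_per_host = 2
helpers = 20            # maximum number of concurrent active checks
timeout_s = 20          # maximum time one website check may take
outage_minutes = 20     # 10:29 to 10:49

services = hosts * services_per_host                   # checks due per minute
capacity = max_checks_per_minute(helpers, timeout_s)   # checks possible per minute
executions_per_service = capacity * outage_minutes / services

print(f"checks due per minute:      {services}")
print(f"checks possible per minute: {capacity:.0f}")
print(f"executions per service during the outage: {executions_per_service:.2f}")
```

Under these assumptions only about 60 of the 1,116 checks due each minute can complete, so on average each service would be executed roughly once in the whole 20-minute window. That would fit most services going stale rather than reporting a timeout.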
