Strange behaviour of Check_MK Service on Cluster Node

Hi all,

cmk version: 2.0.0p18 (CEE)

We noticed a strange behavior on our cluster hosts, which usually have two nodes. They get their data from checkmk agents.
When one of the cluster nodes is down, the Check_MK and the Check_MK Discovery services go Critical on the cluster host in checkmk (Summary: “Communication failed: timed out” or “no route to host”).

The services which are assigned to the cluster are still working as expected; they get their data from the other node.

We expected the Check_MK service not to go Critical, since the other node is still running.

Does anyone have an idea how we can change this behavior?

Best Regards
Thomas

AFAIK not all checks in 2.0 are cluster aware anymore because of the new API. You need to migrate these checks using the new cluster logic provided by the APIv1.
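To illustrate what "cluster aware" means here: a cluster-aware check sees the data per node instead of one merged result, and decides itself how to handle a node that delivered nothing. The following is a minimal sketch in plain Python and deliberately does not import Checkmk; `cluster_check_failover`, `Section` and the sample data are hypothetical names for illustration. As far as I know, the real APIv1 counterpart is a `cluster_check_function` registered together with the check plugin, but please verify the exact signature in the official docs.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical parsed agent section: metric name -> value.
Section = Dict[str, float]


def cluster_check_failover(
    sections: Mapping[str, Optional[Section]],
) -> Tuple[int, str]:
    """Return (state, summary) using the first node that delivered data.

    A node that is down contributes None instead of a section.
    States follow the usual monitoring convention: 0 = OK, 2 = CRIT.
    """
    for node, section in sections.items():
        if section is not None:
            return 0, f"Data from active node {node}: {section}"
    return 2, "No node delivered data"


if __name__ == "__main__":
    # nodea is down (no data), nodeb is currently serving the cluster.
    print(cluster_check_failover({"nodea": None, "nodeb": {"connections": 12.0}}))
```

With this failover-style logic the clustered service stays OK as long as at least one node answers, which matches what Thomas observes for his clustered services.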

regards

Michael

We don’t have any issues with the other services (most of them are local checks); we only have a problem with the Check_MK and the Check_MK Discovery services when one node of the cluster is down.
Here is an example screenshot:

Best Regards
Thomas

We never put these two services on the cluster host object, and I guess they are not cluster aware. Leave them on the host object.

BR

MF

We have not manually assigned the Check_MK and the Check_MK Discovery services to the cluster; they appear automatically when we assign some checkmk agent based services (e.g. Process Monitoring) to the cluster host.
Did you manually remove both services from the cluster, or is it a configuration problem on our side that these services are automatically added to the cluster host?

Best Regards
Thomas

I can speak only for 1.6.
There is a rule “Clustered services” with which you assign specific services to the cluster object.
Please check this rule and only assign the real clustered services.

From your screenshot I see that it’s about DB2 monitoring. At least for 1.6 I know that these services are not cluster aware. We did our own development for DB2 on AIX.

regards

Michael

Same situation with 2.0.0p12

The cluster host has its own IP address (VIP cluster), which is reachable from check_mk

image

OK, this time it’s about MSSQL and a different agent version, so you have this issue on different systems? If I understand correctly, you create three host objects: one for NodeA, one for NodeB and a third one for the virtual IP?
To be honest, I have no idea why this happens on your virtual IP; we never did it this way, for various reasons. Normally the virtual IP should move to the former standby node, but maybe a firewall is blocking it because the new primary node is in a different network segment, or the virtual IP is not bound to the checkmk agent on this specific node…

The way we do it is to use the rule “Clustered services” to assign ONLY the services which are clustered to a cluster object. Create a host object for NodeA and NodeB and then a Cluster Host Object (Button New Cluster) and add as hosts NodeA and NodeB.
That way you avoid double monitoring of system-specific checks like DF etc. (OK, you can disable them on your host object using the virtual IP, but that’s not efficient)
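As a rough illustration of the "assign ONLY the clustered services" idea, here is a small plain-Python sketch that sorts service descriptions into those that belong on the cluster object and those that stay on the nodes. It is not the rule itself and does not touch any Checkmk configuration; the patterns, the service names and the function `split_services` are made up for the example, and matching at the start of the description is an assumption about how the rule condition behaves.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical patterns for services that really fail over with the cluster.
CLUSTERED_PATTERNS = ["DB2 ", "MSSQL "]


def split_services(
    descriptions: Iterable[str], patterns: List[str]
) -> Dict[str, List[str]]:
    """Sort service descriptions into 'cluster' and 'node' buckets.

    Patterns are matched at the start of the description (re.match).
    """
    compiled = [re.compile(p) for p in patterns]
    result: Dict[str, List[str]] = {"cluster": [], "node": []}
    for desc in descriptions:
        bucket = "cluster" if any(c.match(desc) for c in compiled) else "node"
        result[bucket].append(desc)
    return result


if __name__ == "__main__":
    services = ["DB2 Connections", "Filesystem /", "CPU load"]
    print(split_services(services, CLUSTERED_PATTERNS))
```

System-specific services such as the filesystem check end up in the node bucket, so they are monitored once per node and not a second time on the cluster object.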

See the official docs for details:

regards

Michael

Hi Mike,
I don’t know if you missed it, but Davide is a different customer, not Thomas. So at least two customers already have a problem with cluster nodes.

I’m also not sure the actual problem has been understood. We have an AIX HACMP cluster consisting of two physical nodes (A and B) and a total of 3 IPs. Each of the two nodes has a persistent IP, and the cluster service is reachable via another IP (let’s call it the shared IP) that moves from node A to node B and vice versa, depending on which node fails. This IP is therefore permanently reachable, regardless of whether a node fails or not. In checkmk we create two hosts and then an additional cluster host (the way you described), which consists of the two hosts. The name of the cluster object resolves to the shared IP in DNS.

Our expectation is therefore that if a node fails, the cluster object stays green. Instead, the two services Check_MK and Check_MK Discovery go red. The whole thing is a checkmk default installation. As soon as we create a cluster object and assign some services to it, the two additional services Check_MK and Check_MK Discovery appear there as well. This is the default; we have not actively assigned these services there. So the question is: why does checkmk show the cluster host as unreachable, although the shared IP is reachable from checkmk via its DNS name? From our point of view this is definitely a bug.

regards
Christian (colleague of Thomas)

Only the host check of the cluster IP will be “green”.

Please enter the cluster IP directly on the cluster object.
Then this cluster is handled as a normal host for the host check command.
You will see a ping check or the smart ping as host check.

If no IP is configured, CMK does not try to resolve the hostname the way it does for normal hosts.

The host check output looks like this → Assumed up, because at least one parent is up
Can you show what the output looks like if one cluster node is down? Unreachable cannot be the right state.

For the other problem with the CMK service and the discovery service this is perfectly fine if no other behavior of the service is configured. You have a communication problem with one cluster node and it is shown as timeout/no route to host or any other error message for a communication problem.

Both services do not communicate with the cluster IP, but separately with the node IPs.
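The behaviour described above can be modelled as a "worst state wins" aggregation over the individual nodes. The sketch below is plain Python and only a model of what is described in this thread, not how Checkmk implements the Check_MK service; the function `aggregate` and the mode names are hypothetical. It is restricted to OK/WARN/CRIT so that a simple numeric comparison is enough.

```python
from typing import Mapping

OK, WARN, CRIT = 0, 1, 2


def aggregate(node_states: Mapping[str, int], mode: str = "worst") -> int:
    """Combine per-node states into one state for the cluster object.

    mode="worst": one failing node is enough to make the result fail.
    mode="best":  one healthy node is enough to keep the result OK.
    """
    if not node_states:
        raise ValueError("at least one node state is required")
    if mode == "worst":
        return max(node_states.values())
    if mode == "best":
        return min(node_states.values())
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    states = {"nodea": CRIT, "nodeb": OK}  # nodea is unreachable
    print(aggregate(states, "worst"))  # CRIT, as seen on the cluster host
    print(aggregate(states, "best"))   # OK, what the posters expected
```

With "worst", a single unreachable node turns the cluster's Check_MK service Critical even though the clustered services themselves still get data from the other node.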

Apologies for the mistake; I indeed didn’t notice that it’s another user.

We have exactly the same situation with DB2 on AIX HACMP. What I can definitely say is that the DB2 checks are not cluster aware in 1.6. We haven’t tested in 2.0, but I am sure that they still run the same code as in 1.6 for DB2. Because of that we did a complete rework of the DB2 plugin and checks.

Our configuration looks like this. Maybe you can compare.

OMD[master]:~$ cmk -D DB2RC4

DB2RC4 (cluster of nodea, nodeb)
Addresses:              10.20.30.40
Tags:                   [address_family:ip-v4-only], [agent:cmk-agent], [aix:aix], [criticality:prod], [hosttype:aix_vm], [ip-v4:ip-v4], [networking:lan], [piggyback:auto-piggyback], [site:master], [snmp_ds:no-snmp], [tcp:tcp]
Agent mode:             Normal Checkmk agent, or special agent if configured
OMD[master]:~$ cmk -D nodea

nodea
Addresses:              10.20.30.41
Tags:                   [address_family:ip-v4-only], [agent:cmk-agent], [aix:aix], [application:db2], [criticality:prod], [hosttype:aix_vm], [ip-v4:ip-v4], [networking:lan], [piggyback:auto-piggyback], [site:master], [snmp_ds:no-snmp], [tcp:tcp]
Agent mode:             Normal Checkmk agent, or special agent if configured
OMD[master]:~$ cmk -D nodeb

nodeb
Addresses:              10.20.30.42
Tags:                   [address_family:ip-v4-only], [agent:cmk-agent], [aix:aix], [application:db2], [criticality:prod], [hosttype:aix_vm], [ip-v4:ip-v4], [networking:lan], [piggyback:auto-piggyback], [site:master], [snmp_ds:no-snmp], [tcp:tcp]
Agent mode:             Normal Checkmk agent, or special agent if configured

regards

Michael

Hi Andreas,

He should already see this before the switch to the former passive node, or did I miss something here?

regards

Michael

Same with version 2.1.0p12.

Here is my case:
Conf:

  • I created a Cluster Host with 5 hosts, then I shut down one of them.
  • Cluster Host has “No IP” as “IP address family”
  • all Hosts have “Use the status of Checkmk Agent” as “Host Check Command”

Test 1:
Shut down one of the hosts

Results 1:
The Check_MK service goes to (Service Check Timed Out)
All clustered services go stale

Test 2:
Put the shut-down host into downtime and/or acknowledge the problem

Results 2:
Nothing changes compared to “Results 1”

Test 3:
run cmk -v cluster-host

Results 3:
The services were updated but remained “Stale”

Obviously, if I remove the shut-down host from the cluster host, everything works as expected, but…

Maybe I have found the problem:
Excerpt from the command output:


As shown in the image, the execution_time is more than 1 minute; I think this is why all services remain stale.

Note: my “Agent TCP connect timeout” parameter is set to 120s

Why? This should not be more than 2-3 seconds.
It is only the timeout for the initial connect, not the whole time needed for the transfer.
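A quick back-of-the-envelope sketch of why the 120s value hurts, in plain Python. The assumptions are mine and not confirmed in this thread: the nodes are queried one after the other, a reachable node answers in about 2 seconds, and the check interval is 1 minute. `worst_case_runtime` and `fits_interval` are hypothetical helper names.

```python
CHECK_INTERVAL_S = 60.0  # assumed check interval of 1 minute


def worst_case_runtime(
    nodes_up: int,
    nodes_down: int,
    connect_timeout_s: float,
    fetch_time_s: float = 2.0,  # assumed time to fetch data from a reachable node
) -> float:
    """Estimate the run time of one cluster check with sequential node queries.

    Reachable nodes cost the fetch time; unreachable nodes cost the full
    connect timeout, because the connection attempt has to expire first.
    """
    return nodes_up * fetch_time_s + nodes_down * connect_timeout_s


def fits_interval(runtime_s: float, interval_s: float = CHECK_INTERVAL_S) -> bool:
    """True if the check finishes before the next one is due."""
    return runtime_s < interval_s


if __name__ == "__main__":
    # 5-node cluster with one node shut down, as in the test above.
    for timeout in (120.0, 3.0):
        runtime = worst_case_runtime(nodes_up=4, nodes_down=1, connect_timeout_s=timeout)
        print(timeout, runtime, fits_interval(runtime))
```

Under these assumptions a 120s connect timeout alone pushes the run time past one minute as soon as a single node is down, while a timeout of a few seconds leaves plenty of headroom.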

Many thanks Andreas
You’re right! This was the problem. Now there aren’t any stale services anymore.
But is it possible to set “best” for the Check_MK service?

Otherwise, as you can see in the image, it goes to Critical if one of the hosts is unreachable

Is there no way to manage this?