I could not find a similar question in the docs or other community posts, so I’m trying to make my own.
I would like to monitor a cluster with more than two nodes, in my case 6. I can’t seem to understand how to set how many nodes can fail in the cluster before raising a CRIT alert.
Example:
I have my six hosts with their services, and I have already set which service is to be considered part of the cluster, as below:
host1 → clustered_service
host2 → clustered_service
host3 → clustered_service
host4 → clustered_service
host5 → clustered_service
host6 → clustered_service
So I created my cluster host, in which I can now see the “clustered_service” (which has correctly disappeared from the cluster nodes).
Now I would like to raise an alert when this service fails on more than one node: so service failed on one node → OK, service failed on two nodes → CRIT.
How can I do this? Can this threshold be set arbitrarily?
The behavior depends on the check.
Please see Add predefined cluster modes for all services or have a look at the man page of the check.
In the worst case you have to consult the code of the check to understand the cluster logic behind it.
You’re right. It’s not actually a managed cluster, but a farm of identical servers exposing the same services. On top of them there’s a VIP which dispatches every request to one of the servers based on some policies.
Yes, this is not a “cluster” for Checkmk.
A cluster for Checkmk is a system where a service check migrates between the nodes (e.g. corosync/pacemaker or Windows Cluster).
Sorry, I completely disagree. With the help of the cluster function in plugin API 1.0 you can represent any kind of cluster, even a farm of services, probably with a load balancer in front, where you check e.g. that a certain number of services are in an OK state.
Here is a bare example which checks that a software state is “running” on at least 3 nodes. This value could be parametrized, and a rule could be built to make it flexible.
def cluster_check_my_cluster_check(section):
    running_nodes = []
    connections = 0
    for node_name, node_section in section.items():
        run_state = node_section[0][0]
        node_connections = int(node_section[1][0])
        if run_state == 'running':
            running_nodes.append(node_name)
            connections += node_connections
    # Build a synthetic single-node section for the regular check function:
    # the cluster counts as "running" only if at least 3 nodes are running.
    if len(running_nodes) < 3:
        clustered_section = [['stopped'], [connections]]
    else:
        clustered_section = [['running'], [connections]]
    yield from check_my_cluster_check(clustered_section)
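The hard-coded minimum of 3 running nodes could be exposed as a parameter. The following self-contained sketch (plain Python; the name aggregate_cluster and the section layout are illustrative assumptions, not part of the Checkmk plugin API) shows the same aggregation logic with the threshold as an argument:

```python
# Sketch of the aggregation logic above with a configurable threshold.
# aggregate_cluster and the section layout are hypothetical, chosen to
# mirror the forum example, not Checkmk API names.

def aggregate_cluster(section, min_running=3):
    """Collapse per-node sections into one synthetic section.

    'section' maps node names to per-node data of the form
    [['running' | 'stopped'], ['<connection count>']].
    """
    running_nodes = []
    connections = 0
    for node_name, node_section in section.items():
        if node_section[0][0] == 'running':
            running_nodes.append(node_name)
            connections += int(node_section[1][0])
    state = 'running' if len(running_nodes) >= min_running else 'stopped'
    return [[state], [connections]]

# Example: three of four nodes running.
nodes = {
    'host1': [['running'], ['10']],
    'host2': [['running'], ['7']],
    'host3': [['stopped'], ['0']],
    'host4': [['running'], ['5']],
}
print(aggregate_cluster(nodes))                  # [['running'], [22]]
print(aggregate_cluster(nodes, min_running=4))   # [['stopped'], [22]]
```

In a real plugin this threshold would come from a ruleset, and the resulting section would be fed to the normal check function as in the example above.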
In the ruleset “Aggregation options for clustered services” there are already three different predefined aggregation modes (“failover”, “worst”, “best”).
This is no longer true: there is an aggregation options rule for clusters where you can also define “best” and “worst” aggregations for a service in a cluster, without the need for BI. BI is just more flexible regarding aggregations, but has other disadvantages.
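As a rough illustration of what the “worst” and “best” modes do, the cluster service takes the most (or least) severe state reported by any node. This is a plain-Python sketch, not Checkmk code; the severity ordering OK &lt; WARN &lt; UNKNOWN &lt; CRIT is an assumption modeled on common monitoring conventions:

```python
# Illustrative sketch of "worst"/"best" aggregation over node states.
# Severity ordering is an assumption: OK < WARN < UNKNOWN < CRIT.

SEVERITY = {'OK': 0, 'WARN': 1, 'UNKNOWN': 2, 'CRIT': 3}

def aggregate(node_states, mode='worst'):
    # "worst": the cluster service shows the most severe node state;
    # "best": it shows the least severe one.
    pick = max if mode == 'worst' else min
    return pick(node_states, key=SEVERITY.__getitem__)

print(aggregate(['OK', 'CRIT', 'OK']))           # CRIT
print(aggregate(['OK', 'CRIT', 'OK'], 'best'))   # OK
```

Neither mode covers the “at least N nodes OK” requirement from the original question, which is why the custom cluster check function above is still needed for that case.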
Ok, thank you.
However, here I have Checkmk 2.0 and no option to upgrade to 2.1 in the near future.
If I understand correctly then, my only option with version 2.0 would be to use Business Intelligence. Correct?
I would strongly recommend waiting for the update to the next Checkmk version. BI is way more complicated, will not automatically appear as a service, and at some point in bigger setups has bad performance, because it is generated basically in the frontend and needs to query data using Livestatus.
On the other hand, with the aggregation feature for clusters you don’t have these problems, and you get out of the box a real service which acts like every other service you already know.