I could not find a similar question in the docs or other community posts, so I’m trying to make my own.
I would like to monitor a cluster with more than two nodes, in my case 6. I can’t seem to understand how to set how many nodes can fail in the cluster before raising a CRIT alert.
Example:
I have my six hosts with their services, and I have already set which service is to be considered part of the cluster, as below:
host1 → clustered_service
host2 → clustered_service
host3 → clustered_service
host4 → clustered_service
host5 → clustered_service
host6 → clustered_service
So I created my cluster host, in which I can now see the “clustered_service” (which has correctly disappeared from the cluster nodes).
Now I would like to raise an alert when this service fails on more than one node: so service failed on one node → OK, service failed on two nodes → CRIT.
How can I do this? Can this threshold be set arbitrarily?
The behavior depends on the check.
Please see Add predefined cluster modes for all services or have a look at the man page of the check.
In the worst case you have to consult the code of the check to understand the cluster logic behind it.
You’re right. It’s not actually a managed cluster, but a farm of identical servers exposing the same services. On top of them there’s a VIP which dispatches every request to one of the servers based on some policies.
Yes, this is not a “cluster” for Checkmk.
A cluster for Checkmk is a system where a service check migrates between the nodes (e.g. corosync/pacemaker or Windows Cluster).
Sorry, I completely disagree. With the help of the cluster function in plugin API 1.0 you can represent any kind of cluster, even a farm of services, probably with a load balancer in front, where you check e.g. that a certain number of services are in an OK state.
Here is a bare example which checks that a software state is “running” on at least 3 nodes. This value could be parametrized, and a rule could be built to make it flexible.
def cluster_check_my_cluster_check(section):
    running_nodes = []
    connections = 0
    for node_name, node_section in section.items():
        run_state = node_section[0][0]
        node_connections = int(node_section[1][0])
        if run_state == 'running':
            running_nodes.append(node_name)
            connections += node_connections
    # Build a synthetic single-node section for the regular check function:
    # the cluster counts as "running" only if at least 3 nodes are running.
    if len(running_nodes) < 3:
        clustered_section = [['stopped'], [connections]]
    else:
        clustered_section = [['running'], [connections]]
    yield from check_my_cluster_check(clustered_section)
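The hard-coded minimum of 3 running nodes could be exposed as a parameter. The following self-contained sketch (plain Python; the name aggregate_cluster and the section layout are illustrative assumptions, not part of the Checkmk plugin API) shows the same aggregation logic with the threshold as an argument:

```python
# Sketch of the aggregation logic above with a configurable threshold.
# aggregate_cluster and the section layout are hypothetical, chosen to
# mirror the forum example, not Checkmk API names.

def aggregate_cluster(section, min_running=3):
    """Collapse per-node sections into one synthetic section.

    'section' maps node names to per-node data of the form
    [['running' | 'stopped'], ['<connection count>']].
    """
    running_nodes = []
    connections = 0
    for node_name, node_section in section.items():
        if node_section[0][0] == 'running':
            running_nodes.append(node_name)
            connections += int(node_section[1][0])
    state = 'running' if len(running_nodes) >= min_running else 'stopped'
    return [[state], [connections]]

# Example: three of four nodes running.
nodes = {
    'host1': [['running'], ['10']],
    'host2': [['running'], ['7']],
    'host3': [['stopped'], ['0']],
    'host4': [['running'], ['5']],
}
print(aggregate_cluster(nodes))                  # [['running'], [22]]
print(aggregate_cluster(nodes, min_running=4))   # [['stopped'], [22]]
```

In a real plugin this threshold would come from a ruleset, and the resulting section would be fed to the normal check function as in the example above.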
In the ruleset “Aggregation options for clustered services” there are already three different predefined aggregation modes (“failover”, “worst”, “best”).
This is no longer true: there is an aggregation options rule for clusters where you can also define “best” and “worst” aggregations for a service in a cluster, without the need for BI. BI is just more flexible regarding aggregations, but has other disadvantages.
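As a rough illustration of what the “worst” and “best” modes do, the cluster service takes the most (or least) severe state reported by any node. This is a plain-Python sketch, not Checkmk code; the severity ordering OK &lt; WARN &lt; UNKNOWN &lt; CRIT is an assumption modeled on common monitoring conventions:

```python
# Illustrative sketch of "worst"/"best" aggregation over node states.
# Severity ordering is an assumption: OK < WARN < UNKNOWN < CRIT.

SEVERITY = {'OK': 0, 'WARN': 1, 'UNKNOWN': 2, 'CRIT': 3}

def aggregate(node_states, mode='worst'):
    # "worst": the cluster service shows the most severe node state;
    # "best": it shows the least severe one.
    pick = max if mode == 'worst' else min
    return pick(node_states, key=SEVERITY.__getitem__)

print(aggregate(['OK', 'CRIT', 'OK']))           # CRIT
print(aggregate(['OK', 'CRIT', 'OK'], 'best'))   # OK
```

Neither mode covers the “at least N nodes OK” requirement from the original question, which is why the custom cluster check function above is still needed for that case.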
Ok, thank you.
However, here I have Checkmk 2.0 and no option to upgrade to 2.1 in the near future.
If I understand correctly then, my only option with version 2.0 would be to use Business Intelligence. Correct?
I would strongly recommend waiting for the update to the next Checkmk version. BI is way more complicated, will not automatically appear as a service, and at some point in bigger setups has bad performance, because it is generated basically in the frontend and needs to query data using Livestatus.
On the other hand, with the aggregation feature for clusters you don’t have these problems, and you get out of the box a real service which acts like every other service you already know.