CRM monitoring gives a different result than expected: it doesn’t report CRITICAL. According to the documentation, “The check will report a CRITICAL state when the reported state is not Started. In addition the check can report a problem if a resource is not handled by a specified node.”
Tested with the following
- CheckMK Enterprise 1.6.0p13
- RHEL 7.8
- Pacemaker 1.1.21-4 & corosync 2.4.5-4
- No rules defined for “Heartbeat CRM resource status” and “Heartbeat CRM general status”
Testing of CRM code
I created a cluster of 3 VMs and gave the cluster 1 resource.
- Killing the pacemaker process doesn’t work. It only results in no information for CheckMK
- Testing it with parameters doesn’t change the status, but gives more information about the cluster (after the fix)
- Disable the resource
pcs resource disable ClusterIP
This results in a resource that is not started.
The expectation was that the service check would go to warning or critical (based on the documentation).
Agent output
[root@cleint_vm02 ~]# pcs resource disable ClusterIP
[root@cleint_vm02 ~]# pcs resource
ClusterIP (ocf:IPaddr2): Stopped (disabled)
[root@cleint_vm02 ~]# TZ=UTC crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
Stack: corosync
Current DC: cleint_vm03 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Tue Oct 13 12:58:07 2020
Last change: Tue Oct 13 14:57:36 2020 by root via cibadmin on cleint_vm02
3 nodes configured
1 resource configured (1 DISABLED)
Online: [ cleint_vm01 cleint_vm02 devkladbgl03 ]
Full list of resources:
ClusterIP (ocf:IPaddr2): Stopped (disabled)
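To make the expectation concrete, here is a minimal sketch of the behaviour the documentation describes, applied to a resource line like the one above. It is not the actual Checkmk check code: the function name `resource_state`, the `expected_node` argument and the choice of CRITICAL for a node mismatch are assumptions for illustration only.

```python
def resource_state(line, expected_node=None):
    """Evaluate one resource line from `crm_mon -1 -r`.

    Returns (status, summary) with 0 = OK and 2 = CRITICAL.
    Hypothetical helper, not the real Checkmk implementation.
    """
    # Split at the end of the "(type)" part, because the type itself
    # can contain colons, e.g. "ocf::heartbeat:IPaddr2".
    head, _, tail = line.partition("): ")
    name = head.split(" (", 1)[0]
    fields = tail.split()
    state = fields[0] if fields else "unknown"
    # A second field is a node name unless it is a flag such as "(disabled)".
    node = None
    if len(fields) > 1 and not fields[1].startswith("("):
        node = fields[1]

    if state != "Started":
        return 2, "%s is %s" % (name, state)
    # Assumption: a node mismatch is also treated as CRITICAL here.
    if expected_node is not None and node != expected_node:
        return 2, "%s runs on %s, expected %s" % (name, node, expected_node)
    return 0, "%s is Started on %s" % (name, node)


print(resource_state("ClusterIP (ocf:IPaddr2): Stopped (disabled)"))
```

With this logic the disabled resource would end up CRITICAL, which is what I expected to see in the service.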
Side note: improvements to the code
I can also submit all the changes as a pull request to the Checkmk GitHub repository (Checkmk/checkmk) if wanted.
Problem
The error for Heartbeat CRM general status is:
Invalid parameter {'max_age': 60, 'num_resources': None, 'num_nodes': None}: %d format: a number is required, not NoneType
The CRM resource status check doesn’t give a weird message.
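The last part of that message is the standard Python error you get when `None` is passed to a `%d` placeholder. The sketch below only reproduces that mechanism. Where exactly this happens inside the check is not shown here, and `format_count` and `format_count_safe` are hypothetical helpers, not Checkmk functions.

```python
def format_count(label, value):
    # %d requires a number, so value=None raises TypeError.
    return "%s: %d" % (label, value)


def format_count_safe(label, value):
    # Guard against a count that was never parsed or configured.
    if value is None:
        return "%s: unknown" % label
    return "%s: %d" % (label, value)


try:
    format_count("Resources", None)
except TypeError as exc:
    error = str(exc)
    print(error)

print(format_count_safe("Resources", None))
print(format_count_safe("Resources", 1))
```

The exact wording of the TypeError differs slightly between Python versions, but it always names `NoneType`.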
Solution
Add an extra line of code:
elif ' '.join(line[1:3]).rstrip('.,').lower() == 'resource configured':
This is needed because my cluster had 1 resource and the code can’t handle the difference between “resource” and “resources”. Here is the output on my test setup:
[root@client_vm check_mk_agent]# TZ=UTC crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
Stack: corosync
Current DC: client_vm03 (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
Last updated: Tue Oct 13 09:25:18 2020
Last change: Thu May 21 00:02:18 2020 by root via cibadmin on client_vm01
3 nodes configured
1 resource configured
Online: [client_vm01 client_vm02 client_vm03 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started client_vm01
Online: [client_vm01 client_vm02 client_vm03 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started client_vm01
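For illustration, the singular/plural handling could look roughly like the sketch below. It uses the same `' '.join(line[1:3]).rstrip('.,').lower()` comparison as the line above, but the surrounding function `parse_counts` and its return format are assumptions of mine and not the structure of the real check.

```python
def parse_counts(agent_lines):
    """Extract node and resource counts from crm_mon summary lines.

    Hypothetical helper: accepts both "resource configured" and
    "resources configured" (and the same for nodes).
    """
    counts = {"num_nodes": None, "num_resources": None}
    for raw in agent_lines:
        line = raw.split()
        if len(line) < 3:
            continue
        phrase = " ".join(line[1:3]).rstrip(".,").lower()
        if phrase in ("node configured", "nodes configured"):
            counts["num_nodes"] = int(line[0])
        elif phrase in ("resource configured", "resources configured"):
            counts["num_resources"] = int(line[0])
    return counts


output = [
    "Stack: corosync",
    "3 nodes configured",
    "1 resource configured",
    "Full list of resources:",
]
print(parse_counts(output))
```

Because only the second and third words are compared, a line such as "1 resource configured (1 DISABLED)" is matched as well.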