CRM resources monitoring gives a different result than expected

keylane_sbaas · October 13, 2020, 4:32pm

CRM monitoring gives a different result than expected: it doesn’t go critical. According to the documentation, “The check will report a CRITICAL state when the reported state is not Started. In addition the check can report a problem if a resource is not handled by a specified node.”

Tested with the following

CheckMK Enterprise 1.6.0p13
RHEL 7.8
Pacemaker 1.1.21-4 & corosync 2.4.5-4
No rules defined for “Heartbeat CRM resource status” and “Heartbeat CRM general status”

Testing of CRM code

I created a cluster of 3 VMs and gave the cluster 1 resource.

Killing the pacemaker process doesn’t work: it only results in no information reaching CheckMK.
Testing with parameters set doesn’t change the status, but gives more information about the cluster (after the fix).

Disable the resource

pcs resource disable ClusterIP

This results in a resource that is not started.

The expectation was that the service check would go to warning or critical (based on the documentation).

Agent output

[root@cleint_vm02 ~]# pcs resource disable ClusterIP
[root@cleint_vm02 ~]# pcs resource
ClusterIP (ocf:IPaddr2): Stopped (disabled)
[root@cleint_vm02 ~]# TZ=UTC crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
Stack: corosync
Current DC: cleint_vm03 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Tue Oct 13 12:58:07 2020
Last change: Tue Oct 13 14:57:36 2020 by root via cibadmin on cleint_vm02
3 nodes configured
1 resource configured (1 DISABLED)
Online: [ cleint_vm01 cleint_vm02 devkladbgl03 ]
Full list of resources:
ClusterIP (ocf:IPaddr2): Stopped (disabled)
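
For reference, the behaviour quoted from the documentation can be sketched in a few lines of Python. This is not the Checkmk check itself: `evaluate_resource`, the regular expression and the choice of CRIT for a resource on the wrong node are assumptions made for illustration only.

```python
import re

# Hypothetical helper, NOT the Checkmk implementation.
# Matches lines such as "ClusterIP (ocf:IPaddr2): Stopped (disabled)".
RESOURCE_LINE = re.compile(
    r"^(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>\S+)(?:\s+(?P<rest>.*))?$"
)

OK, CRIT = 0, 2


def evaluate_resource(line, expected_node=None):
    """Return (state, text) following the documented rule:
    CRIT when the resource is not Started, and (assumption) CRIT when
    it runs on a node other than expected_node."""
    match = RESOURCE_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised resource line: %r" % line)
    name = match.group("name")
    state = match.group("state")
    rest = match.group("rest") or ""  # node name, or a note like "(disabled)"
    if state != "Started":
        return CRIT, "%s is %s" % (name, (state + " " + rest).strip())
    if expected_node is not None and rest != expected_node:
        return CRIT, "%s is started on %s, expected %s" % (name, rest, expected_node)
    return OK, "%s is Started on %s" % (name, rest)


result = evaluate_resource("ClusterIP (ocf:IPaddr2): Stopped (disabled)")
```

With the disabled resource from the agent output above, such a rule would yield a critical state, which is what the documentation led me to expect.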

Side note: improvements to the code

I can also submit all the changes as a merge request to the Checkmk/checkmk repository on GitHub, if wanted.

Problem

The error for Heartbeat CRM general status is

Invalid parameter {'max_age': 60, 'num_resources': None, 'num_nodes': None}: %d format: a number is required, not NoneType

The CRM resource status check doesn’t give a weird message.
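
The error can be reproduced in isolation: formatting `None` with `%d` raises a `TypeError`. The sketch below shows that, plus one possible guard. `format_counts` is a hypothetical helper, not part of the Checkmk code, and the exact wording of the exception text differs between Python versions.

```python
def format_counts(num_nodes, num_resources):
    """Describe the cluster size, tolerating counts that were never parsed."""
    def fmt(value):
        # "%d" only accepts numbers, so None is replaced by a placeholder.
        return "%d" % value if value is not None else "unknown"
    return "Nodes: %s, Resources: %s" % (fmt(num_nodes), fmt(num_resources))


# Parameters as shown in the error message.
params = {"max_age": 60, "num_resources": None, "num_nodes": None}

try:
    "%d" % params["num_resources"]
    error = None
except TypeError as exc:
    # The unguarded formatting fails because the count is None.
    error = str(exc)

summary = format_counts(params["num_nodes"], params["num_resources"])
```

A guard like this only hides the symptom; the counts are `None` because the parser never recognised the corresponding lines of the agent output.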

Solution

Add an extra line of code:

elif ' '.join(line[1:3]).rstrip('.,').lower() == 'resource configured':

This is needed because my cluster had 1 resource and the code can’t handle the difference between “resource” and “resources”. Here is the output of the agent command on my test setup:

[root@client_vm check_mk_agent]# TZ=UTC crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
Stack: corosync
Current DC: client_vm03 (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
Last updated: Tue Oct 13 09:25:18 2020
Last change: Thu May 21 00:02:18 2020 by root via cibadmin on client_vm01
3 nodes configured
1 resource configured
Online: [ client_vm01 client_vm02 client_vm03 ]
Full list of resources:
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started client_vm01
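
An alternative to adding one branch per spelling would be to match singular and plural in one pattern. The following is a minimal standalone sketch, not the existing check code: `parse_configured_counts` and `COUNT_LINE` are hypothetical names, and only the line shapes shown in the agent output above are assumed.

```python
import re

# Accepts "3 nodes configured", "1 node configured",
# "1 resource configured" and "1 resource configured (1 DISABLED)".
COUNT_LINE = re.compile(r"^(\d+)\s+(node|resource)s?\s+configured\b", re.IGNORECASE)


def parse_configured_counts(lines):
    """Extract node and resource counts from crm_mon summary lines."""
    counts = {"num_nodes": None, "num_resources": None}
    for line in lines:
        match = COUNT_LINE.match(line.strip())
        if match:
            key = "num_%ss" % match.group(2).lower()
            counts[key] = int(match.group(1))
    return counts


counts = parse_configured_counts([
    "Stack: corosync",
    "3 nodes configured",
    "1 resource configured",
    "Full list of resources:",
])
```

Because the trailing `s` is optional, a cluster with a single node or a single resource is parsed the same way as a larger one.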

andreas-doehler · October 14, 2020, 5:36am

That is strange, as “num_resources” should not be “None”. It means the parsing function did not find any resources in your output at inventory time. Is that what your extra line is for?

If you look at the source code of this check, you will see that it is very old.
Your code change should also be extended to the node detection, as your system also did not detect the three nodes; in your case that value is also “None”.

It could help if the crm_mon status output were not unstructured text. I don’t know how good the XML output looks, but it should be easier to parse.
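
To illustrate why structured output would be easier to handle, here is a rough Python sketch. Everything about the XML is an assumption: the sample document, its element names and its attribute names are made up for illustration and must be checked against what the installed Pacemaker version really emits. `parse_status` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: illustrative document only, not real crm_mon output.
# Element and attribute names must be verified against the actual XML.
SAMPLE_XML = """\
<crm_mon version="1.1.21">
  <summary>
    <nodes_configured number="3"/>
    <resources_configured number="1" disabled="1"/>
  </summary>
  <resources>
    <resource id="ClusterIP" resource_agent="ocf::heartbeat:IPaddr2" role="Stopped" active="false"/>
  </resources>
</crm_mon>
"""


def parse_status(xml_text):
    """Read counts and per-resource roles from the assumed XML layout."""
    root = ET.fromstring(xml_text)

    def count(tag):
        element = root.find("./summary/" + tag)
        return int(element.get("number")) if element is not None else None

    resources = {}
    for res in root.iter("resource"):
        # Child <node> elements (assumed) list where the resource is running.
        nodes = [node.get("name") for node in res.findall("node")]
        resources[res.get("id")] = (res.get("role"), nodes)

    return {
        "num_nodes": count("nodes_configured"),
        "num_resources": count("resources_configured"),
        "resources": resources,
    }


status = parse_status(SAMPLE_XML)
```

With attributes instead of free text, there is no singular or plural wording to match, which is the kind of problem the text parser runs into.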

keylane_sbaas · October 14, 2020, 2:02pm

The problem was that the output said “1 resource” and not “1 resources”. See the change I made under Solution.

If the check is that old, I can understand that it can be improved.
Do you have a link with examples of what is easier to parse? Maybe I can give it a try and upload it to GitHub.

I will also edit the previous post, in case you want to reproduce anything yourself.

andreas-doehler · October 14, 2020, 2:45pm

I can have a look at the “crm_mon” XML output tomorrow.
I will let you know if this looks promising.

keylane_sbaas · October 21, 2020, 1:43pm

Hi Andreas,

How did it go with the XML output?

I was checking the code again and began searching for “num_resources” and “inventory_heartbeat_crm_resources”, because this is what is used to create the dictionary with all the resource(s) in it. The weird thing is that this is the only code that does something with the status of the resource:

def inventory_heartbeat_crm_resources(info):
    # Full list of resources:
    # Resource Group: group_slapmaster
    #     resource_virtip1 (ocf:IPaddr): Started mwp
    #     resource_virtip2 (ocf:IPaddr): Started mwp
    #     resource_pingnodes (ocf:pingd): Started mwp
    #     resource_slapmaster (ocf:OpenLDAP): Started mwp
    #     resource_slapslave (ocf:OpenLDAP): Started smwp
    inventory = []
    settings = host_extra_conf_merged(host_name(), inventory_heartbeat_crm_rules)
    for name, resources in heartbeat_crm_parse_resources(info).iteritems():
        # In naildown mode only resources which are started somewhere can be
        # inventorized
        if settings.get('naildown_resources', False) and resources[0][2] != 'Stopped':
            inventory.append((name, '"%s"' % resources[0][3]))
        else:
            inventory.append((name, None))
    return inventory

system · October 21, 2021, 1:43pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.