CRM resources monitoring gives a different result than expected

CRM monitoring gives a different result than expected: it doesn’t go critical. According to the documentation, “The check will report a CRITICAL state when the reported state is not Started. In addition the check can report a problem if a resource is not handled by a specified node.”

Tested with the following

  • CheckMK Enterprise 1.6.0p13
  • RHEL 7.8
  • Pacemaker 1.1.21-4 & corosync 2.4.5-4
  • No rules defined for “Heartbeat CRM resource status” and “Heartbeat CRM general status”

Testing of CRM code

So I created a cluster of 3 VMs and gave the cluster 1 resource.

  • Killing the pacemaker process doesn’t work. It only results in no information for CheckMK.
  • Testing with parameters doesn’t change the status, but gives more information about the cluster (after the fix).
  • Disable the resource

pcs resource disable ClusterIP

This results in a resource that is not started.


The expectation was that the service check would go to warning or critical (based on the documentation).

Agent output

[root@cleint_vm02 ~]# pcs resource disable ClusterIP
[root@cleint_vm02 ~]# pcs resource
ClusterIP (ocf::heartbeat:IPaddr2): Stopped (disabled)
[root@cleint_vm02 ~]# TZ=UTC crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
Stack: corosync
Current DC: cleint_vm03 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Tue Oct 13 12:58:07 2020
Last change: Tue Oct 13 14:57:36 2020 by root via cibadmin on cleint_vm02
3 nodes configured
1 resource configured (1 DISABLED)
Online: [ cleint_vm01 cleint_vm02 devkladbgl03 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Stopped (disabled)
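For reference, the behaviour I expected from the documentation can be sketched like this. It is only a minimal illustration, not the actual Checkmk check code: the function name `check_resource` and the regular expression are hypothetical, and it only handles simple primitive resource lines like the one above.

```python
import re

# Hypothetical pattern for a primitive resource line such as
# "ClusterIP (ocf::heartbeat:IPaddr2): Stopped (disabled)"
RESOURCE_LINE = re.compile(
    r"^(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>\S+)(?:\s+(?P<rest>.*))?$"
)


def check_resource(line, expected_node=None):
    """Return (status, text) with 0 = OK, 2 = CRITICAL, 3 = UNKNOWN."""
    match = RESOURCE_LINE.match(line.strip())
    if match is None:
        return 3, "Cannot parse line: %r" % line
    name = match.group("name")
    state = match.group("state")
    rest = match.group("rest") or ""
    # Documented behaviour: anything other than "Started" is CRITICAL
    if state != "Started":
        return 2, "%s is %s" % (name, (state + " " + rest).strip())
    # Optional: the resource must be handled by a specific node
    if expected_node is not None and rest != expected_node:
        return 2, "%s runs on %s, expected %s" % (name, rest, expected_node)
    return 0, "%s is Started on %s" % (name, rest)


print(check_resource("ClusterIP (ocf::heartbeat:IPaddr2): Stopped (disabled)"))
```

With the disabled resource from the agent output, this sketch returns status 2, which is what I expected the service to show.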

Side note: improvements to the code

I can also add all the changes as a merge request to https://github.com/tribe29/checkmk if wanted

Problem

The error for Heartbeat CRM general status is

Invalid parameter {'max_age': 60, 'num_resources': None, 'num_nodes': None}: %d format: a number is required, not NoneType

CRM resource status doesn’t give a weird message.
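The message looks like a plain Python formatting error: a `None` value ends up in a `%d` placeholder. The snippet below is only a small reproduction with made-up function names (`format_counts`, `format_counts_safe`), not the code of the check itself. It shows how the error comes about and one way to guard against it.

```python
def format_counts(num_nodes, num_resources):
    # Unguarded formatting: raises TypeError when a value is None
    return "Nodes: %d, Resources: %d" % (num_nodes, num_resources)


def format_counts_safe(num_nodes, num_resources):
    # Guarded formatting: a missing count is shown as "unknown"
    parts = []
    for label, value in (("Nodes", num_nodes), ("Resources", num_resources)):
        shown = "unknown" if value is None else "%d" % value
        parts.append("%s: %s" % (label, shown))
    return ", ".join(parts)


try:
    format_counts(None, None)
except TypeError as exc:
    print("TypeError:", exc)

print(format_counts_safe(3, None))
```

So the real problem is not the formatting itself, but that the counts were never found in the agent output.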

Solution

Add an extra line of code:

elif ' '.join(line[1:3]).rstrip('.,').lower() == 'resource configured':

This is needed because my cluster has 1 resource, and the code can’t handle the difference between “resource” and “resources”. Here is the agent output on my test setup:

[root@client_vm check_mk_agent]# TZ=UTC crm_mon -1 -r | grep -v ^$ | sed 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g'
Stack: corosync
Current DC: client_vm03 (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
Last updated: Tue Oct 13 09:25:18 2020
Last change: Thu May 21 00:02:18 2020 by root via cibadmin on client_vm01
3 nodes configured
1 resource configured
Online: [ client_vm01 client_vm02 client_vm03 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started client_vm01
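Instead of adding one comparison per spelling, the singular and plural forms could be handled in one place. The following is a rough sketch of that idea and not the existing check code: `parse_counts` and `COUNT_LINE` are names I made up, and the key names are simply taken from the error message.

```python
import re

# Matches "3 nodes configured", "1 node configured",
# "1 resource configured (1 DISABLED)", "2 resources configured."
COUNT_LINE = re.compile(r"^(\d+)\s+(node|resource)s?\s+configured\b", re.IGNORECASE)


def parse_counts(lines):
    """Extract the node and resource counts from crm_mon summary lines."""
    counts = {"num_nodes": None, "num_resources": None}
    for line in lines:
        match = COUNT_LINE.match(line.strip())
        if match:
            # "node" -> "num_nodes", "resource" -> "num_resources"
            key = "num_%ss" % match.group(2).lower()
            counts[key] = int(match.group(1))
    return counts


output = [
    "Stack: corosync",
    "3 nodes configured",
    "1 resource configured",
    "Full list of resources:",
]
print(parse_counts(output))
```

The optional `s` in the pattern covers both “resource” and “resources”, and the same rule applies to nodes.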

That is strange, as “num_resources” should not be “None”. It means the parsing function at inventory time has not found any resources in your output. That is what your extra line is for, right?

If you look at the source code of this check, you will see that it is very old :slight_smile:
Your code change should also be extended to the node detection, as your system also doesn’t detect the three nodes; in your case that value is also “None”.

It could help if the crm_mon status output were not unstructured text. I don’t know how good the XML output looks, but it should be easier to parse.
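To give an idea of what parsing XML could look like, here is a small sketch using only the Python standard library. Please treat everything about the XML layout as an assumption: the sample document is hand-written, not captured from a real cluster, and the element and attribute names (`nodes_configured`, `resources_configured`, `resource`, `role`) would have to be verified against the real XML output of crm_mon on your Pacemaker version. The function name `parse_crm_mon_xml` is made up as well.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: hand-written sample, element/attribute names are not verified
SAMPLE_XML = """\
<crm_mon version="1.1.21">
  <summary>
    <nodes_configured number="3"/>
    <resources_configured number="1"/>
  </summary>
  <resources>
    <resource id="ClusterIP" resource_agent="ocf::heartbeat:IPaddr2" role="Stopped" active="false"/>
  </resources>
</crm_mon>
"""


def parse_crm_mon_xml(text):
    """Return counts and resource roles from an (assumed) crm_mon XML document."""
    root = ET.fromstring(text)

    def count(tag):
        # Read the "number" attribute of a summary element, None if missing
        element = root.find("./summary/%s" % tag)
        number = element.get("number") if element is not None else None
        return int(number) if number is not None and number.isdigit() else None

    return {
        "num_nodes": count("nodes_configured"),
        "num_resources": count("resources_configured"),
        "resources": {r.get("id"): r.get("role") for r in root.iter("resource")},
    }


print(parse_crm_mon_xml(SAMPLE_XML))
```

The point is that counts come from attributes instead of from wording, so singular versus plural no longer matters.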

The problem was that we had “1 resource” and not “1 resources”. See the change that I made under Solution.

If it is that old, I can understand that improvements can be applied :smile:
Do you have a link with examples of what is easier to parse? Maybe I can give it a try and upload it to GitHub.

And I will edit the previous post, in case you want to reproduce anything yourself.

I can have a look at the “crm_mon” XML output tomorrow.
I will let you know if this looks promising. :wink: