Pacemaker pcs v2.1.6-9 on RHEL9 breaks check_mk-heartbeat_crm

lgmills · November 15, 2023, 11:39pm

My AlmaLinux 9 systems just updated from version 9.2 to 9.3, and all of the pacemaker-related packages received a small incremental update. The status output of the “pcs” command used by the Linux agent has changed slightly in format, enough to break the heartbeat_crm plugin.

With version 2.1.6-9 of the pcs command, the plugin reports “UNKN” and crashes, and some, but not all, of the resources show an invalid status.

This is on Raw Edition 2.1.0p20; I filed crash report e1be3a72-840c-11ee-b0d4-94f1289ef228 on 11/15/2023.

Sample status output from pcs version 2.1.5-9:

[root@server1 tmp]# pcs status
Cluster name: server-cluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-11-15 14:33:29 -06:00)
Cluster Summary:
  * Stack: corosync
  * Current DC: server2. (version 2.1.5-9.el9_2.3.alma.1-a3f44794f94) - partition with quorum
  * Last updated: Wed Nov 15 14:33:29 2023
  * Last change:  Mon Nov 13 19:51:06 2023 by root via crm_attribute on server2.fnal.gov
  * 2 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ server1. server2. ]

Full List of Resources:
  * fence_server1        (stonith:fence_ipmilan):         Started server2.
  * fence_server2        (stonith:fence_ipmilan):         Started server1.
  * Clone Set: pgsql-clone [pgsql] (promotable):
    * Promoted: [ server2. ]
    * Unpromoted: [ server1. ]
  * pgsql-ha-vip        (ocf:heartbeat:IPaddr2):         Started server2.

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Sample status output from pcs version 2.1.6-9:

[root@server1 tmp]# pcs status
Cluster name: server-cluster
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: server2. (version 2.1.6-9.el9-6fdc9deea29) - partition with quorum
  * Last updated: Wed Nov 15 14:40:40 2023 on server1.
  * Last change:  Wed Nov 15 12:50:58 2023 by root via cibadmin on server1.
  * 2 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ server1. server2. ]

Full List of Resources:
  * fence_server1  (stonith:fence_ipmilan):         Started server2.
  * fence_server2  (stonith:fence_ipmilan):         Started server1.
  * Clone Set: pgsql-clone [pgsql] (promotable):
    * Promoted: [ server1. ]
    * Unpromoted: [ server2. ]
  * pgsql-ha-vip        (ocf:heartbeat:IPaddr2):         Started server1.

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

khoffmann · November 21, 2023, 1:27pm

We encountered the same problem today with Checkmk version 2.2.0p12 after upgrading pacemaker to 2.1.6-8 (upgrade from RHEL 8.8 to 8.9).

Evidently the format of the “Last updated” line has changed: it now appends the hostname after the timestamp.

I filed crash report 857ed58c-886c-11ee-974c-e4434bb38c24 for that.
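
To make the “Last updated” change concrete, here is a minimal shell sketch that tells the two formats apart. The function name has_host_suffix and the regular expression are illustrative assumptions, not anything the agent or plugin uses; the two sample lines are copied from the outputs posted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Checkmk agent): succeeds when a
# "Last updated" line ends with the new " on <host>" suffix.
has_host_suffix() {
    printf '%s\n' "$1" | grep -Eq 'Last updated: .* [0-9]{4} on [^ ]+$'
}

# Sample lines taken from the two pcs status outputs in this thread.
old_line='  * Last updated: Wed Nov 15 14:33:29 2023'
new_line='  * Last updated: Wed Nov 15 14:40:40 2023 on server1.'

for line in "$old_line" "$new_line"; do
    if has_host_suffix "$line"; then
        echo "new format:$line"
    else
        echo "old format:$line"
    fi
done
```

Running the same grep against the output of crm_mon -1 on a cluster node should show which format that node produces.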

lgmills · November 21, 2023, 3:41pm

I hope this is eventually addressed by an updated plugin. Until then, I’ve made the following not-too-ugly change to the Linux agent script so that the crm_mon output looks like the older output and the plugin works again. The change strips the server name from the “Last updated” line and rewrites the unrecognized “Promoted/Unpromoted” syntax back to “Masters/Slaves”.

 TZ=UTC crm_mon -1 -r | grep -v ^$ | sed \
         -e 's/^ //; /^\sResource Group:/,$ s/^\s//; s/^\s/_/g' \
         -e '/Last updated/s/ on.*$//' \
         -e 's/Clone Set/Master\/Slave Set/' \
         -e 's/Promoted/Masters/' \
         -e 's/Unpromoted/Slaves/'
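
The compatibility rewrites can be tried without a cluster. The sketch below applies only the four added sed expressions (not the agent's original whitespace handling in the first -e) to lines copied from the 2.1.6-9 output above. The function name normalize_crm_output is a hypothetical wrapper introduced for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: only the compatibility rewrites from the workaround,
# without the agent's original whitespace/underscore handling.
normalize_crm_output() {
    sed \
        -e '/Last updated/s/ on.*$//' \
        -e 's/Clone Set/Master\/Slave Set/' \
        -e 's/Promoted/Masters/' \
        -e 's/Unpromoted/Slaves/'
}

# Sample lines copied from the 2.1.6-9 output earlier in the thread.
sample='  * Last updated: Wed Nov 15 14:40:40 2023 on server1.
  * Last change:  Wed Nov 15 12:50:58 2023 by root via cibadmin on server1.
  * Clone Set: pgsql-clone [pgsql] (promotable):
    * Promoted: [ server1. ]
    * Unpromoted: [ server2. ]'

normalized=$(printf '%s\n' "$sample" | normalize_crm_output)
printf '%s\n' "$normalized"
```

On this sample, the “Last updated” line loses its “ on server1.” suffix while “Last change” is left alone, because the first expression is restricted to lines matching “Last updated”. Since sed matching is case-sensitive, the “Promoted” rule does not touch “Unpromoted” or “(promotable)”. Note that the “Clone Set” rule rewrites every clone set header, promotable or not, so it is worth checking the result on clusters that also have ordinary clones.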

joerg.herbel · December 5, 2023, 5:50am

Hi, thanks for reporting this. The crash has been solved:

Best,
Jörg