Call for Redfish beta testers

New version with fixes for RAID controller and drive detection.
iLO6 should also work now without problems.
This version is also multi-system capable, as needed for blade centers.

Attention: after installing, a rediscovery of your hosts is needed. Some checks that had no item before now have one (multi-system support), and the naming of drives was changed to support systems where all drives have the same name :frowning:

2.1 → Check_MK-Things/check plugins 2.1/redfish/redfish-2.1.29.mkp at master · Yogibaer75/Check_MK-Things · GitHub
2.2 → https://github.com/Yogibaer75/Check_MK-Things/blob/master/check%20plugins%202.2/redfish/redfish-2.2.29.mkp
2.3 daily build from 09.02.2024 → Check_MK-Things/check plugins 2.3/redfish/redfish-2.3.29.mkp at master · Yogibaer75/Check_MK-Things · GitHub
2.3 testers, keep in mind that the package may be broken on any build other than the one from 09.02.

The 2.2 and 2.3 Exchange downloads will follow in the next few days, I think.

Thanks to @khoehn & @scotsie for providing data for testing and troubleshooting.

If someone wants to provide data for further testing, please use the GitHub - DMTF/Redfish-Mockup-Creator: A Python3 program that creates a Redfish Mockup folder structure from a real live Redfish service.
to generate a zip/tar file of your management interface.
You don't need to post the file here. Please send it to me as a PM.
Be aware that some Dell interfaces produce very large results.
Some folders like “TelemetryService” can then be removed from the archive.

Also, for troubleshooting I normally need a dump made with the mockup creator.
Troubleshooting with only screenshots and a few lines of text is nearly impossible with such complex data structures.


Now running the redfish 2.2.29

On all redfish hosts:

the rule:

Is the redfish package a Checkmk agent, and could it follow the versions of the Checkmk releases? Or is the redfish package an agent plugin that should report an “agent plugin version” instead?

The 2.0 makes no sense for a mkp package with version 2.2.29, for sure.

yes

no

It makes sense as the agent is the same for all versions 2.0 / 2.1 / 2.2
2.2.29 is the package with the checks for 2.2 / 2.1.29 is for 2.1 and so on

It is possible that the naming changes after integration into the core CMK, but from my side i don’t know what i should change. What you check with your rule is the installed agent version. The agent version of the Redfish plugin has nothing to do with the agent version of the “normal” CMK agent i would say.

Best solution would be limiting the shown rule to hosts with Linux or Windows label. This label will be automatically set if agent is installed.

Today, new version with only a small fix for offline interfaces. These interfaces are now ignored at discovery time. This was submitted by tfoks · GitHub

2.1 - Check_MK-Things/check plugins 2.1/redfish/redfish-2.1.30.mkp at master · Yogibaer75/Check_MK-Things · GitHub
2.2 - Check_MK-Things/check plugins 2.2/redfish/redfish-2.2.30.mkp at master · Yogibaer75/Check_MK-Things · GitHub
2.3 - build 18.02.2024 - Check_MK-Things/check plugins 2.3/redfish/redfish-2.3.30.mkp at master · Yogibaer75/Check_MK-Things · GitHub

I think there is too much work at the CMK team in Munich to review the submitted exchange packages at the moment :wink:


Yep, sorry… Thanks Andreas!

@andreas-doehler first I also want to say thanks for bringing this to the community!!

We are currently trying to use it to monitor a Cisco blade system which has no fans.
And, probably for all devices without fans, this results in a crash in the function discovery_redfish_fans.

def discovery_redfish_fans(section) -> DiscoveryResult:
    """Discover single fans"""
    for key in section.keys():
        fans = section[key].get("Fans", None)
        for fan in fans:                                            # here it crashes (None is not iterable)

IMHO (but of course implementation is up to you :slight_smile: ) both functions (discovery and check) should return an empty list when trying to get the fans:

def discovery_redfish_fans(section) -> DiscoveryResult:
    """Discover single fans"""
    for key in section.keys():
        fans = section[key].get("Fans", [])

        for fan in fans:
            if fan.get("Status", {}).get("State") == "Absent":
                continue
            fan_name = _fan_item_name(fan)
            if fan_name:
                yield Service(item=fan_name)


def check_redfish_fans(item: str, section) -> CheckResult:
    """Check single fan state"""
    fan = None
    for key in section.keys():
        fans = section[key].get("Fans", [])

        for fan_data in fans:
            fan_name = _fan_item_name(fan_data)
            if fan_name == item:
                fan = fan_data
                break

.....
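
A minimal, self-contained sketch of the idea, using made-up section data. The Checkmk `Service` class is left out and plain names are yielded instead; `fan_names` and the use of the `Name` field are hypothetical stand-ins for the plugin's own `_fan_item_name` logic. One thing to keep in mind: `.get("Fans", [])` only helps when the key is missing. If an interface ever reports `"Fans": null` (an assumption, not something seen here), `or []` covers that case too.

```python
def fan_names(section):
    """Yield the names of all present fans, tolerating chassis without fans."""
    for data in section.values():
        # "or []" covers both a missing "Fans" key and an explicit null value
        fans = data.get("Fans") or []
        for fan in fans:
            if fan.get("Status", {}).get("State") == "Absent":
                continue
            name = fan.get("Name")
            if name:
                yield name


# Made-up example data: one chassis with fans, one without, one with null
section = {
    "chassis-1": {"Fans": [
        {"Name": "Fan 1", "Status": {"State": "Enabled"}},
        {"Name": "Fan 2", "Status": {"State": "Absent"}},
    ]},
    "blade-1": {},
    "blade-2": {"Fans": None},
}

print(list(fan_names(section)))
```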

Maybe you can fix this in your next release.

And again… thanks a lot!!
Christian


Agree. Then the CheckMK rule shouldn’t match the non-“normal” CMK agents imho.

Might be the best option for now. I doubt CMK will fix anything.

Which, surprise, can’t be in a single rule :frowning:

Like i said please use the system labels for Linux and Windows for your rule.

In 2.3, you can combine host labels with, e.g., an OR condition. In the meantime, you will probably need two rules.

Got an iDrac 9 setup for redfish, and did the steps needed on the site. Inventoried and it went well with one exception. The server I’m working on is a Dell R760, two drives in the flex bay, twelve drives on the front. None of the physical drives are seen.

redfish_storage   AHCI.Embedded.1-1     Storage Controller AHCI.Embedded.1-1         
redfish_storage   AHCI.Embedded.2-1     Storage Controller AHCI.Embedded.2-1         
redfish_storage   RAID.Slot.3-1         Storage Controller RAID.Slot.3-1

Those are the only storage related checktypes I’m seeing. I did set the redfish user (on the iDrac) to read-only, operator and administrator so it doesn’t seem to be permission related at the host side.

Ran the agent_redfish to get just the PhysicalDrives but I don’t see any info about the physical drives in what is returned.

Is this expected at this point in the development of redfish checks or is there a config issue somewhere in my setup?

It is not expected - first some questions

  • Which version of the mkp did you use?
  • To debug devices that are not found I need the real Redfish data of the device, as described here

redfish-2.2.30.mkp
check-mk-raw-2.2.0p22-el8-38.x86_64.rpm
Rocky Linux 8.9

Working on getting the mockup data now. What part of that mockup can be removed besides the TelemetryService? Particularly interested in sanitizing the data for any security related data.

Got the mockup built, saved it to a tar.gz. The PM feature won’t let me attach it here. Should I change the extension to spoof it as a jpg or something?

You can also send me a link to the file via PM.
Here is an example listing of the folder, with the folders that are not needed marked.

drwxr-xr-x   7 root root  4096 Dec 11 10:27  AccountService     <--not needed
drwxr-xr-x   3 root root  4096 Dec 11 10:27  CertificateService    <-- not needed
drwxr-xr-x   5 root root  4096 Dec 11 10:27  Chassis
drwxr-xr-x   2 root root  4096 Dec 11 10:27  ComponentIntegrity   <-- not needed
drwxr-xr-x   3 root root  4096 Dec 11 10:27  EventService  <-- not needed
drwxr-xr-x   3 root root  4096 Dec 11 10:27  Fabrics
drwxr-xr-x   3 root root  4096 Dec 11 10:27  JobService   <-- not needed
drwxr-xr-x 739 root root 36864 Dec 11 10:27  JsonSchemas   <-- not needed
drwxr-xr-x   3 root root  4096 Dec 11 10:27  Managers
drwxr-xr-x   3 root root  4096 Dec 11 10:27  SessionService
drwxr-xr-x   3 root root  4096 Dec 11 10:27  Systems
-rw-r--r--   1 root root  2465 Dec 11 10:27  index.json
drwxr-xr-x   2 root root  4096 Dec 11 10:27  odata

If you really want, you can also replace all the included serial numbers with placeholders, but I don’t know if that is work anyone wants to do.
A search and replace for hostnames / IPs is more realistic.
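
A rough sketch of how such a cleanup could be scripted, assuming the mockup consists of folders containing `index.json` files as in the listing above. The function name `sanitize_mockup`, the placeholder values and the simple IPv4 pattern are all assumptions made for illustration. The pattern will also hit anything that merely looks like an IPv4 address (some four-part firmware version strings, for example), so check the result before sending it.

```python
import re
import shutil
from pathlib import Path

# Folders marked "not needed" in the listing above, plus TelemetryService
NOT_NEEDED = {
    "AccountService", "CertificateService", "ComponentIntegrity",
    "EventService", "JobService", "JsonSchemas", "TelemetryService",
}

# Deliberately simple IPv4 pattern; it does not validate the octet range
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize_mockup(root, hostnames=()):
    """Remove unneeded folders and mask IPs/hostnames in all index.json files."""
    root = Path(root)
    # list() so the directory is not modified while it is being iterated
    for folder in list(root.iterdir()):
        if folder.is_dir() and folder.name in NOT_NEEDED:
            shutil.rmtree(folder)
    for json_file in root.rglob("index.json"):
        text = json_file.read_text(encoding="utf-8")
        text = IPV4.sub("192.0.2.1", text)  # documentation address as placeholder
        for name in hostnames:
            text = text.replace(name, "host.example.com")
        json_file.write_text(text, encoding="utf-8")
```

It would be called on a copy of the dump, for example `sanitize_mockup("/path/to/mockup/copy", hostnames=["idrac01.mycompany.local"])`, where both the path and the hostname are placeholders.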

Checked with your dump and i see all the hard drives.


Tested with 2.2.30 mkp.

What you can test is running the agent manually on the command line with --debug and -v.
Then you also see the time needed for every query, and whether a single query runs into a timeout.
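
If the debug output gets long, a small helper can pick out the slowest queries. This is only a sketch: it assumes the log lines have been saved to a file or pasted into a list, and that they use the `Response Time for GET to <path>: <seconds> seconds.` wording that the Redfish library prints in debug mode. `slowest_queries` is a made-up name, not part of the plugin.

```python
import re

# Matches e.g. "... Response Time for GET to /redfish/v1: 0.137 seconds."
RESPONSE = re.compile(r"Response Time for GET to (\S+): ([0-9.]+) seconds")


def slowest_queries(log_lines, top=5):
    """Return the (path, seconds) pairs with the longest response times."""
    timings = []
    for line in log_lines:
        match = RESPONSE.search(line)
        if match:
            timings.append((match.group(1), float(match.group(2))))
    return sorted(timings, key=lambda item: item[1], reverse=True)[:top]


sample = [
    "INFO 2024-03-01 12:48:59 redfish.rest.v1: Attempt 1 of /redfish/v1",
    "INFO 2024-03-01 12:48:59 redfish.rest.v1: Response Time for GET to /redfish/v1: 0.13773563038557768 seconds.",
    "INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Systems: 0.3174349255859852 seconds.",
    "INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Chassis/Enclosure.Internal.0-0:RAID.Slot.3-1: 0.06682244502007961 seconds.",
]

for path, seconds in slowest_queries(sample, top=2):
    print(f"{seconds:.3f}s  {path}")
```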

Just ran “cmk -vII <host>” and got:

  1 checkmk_agent
  4 redfish_ethernetinterfaces
 14 redfish_fans
  8 redfish_memory
  1 redfish_memory_summary
  2 redfish_networkadapters
  2 redfish_processors
  2 redfish_psu
  3 redfish_storage
  1 redfish_system
  4 redfish_temperatures
  8 redfish_voltage
SUCCESS - Found 50 services

Even though the data is in there I’m not getting the services inventoried for some reason. It looks like the debug is showing the physical drives when I run the agent_redfish. Tried redoing the inventory from command line and the web UI, no luck.

Ran the agent manually just against the physical drives. The individual drives in the system do show up there.

$  /omd/sites/cmk1009ping/local/share/check_mk/agents/special/agent_redfish --debug --verbose  -u username -s password -m PhysicalDrives ipaddr > output.txt
INFO 2024-03-01 12:48:59 root: running file /omd/sites/cmk1009ping/lib/python3/cmk/special_agents/utils/agent_common.py
INFO 2024-03-01 12:48:59 root: using Python interpreter v3.11.5.final.0 at /omd/sites/cmk1009ping/bin/python3
INFO 2024-03-01 12:48:59 redfish: Redfish API
INFO 2024-03-01 12:48:59 redfish.rest.v1: Attempt 1 of /redfish/v1
INFO 2024-03-01 12:48:59 redfish.rest.v1: Response Time for GET to /redfish/v1: 0.13773563038557768 seconds.
INFO 2024-03-01 12:48:59 redfish.rest.v1: Attempt 1 of /redfish/v1/SessionService/Sessions
INFO 2024-03-01 12:48:59 redfish.rest.v1: Response Time for GET to /redfish/v1/SessionService/Sessions: 0.039644286036491394 seconds.
INFO 2024-03-01 12:48:59 redfish.rest.v1: Attempt 1 of /redfish/v1
INFO 2024-03-01 12:48:59 redfish.rest.v1: Response Time for GET to /redfish/v1: 0.03846895322203636 seconds.
INFO 2024-03-01 12:48:59 redfish.rest.v1: Attempt 1 of /redfish/v1/Managers
INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Managers: 0.030635480768978596 seconds.
INFO 2024-03-01 12:49:00 redfish.rest.v1: Attempt 1 of /redfish/v1/Managers/iDRAC.Embedded.1
INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Managers/iDRAC.Embedded.1: 0.12559902109205723 seconds.
INFO 2024-03-01 12:49:00 redfish.rest.v1: Attempt 1 of /redfish/v1/Systems
INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Systems: 0.3174349255859852 seconds.
INFO 2024-03-01 12:49:00 redfish.rest.v1: Attempt 1 of /redfish/v1/Systems/System.Embedded.1
INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Systems/System.Embedded.1: 0.26061029825359583 seconds.
INFO 2024-03-01 12:49:00 redfish.rest.v1: Attempt 1 of /redfish/v1/Chassis
INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Chassis: 0.044996283017098904 seconds.
INFO 2024-03-01 12:49:00 redfish.rest.v1: Attempt 1 of /redfish/v1/Chassis/System.Embedded.1
INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Chassis/System.Embedded.1: 0.13854458834975958 seconds.
INFO 2024-03-01 12:49:00 redfish.rest.v1: Attempt 1 of /redfish/v1/Chassis/Enclosure.Internal.0-0:RAID.Slot.3-1
INFO 2024-03-01 12:49:00 redfish.rest.v1: Response Time for GET to /redfish/v1/Chassis/Enclosure.Internal.0-0:RAID.Slot.3-1: 0.06682244502007961 seconds.
INFO 2024-03-01 12:49:00 redfish.rest.v1: Attempt 1 of /redfish/v1/Chassis/Enclosure.Internal.0-1:RAID.Slot.3-1
INFO 2024-03-01 12:49:01 redfish.rest.v1: Response Time for GET to /redfish/v1/Chassis/Enclosure.Internal.0-1:RAID.Slot.3-1: 0.07414923142641783 seconds.

I can share the output via dropbox if you think it’ll be useful. Going to dig in and work up some command line tools to further explore this stuff next week.

One idea: have you selected anything in the agent configuration for what it should fetch?
In my test I didn’t select anything, which should lead to fetching all the available data.
But in a production environment this can lead to a longer runtime.

I did have that box checked and then all the items were also checked. So got it all unchecked and now the drives show up as expected. I had thought the rule wouldn’t check for any services if nothing was selected. It’s not exactly clear that those check boxes are overriding the default check for everything.

Ok so I need to test if there is a selection problem with the rule.