Call for Redfish beta testers

Hey!
We are currently planning to support monitoring via Redfish. Redfish is an interface for the remote maintenance of servers.
Most new management boards come with Redfish support (e.g. Dell iDRAC8, HPE iLO4, Lenovo XClarity, Supermicro X10, Cisco IMC) and monitoring via Redfish has become a great alternative to monitoring via IPMI/SNMP here.

Our plan is to adopt the generic Redfish Special Agent from @andreas-doehler into Checkmk (Checkmk Exchange). Andreas already uses it for the data centers that he monitors and it works very well there so far.
The goal would be for us to be able to monitor all servers that speak Redfish with one special agent. The nice thing is that you then get the same services for all servers and can use the same rules everywhere.

We are now looking for more people to test the monitoring via Redfish in the field and let us know whether everything works as expected. Any insights are helpful (e.g. sensors, disks etc. missing).


Recommendation: Do this on a test site!
To get started with the tests, all you need to do is

  • install the MKP (Checkmk Exchange)
  • install the redfish python package as the SITE user: pip3 install 'urllib3<2' redfish
  • activate Redfish on the servers, if not already active
  • create a host for each management board you want to monitor via Redfish and configure the special agent rule “Redfish Compatible Management Controller” for that host
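To double-check the pip step above, here is a rough sketch (not an official tool) that reports whether a Python module is loaded from the site's own ~/local/lib/python3 folder or from somewhere else, for example a system-wide install done as root. The function name is_site_local and the example site home /omd/sites/mysite are made up for illustration, and the check assumes that the import path is not rewritten through symlinks.

```python
import importlib.util
import os
from pathlib import PurePosixPath


def is_site_local(module_path: str, site_home: str) -> bool:
    """Return True if module_path lies below <site_home>/local/lib/python3."""
    local_lib = PurePosixPath(site_home) / "local" / "lib" / "python3"
    return local_lib in PurePosixPath(module_path).parents


# Run this as the site user: look up where "redfish" would be imported from.
spec = importlib.util.find_spec("redfish")
if spec is None or spec.origin is None:
    print("redfish is not importable for this user")
else:
    home = os.path.expanduser("~")
    where = "site-local" if is_site_local(spec.origin, home) else "NOT site-local"
    print(f"redfish found at {spec.origin} ({where})")
```

If the result is "NOT site-local", the package was most likely installed for a different user.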

Note: Testing this on the Checkmk appliance is currently not possible due to the missing package.


Minimum requirements:

  • HPE iLO5 (iLO4 only with newest firmware due to performance issues)
  • Dell iDRAC v9 (v8 works, but is too slow)
  • Cisco CIMC (currently no insights on minimum version requirements)
  • Supermicro BMC (currently no insights on minimum version requirements)
  • Nutanix (currently no insights on minimum version requirements)
  • Lenovo (currently no insights on minimum version requirements)

The monitoring is built for management boards. Any other device with a Redfish interface is not supported.


Testing the Redfish agent and reporting a problem

If a connection problem exists:

  • on the command line, execute the agent with the switches “--debug” and “-vv” to get maximum output. The real problem may already show up here (credential problem, slow connection or generic problem)

If only expected data is missing, like hard drives, memory modules and so on:

  • on the command line, execute the agent with the switches “--debug” and “-vv” to get maximum output.
  • Inspect the sections in the output to see whether the missing section is there or not
  • If the section exists but is not shown as checks, I need this section output to take a look
  • If the section is missing, I need the complete agent output - the sections “redfish_system” and “redfish_chassis” are the minimum I need
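For the "inspect the sections" step, a small sketch might help. It assumes the usual Checkmk agent output format, where each section starts with a header line such as <<<name>>> or <<<name:sep(0)>>>. The helper names list_sections and missing_sections and the sample text (including the section name redfish_memory) are made up for illustration; real output comes from running the agent yourself.

```python
import re

# Section headers look like <<<name>>> or <<<name:option(...)>>>.
SECTION_HEADER = re.compile(r"<<<([A-Za-z0-9_]+)(?::[^>]*)?>>>")


def list_sections(agent_output: str) -> list:
    """Return the section names in order of first appearance."""
    names = []
    for line in agent_output.splitlines():
        match = SECTION_HEADER.fullmatch(line.strip())
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


def missing_sections(agent_output, required=("redfish_system", "redfish_chassis")):
    """Return the required section names that do not occur in the output."""
    present = set(list_sections(agent_output))
    return [name for name in required if name not in present]


# Made-up sample, only to show the call.
sample = """<<<redfish_system:sep(0)>>>
{"Id": "1"}
<<<redfish_memory:sep(0)>>>
{"Id": "DIMM1"}
"""
print(list_sections(sample))
print(missing_sections(sample))
```

Save the agent output to a file, read it into a string and pass it to missing_sections to see at a glance whether the two minimum sections are there.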

If no usable output is generated, it is possible to create a dump of the complete Redfish interface.
There is a small tool for this.

The output can be compressed into one archive, and Andreas can check whether the needed data is in what the interface provides.


Happy monitoring and thanks to Andreas for his great contribution and free service to the community here. We would like to merge this once we have received sufficient feedback from users in the field.
Cheers, Martin

Hello,

We installed it today as the site user on a satellite in a distributed environment and ran into errors.
It was not possible to activate new changes on the affected site, or even to create new hosts.

Then we uninstalled redfish, urllib3, boto3 and botocore, which we had installed and upgraded earlier.

Installed version was:
redfish 3.2.1
urllib3 2.0.7
boto3 1.29.0
botocore 1.32.0

We needed to run omd reload twice to “fix” this error.

We are running Checkmk 2.2.0p8 on Ubuntu 22.04.3.

If more information is needed, feel free to ask for it.

Is it possible that you installed something as the root user rather than the site user?

Sadly true, and so far we have not uninstalled the packages that were installed as the root user.
We only uninstalled the packages for the site user, not for the root user.

I guess it is recommended to uninstall them as the root user as well?

Apologies. I added some notes (e.g. try first on a test site; install as site user).

Can you share the error messages, which you got?

Sure, is there a log where I can check this?

Maybe we made additional mistakes:

First, we installed as the root user:
pip3 install 'urllib3<2' redfish

We wanted to monitor a Raritan PX4 device and got the error “ModuleNotFoundError: No module named 'redfish'”

Then we switched to the site user and ran the following pip3 commands:
pip3 install redfish
pip3 install 'urllib3<2'
pip3 install redfish
pip3 uninstall 'urllib3<2'
pip3 install boto3/boto
pip3 install boto3
pip3 list
pip3 list | grep urllib
pip3 list | grep boto
pip3 install redfish
pip3 install boto3
pip3 install boto3 --upgrade
pip3 list | grep boto
pip3 list | grep urllib
pip3 install 'urllib3<2'
pip3 install urllib3<2
pip3 install 'urllib3<2' redfish
pip3 install 'urllib3<2' redfish --upgrade
pip3 uninstall boto3
pip3 uninstall botocore
pip3 uninstall urlib3
pip3 uninstall urllib3
pip3 uninstall redfish

The following error occurred after “pip3 install redfish”
DEPRECATION: jsonpath-rw is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
Running setup.py install for jsonpath-rw … done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
botocore 1.20.102 requires urllib3<1.27,>=1.25.4, but you have urllib3 2.1.0 which is incompatible.

This is the problem.
@martin.hirschvogel gave the correct pip3 install syntax to avoid this incompatible version of urllib3.
To fix it, you can run “pip3 install 'urllib3<2' --upgrade”.
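To make the version conflict a bit more tangible, here is a minimal sketch that checks a urllib3 version string against the range quoted in the pip error (urllib3<1.27,>=1.25.4) and against the pin from the install command (urllib3<2). The function names are made up, and the parser only handles plain numeric versions such as 1.26.18, not pre-releases or other suffixes.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.26.18' into (1, 26, 18); only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def ok_for_botocore(urllib3_version: str) -> bool:
    """Range quoted in the pip error above: urllib3<1.27,>=1.25.4."""
    return (1, 25, 4) <= parse_version(urllib3_version) < (1, 27)


def ok_for_pin(urllib3_version: str) -> bool:
    """The pin used in the install command: urllib3<2."""
    return parse_version(urllib3_version) < (2,)


# Versions that appear in this thread.
for candidate in ("1.26.18", "2.0.7", "2.1.0"):
    print(candidate, ok_for_botocore(candidate), ok_for_pin(candidate))
```

Of the three versions mentioned in this thread, only 1.26.18 passes both checks.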

I updated my last post.

We ran this command: “pip3 install 'urllib3<2' redfish --upgrade”

Does this cover your advice? Or does --upgrade only apply to redfish, but not to 'urllib3<2'?

The “--upgrade” is only needed if other/newer versions of the libs are already installed. In that case it is more of a downgrade.
If you get an error with botocore like the one in your output, then there are too many libs installed.
If you take a look at the folder “~/local/lib/python3/”, you should see only the following entries, provided nothing other than the Redfish bindings is installed.

Nov 15 21:21 certifi-2023.7.22.dist-info/
Nov 15 21:21 certifi/
Nov 15 21:21 charset_normalizer-3.3.2.dist-info/
Nov 15 21:21 charset_normalizer/
Nov 15 21:21 decorator-5.1.1.dist-info/
Nov 15 21:21 decorator.py
Nov 15 21:21 idna-3.4.dist-info/
Nov 15 21:21 idna/
Nov 15 21:21 jsonpatch-1.33.dist-info/
Nov 15 21:21 jsonpatch.py
Nov 15 21:21 jsonpath_rw-1.4.0-py3.11.egg-info/
Nov 15 21:21 jsonpath_rw/
Nov 15 21:21 jsonpointer-2.4.dist-info/
Nov 15 21:21 jsonpointer.py
Nov 15 21:21 ply-3.11.dist-info/
Nov 15 21:21 ply/
Nov 15 21:21 redfish-3.2.1.dist-info/
Nov 15 21:21 redfish/
Nov 15 21:21 requests_toolbelt-1.0.0.dist-info/
Nov 15 21:21 requests_toolbelt/
Nov 15 21:21 requests_unixsocket-0.3.0.dist-info/
Nov 15 21:21 requests_unixsocket/
Nov 15 21:21 requests-2.31.0.dist-info/
Nov 15 21:21 requests/
Nov 15 21:21 six-1.16.0.dist-info/
Nov 15 21:21 six.py
Nov 15 21:21 urllib3-1.26.18.dist-info/
Nov 15 21:21 urllib3/

This is from a test I just ran.
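If you want to compare your own folder against the listing above, something like the following sketch could do it. The expected names are taken from that listing; the helper names base_name and unexpected_entries are made up, and entries starting with an underscore (for example __pycache__) are simply ignored as an assumption.

```python
import os

# Package names taken from the directory listing above.
EXPECTED = {
    "certifi", "charset_normalizer", "decorator", "idna", "jsonpatch",
    "jsonpath_rw", "jsonpointer", "ply", "redfish", "requests",
    "requests_toolbelt", "requests_unixsocket", "six", "urllib3",
}


def base_name(entry: str) -> str:
    """Reduce 'urllib3-1.26.18.dist-info/' or 'six.py' to the bare package name."""
    name = entry.rstrip("/")
    if name.endswith((".dist-info", ".egg-info")):
        return name.split("-", 1)[0]
    if name.endswith(".py"):
        return name[:-3]
    return name


def unexpected_entries(entries, expected=EXPECTED):
    """Return the sorted package names that are not in the expected set."""
    found = {base_name(e) for e in entries if not e.startswith("_")}
    return sorted(found - set(expected))


# Run as the site user; does nothing if the folder does not exist.
folder = os.path.expanduser("~/local/lib/python3")
if os.path.isdir(folder):
    print(unexpected_entries(os.listdir(folder)))
```

An empty list means the folder only contains what the Redfish bindings pull in. Anything else, such as boto3 or botocore, was installed separately.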

Hi,
that is great news!

For now - correct me if I’m wrong - we need to add an extra host for the management board to be monitored with this agent. It is not another protocol in the management board list (like SNMP and IPMI). Would it be possible to add it to this list? Alternatively, would it be possible to use the “Additional IPv4 addresses” field in the host properties to avoid adding extra hosts?

Cheers
Yvan

Yes, you need to create an extra host for mgmt boards. I added further instructions to make clear how to use it.

  • create a host for each management board you want to monitor via Redfish and configure the rule “Redfish Compatible Management Controller” for that host

While mgmt boards are quite closely connected to a server, they should be two different logical entities from a monitoring perspective.
We mixed this up when we added the option “Management board” in the “Host properties” (which was a customer request).
This has led to many problems down the road, because it leads to false assumptions in monitoring and to what we believe is incorrect alerting (we are having intense discussions internally and with partners about whether this way should be deprecated).
We therefore do the following internally and recommend it as a best practice:

I fully agree with what Martin writes here. The BMC of a server is a totally separate device and has nothing to do with the OS running on the hardware. The BMC is accessible even if the OS is not running. This way, even if the CPU has burned out, you can monitor the system and see what the issue is.
To manage that, we use the hostname and append _mgmt for the BMC host object.

regards

Michael

I can see your point. I just wanted to have fewer hosts to manage. If it is the recommended way of managing those systems, then I’m happy with that.

Thanks to @andreas-doehler for implementing this agent! It’s really helpful.

Cheers
Yvan

I can only support the opinion of @mike1098 and @martin.hirschvogel.
In my managed systems all management controllers are separate host objects.
Mostly it is something like “hostname-ilo” or “hostname-idrac”.

Can you give a few examples where you think this has led to problems, false assumptions and incorrect alerting?

The BMC only monitors the hardware/firmware of the affected server independently of the OS. Nevertheless, it is still the same system. If you create two hosts for this purpose, the relationship between the hosts may be missing. An example: A RAM module fails on the server. With current large RAM modules and depending on the server config, a large percentage of RAM can be lost. The remaining RAM may then fill up in the OS. Two alarms may then appear for different hosts. Depending on the size of a company and the number of monitored systems, the Ops team that handles the alarms and tickets for downstream departments will then create two tickets. One for the infra team (BMC alarm) and one for the sysadmins (OS alarm). This means that the connection between the two problems, one of which causes the other, is lost. (Yes, I know that’s a bit exaggerated)

You could create a relationship via business intelligence and corresponding rules, but the effort is much higher.

The key reason was mentioned by @mike1098

One example for a technical problem:
The BMC is not very powerful. A response via IPMI can take very long, e.g. >1min. The host represents monitoring of the OS via agent + the mgmt board via IPMI. The host is checked every minute, thus IPMI is queried every minute. Overlapping requests to IPMI → Boom. The typical resolution for any device that can’t respond in time (within 1min of a request…) is to decrease the frequency of polls. But you can’t uncouple the checking of agents per host (technically you can do everything, but that then again creates a lot of problems down the road)
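As a back-of-the-envelope illustration of the overlap, here is a simplified sketch. It assumes that a new query starts every check interval regardless of whether the previous one has finished, and it ignores timeouts and retries. The function name requests_in_flight and the example numbers are made up.

```python
import math


def requests_in_flight(response_seconds: float, interval_seconds: float) -> int:
    """Upper bound on simultaneous queries if a new one starts every interval."""
    return max(1, math.ceil(response_seconds / interval_seconds))


# A BMC that needs 90 seconds to answer, polled every 60 seconds vs. every 300 seconds.
print(requests_in_flight(90, 60))
print(requests_in_flight(90, 300))
```

In this simplified model, queries overlap as soon as the response time exceeds the check interval, which is why polling slow devices less often helps.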

Then there are inconsistency problems, which created several support requests already:
The Mgmt Board can and will have the same services as the OS, e.g. Uptime. There is a big difference between OS Uptime and Mgmt Board Uptime. For that reason, we prefix Mgmt Board Services which are configured via host properties with “Management Interface:”
However, the same BMC monitored independently, will not have the same prefix, which confuses users and the rules they created for monitoring.

As a side effect of this all, the technical debt generated by allowing two ways for monitoring one thing is substantial on our side (implementing DNS cache twice…).

This may help explain why we have very intense internal discussions regarding the mgmt board monitoring via host properties.

At least for IBM/Lenovo, I can report that the special agent runs in under 2 seconds when IPMI LAN 2.0 is used (approx. 1000 units) and in under 20 seconds for very old servers that do not support LAN 2.0 (approx. 300 units).

regards

Michael

Hi @martin.hirschvogel, do you also plan to integrate Andreas’ Redfish checks for Lenovo XClarity, at least somewhere further down the roadmap? (Checkmk Exchange)

I know this was not the original call for beta testing, but I can confirm that this works nicely, as long as you use pip3 install 'urllib3<2' redfish for this plugin, too.

Kind regards,
Dirk.

This will 99% not happen, as we are targeting a generic Redfish agent that works for everything. That’s why we need the testing. Please try the generic agent and see if it works for you (I wouldn’t be surprised if Andreas is already using it himself for Lenovo).

The universal redfish agent has all the checks included from the xClarity agent.
That means with the universal agent you should get the same information and a bit more than with the xClarity one.
