Hey!
We are currently planning to support monitoring via Redfish. Redfish is an interface for the remote maintenance of servers.
Most new management boards come with Redfish support (e.g. Dell iDRAC8, HPE iLO4, Lenovo XClarity, Supermicro X10, Cisco IMC) and monitoring via Redfish has become a great alternative to monitoring via IPMI/SNMP here.
Our plan is to adopt the generic Redfish Special Agent from @andreas-doehler into Checkmk (Checkmk Exchange). Andreas already uses it for the data centers that he monitors and it works very well there so far.
The goal would be for us to be able to monitor all servers that speak Redfish with one special agent. The nice thing is that you then get the same services for all servers and can use the same rules everywhere.
We are now looking for more people to test the monitoring via Redfish in the field and let us know whether everything works as expected. Any insights are helpful (e.g. sensors, disks etc. missing).
Recommendation: Do this on a test site!
To get started with the tests, all you need to do is
install the redfish python package as the SITE user: pip3 install 'urllib3<2' redfish
activate Redfish on the servers, if not already active
create a host for each management board you want to monitor via Redfish and configure the rule "Redfish Compatible Management Controller" for that host
Note: Testing this on the Checkmk appliance is currently not possible due to the missing package.
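To double-check the first step before creating hosts, something like the following could be run as the SITE user. It is only a minimal sketch: the helper names are made up for illustration, and the only rule it encodes is the 'urllib3<2' pin from the install command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Extract the leading major number from a version string such as '1.26.18'."""
    return int(version.split(".")[0])


def check_redfish_prerequisites(versions):
    """Return a list of problems for a {package: version-or-None} mapping."""
    problems = []
    if versions.get("redfish") is None:
        problems.append("redfish is not installed")
    urllib3 = versions.get("urllib3")
    if urllib3 is None:
        problems.append("urllib3 is not installed")
    elif major_version(urllib3) >= 2:
        # The install command pins 'urllib3<2', so 2.x counts as a problem here.
        problems.append(f"urllib3 {urllib3} is too new, expected <2")
    return problems


if __name__ == "__main__":
    found = {name: installed_version(name) for name in ("redfish", "urllib3")}
    for line in check_redfish_prerequisites(found) or ["looks fine"]:
        print(line)
```

An empty result only says that the two packages are visible to the current user in the expected versions. It does not test the connection to a management board.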
Happy monitoring and thanks to Andreas for his great contribution here, which we would like to merge once we have received sufficient feedback from users in the field.
Cheers, Martin
We installed it today as the SITE user on a satellite in a distributed environment and ran into errors.
It was not possible to activate new changes on the affected SITE, even creating new hosts.
Then we uninstalled redfish, urllib3, boto3 and botocore which were installed and upgraded before by us.
Installed version was:
redfish 3.2.1
urllib3 2.0.7
boto3 1.29.0
botocore 1.32.0
We needed to run "omd reload" twice to "fix" this error.
We are running Checkmk 2.2.0p8 on Ubuntu 22.04.03.
If more information is needed, feel free to ask for it.
Sadly true, and so far we have not uninstalled the packages that were installed as the root user.
We just uninstalled the packages for the SITE User, but not with the root user.
I guess it is recommended to uninstall them as root user as well?
We installed 1st as root user:
pip3 install 'urllib3<2' redfish
We would like to monitor a Raritan PX4 device and got the error "ModuleNotFoundError: No module named 'redfish'"
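A quick way to see whether a given user's Python can find the module at all is to ask the import system where it would load it from. This is just a sketch using the standard library; `module_origin` is a made-up helper name, and the result depends on which user and which interpreter runs it.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Run this once as root and once as the SITE user to compare the results.
    print("interpreter:", sys.executable)
    print("redfish:", module_origin("redfish") or "not importable for this user")
```

If the module shows up for root but not for the SITE user, the package most likely landed in a location that the site's Python does not search.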
Then we switched to the SITE user and ran the following pip3 commands:
pip3 install redfish
pip3 install 'urllib3<2'
pip3 install redfish
pip3 uninstall 'urllib3<2'
pip3 install boto3/boto
pip3 install boto3
pip3 list
pip3 list | grep urllib
pip3 list | grep boto
pip3 install redfish
pip3 install boto3
pip3 install boto3 --upgrade
pip3 list | grep boto
pip3 list | grep urllib
pip3 install 'urllib3<2'
pip3 install urllib3<2
pip3 install 'urllib3<2' redfish
pip3 install 'urllib3<2' redfish --upgrade
pip3 uninstall boto3
pip3 uninstall botocore
pip3 uninstall urlib3
pip3 uninstall urllib3
pip3 uninstall redfish
The following error occurred after "pip3 install redfish":
DEPRECATION: jsonpath-rw is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at Deprecate call to `setup.py install` when `wheel` is absent for source distributions without pyproject.toml · Issue #8559 · pypa/pip · GitHub
Running setup.py install for jsonpath-rw ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
botocore 1.20.102 requires urllib3<1.27,>=1.25.4, but you have urllib3 2.1.0 which is incompatible.
This is the problem. @martin.hirschvogel gave the correct pip3 install syntax to avoid this incompatible version of urllib3.
To fix this, you can run "pip3 install 'urllib3<2' --upgrade".
The "--upgrade" is only needed if other/newer versions of the libs are already installed. In that case it is really a downgrade.
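To make the conflict more tangible, here is a small sketch that checks urllib3 versions against both constraints from this thread: the 'urllib3<2' pin from the install instructions and the "urllib3<1.27,>=1.25.4" requirement reported for botocore 1.20.102. The parsing is deliberately simplified (plain dotted numbers only) and the helper names are invented for illustration.

```python
def parse(version):
    """Turn a plain dotted version like '1.26.18' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraints):
    """Check a version against a list of (operator, bound) pairs; all must hold."""
    compare = {
        "<": lambda a, b: a < b,
        ">=": lambda a, b: a >= b,
    }
    parsed = parse(version)
    return all(compare[op](parsed, parse(bound)) for op, bound in constraints)


# Constraints as quoted in this thread.
REDFISH_PIN = [("<", "2")]
BOTOCORE_REQUIREMENT = [("<", "1.27"), (">=", "1.25.4")]

if __name__ == "__main__":
    for candidate in ("1.26.18", "2.0.7", "2.1.0"):
        print(
            candidate,
            "redfish pin:", satisfies(candidate, REDFISH_PIN),
            "botocore:", satisfies(candidate, BOTOCORE_REQUIREMENT),
        )
```

Under these assumptions a 1.26.x release such as 1.26.18 passes both checks, while 2.0.7 and 2.1.0 fail both. For real version strings with pre-release tags or similar, a proper parser such as the "packaging" library would be the safer choice than this tuple comparison.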
If you get an error with botocore like the one in your output, then there are too many libs installed.
If you take a look at the folder "~/local/lib/python3/", you should see only the following entries, provided nothing other than the Redfish bindings is installed.
Nov 15 21:21 certifi-2023.7.22.dist-info/
Nov 15 21:21 certifi/
Nov 15 21:21 charset_normalizer-3.3.2.dist-info/
Nov 15 21:21 charset_normalizer/
Nov 15 21:21 decorator-5.1.1.dist-info/
Nov 15 21:21 decorator.py
Nov 15 21:21 idna-3.4.dist-info/
Nov 15 21:21 idna/
Nov 15 21:21 jsonpatch-1.33.dist-info/
Nov 15 21:21 jsonpatch.py
Nov 15 21:21 jsonpath_rw-1.4.0-py3.11.egg-info/
Nov 15 21:21 jsonpath_rw/
Nov 15 21:21 jsonpointer-2.4.dist-info/
Nov 15 21:21 jsonpointer.py
Nov 15 21:21 ply-3.11.dist-info/
Nov 15 21:21 ply/
Nov 15 21:21 redfish-3.2.1.dist-info/
Nov 15 21:21 redfish/
Nov 15 21:21 requests_toolbelt-1.0.0.dist-info/
Nov 15 21:21 requests_toolbelt/
Nov 15 21:21 requests_unixsocket-0.3.0.dist-info/
Nov 15 21:21 requests_unixsocket/
Nov 15 21:21 requests-2.31.0.dist-info/
Nov 15 21:21 requests/
Nov 15 21:21 six-1.16.0.dist-info/
Nov 15 21:21 six.py
Nov 15 21:21 urllib3-1.26.18.dist-info/
Nov 15 21:21 urllib3/
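If you do not want to compare the folder by eye, a sketch like the following could do it. The expected set is taken from the listing above, the path is the one mentioned above, and the helper names are made up for illustration. It assumes that no other packages were installed on purpose for this site.

```python
import os

# Base names taken from the directory listing above.
EXPECTED = {
    "certifi", "charset_normalizer", "decorator", "idna", "jsonpatch",
    "jsonpath_rw", "jsonpointer", "ply", "redfish", "requests_toolbelt",
    "requests_unixsocket", "requests", "six", "urllib3",
}


def base_name(entry):
    """Reduce 'redfish-3.2.1.dist-info/', 'six.py' or 'ply/' to the bare name."""
    entry = entry.rstrip("/")
    if entry.endswith((".dist-info", ".egg-info")):
        return entry.split("-")[0]
    if entry.endswith(".py"):
        return entry[:-3]
    return entry


def unexpected_entries(entries):
    """Return the sorted base names that are not part of the expected set."""
    return sorted({base_name(entry) for entry in entries} - EXPECTED)


if __name__ == "__main__":
    site_packages = os.path.expanduser("~/local/lib/python3")
    if os.path.isdir(site_packages):
        for name in unexpected_entries(os.listdir(site_packages)):
            print("unexpected:", name)
```

Anything printed here is only a hint. Entries such as cache directories may show up as well and are harmless, whereas names like boto3 or botocore would point to the extra libs discussed above.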
For now - correct me if I'm wrong - we need to add an extra host for the management board to be monitored with this agent. It is not another protocol in the management boards' list (like SNMP and IPMI). Would it be possible to add it to this list? Alternatively, would it be possible to use the "Additional IPv4 addresses" field in the host properties to avoid adding extra hosts?
Yes, you need to create an extra host for mgmt boards. I added further instructions to make clear how to use it.
create a host for each management board you want to monitor via Redfish and configure the rule "Redfish Compatible Management Controller" for that host
While mgmt boards are quite closely connected to a server, they should be two different logical entities from a monitoring perspective.
We mixed this up when we added the option "Management board" in the "Host properties" (which was a customer request).
This has led to many problems down the road, because it leads to false assumptions in monitoring and to what we believe is incorrect alerting (we are having intense discussions internally and with partners about whether this way should be deprecated).
We therefore do the following internally and recommend it as a best practice:
I fully agree with what Martin writes here. The BMC of a server is a completely separate device and has nothing to do with the OS running on the hardware. The BMC is accessible even if the OS is not running. This way, even if the CPU is burned out, you can still monitor the hardware and see what the issue is.
To manage that we use the hostname and add _mgmt for the BMC host object.
I can only support the opinion of @mike1098 and @martin.hirschvogel.
In my managed systems all management controllers are separate host objects.
Mostly it is something like "hostname-ilo" or "hostname-idrac".
Can you give a few examples where you think this has led to problems, false assumptions and incorrect alerting?
The BMC only monitors the hardware/firmware of the affected server, independently of the OS. Nevertheless, it is still the same system. If you create two hosts for this purpose, the relationship between the hosts may be missing. An example: A RAM module fails on the server. With current large RAM modules and depending on the server config, a large percentage of RAM can be lost. The remaining RAM may then fill up in the OS. Two alarms may then appear for different hosts. Depending on the size of a company and the number of monitored systems, the Ops team that handles the alarms and tickets for downstream departments will then create two tickets: one for the infra team (BMC alarm) and one for the sysadmins (OS alarm). This means that no connection is actually made between the two problems, although one is caused by the other. (Yes, I know that's a bit exaggerated)
You could create a relationship via business intelligence and corresponding rules, but the effort is much higher.
One example for a technical problem:
The BMC is not very powerful. A response via IPMI can take very long, e.g. >1 min. The host represents monitoring of the OS via the agent plus the mgmt board via IPMI. The host is checked every minute, so IPMI is queried every minute. Overlapping requests to IPMI → Boom. The typical resolution for any device that can't respond in time (within 1 min of a request...) is to decrease the polling frequency. But you can't decouple the checking of agents per host (technically you can do everything, but that again creates a lot of problems down the road).
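The overlap can be illustrated with a bit of arithmetic. The following is a deliberately simplified model, not a description of how Checkmk actually schedules checks: it assumes that a new query starts at every interval regardless of whether the previous one has finished. The function names and the 90-second example value are made up for illustration.

```python
import math


def concurrent_requests(response_seconds, interval_seconds):
    """Upper bound on queries in flight if a new one starts every interval."""
    return math.ceil(response_seconds / interval_seconds)


def minimum_interval(response_seconds, step_seconds=60):
    """Smallest multiple of step_seconds that is not shorter than the response time."""
    steps = max(1, math.ceil(response_seconds / step_seconds))
    return steps * step_seconds


if __name__ == "__main__":
    # Hypothetical BMC that needs 90 seconds to answer, polled every 60 seconds.
    print(concurrent_requests(90, 60))  # 2 queries overlap
    print(minimum_interval(90))         # 120 seconds would avoid the overlap
```

In this model a board that needs 90 seconds per answer always has two queries running at a one-minute interval, which is exactly the situation that a lower polling frequency is meant to avoid.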
Then there are inconsistency problems, which created several support requests already:
The Mgmt Board can and will have the same services as the OS, e.g. Uptime. There is a big difference between OS Uptime and Mgmt Board Uptime. For that reason, we prefix Mgmt Board services which are configured via host properties with "Management Interface:"
However, the same BMC, when monitored independently, will not have this prefix, which confuses users and breaks the rules they created for monitoring.
As a side effect of all this, the technical debt generated by allowing two ways of monitoring one thing is substantial on our side (implementing the DNS cache twice...).
Maybe this helps to understand why we internally have very intense discussions regarding the mgmt board monitoring via host properties.
At least for IBM/Lenovo I can report that the special agent runs in under 2 seconds when IPMI LAN 2.0 is used (approx. 1000 units) and in under 20 seconds for very old servers that do not support LAN 2.0 (approx. 300 units).
Hi @martin.hirschvogel, do you also plan integration of Andreas' Redfish Checks für Lenovo XClarity, at least somewhere further down the roadmap? (Checkmk Exchange)
I know this was not the original call for beta testing, but I can confirm that this works nicely, as long as you use pip3 install 'urllib3<2' redfish for this plugin, too.
This will 99% not happen, as we are targeting a generic Redfish agent that works for everything. That's why we need the testing. Please try the generic agent and see if that works for you (I wouldn't be surprised if Andreas is already using it himself for Lenovo).
The universal redfish agent has all the checks included from the xClarity agent.
That means with the universal agent you should get the same information and a bit more than with the xClarity one.