Redfish problems (trying to monitor iLO 5)

CMK version: 2.3.0p6

We recently upgraded our Checkmk version from 2.1 to 2.3. We had to replace our old plugins that monitored our iLOs, which came from the Exchange (written by Andreas Doehler). The original plugin was specific to iLOs and used the Redfish API. I saw that a Redfish plugin now ships with Checkmk 2.3, so the old plugin was deprecated, and I removed it accordingly.

I created a new rule for “Redfish compatible management controller” which contains the API username and password used to connect to our iLOs. This seemed to work initially, but it keeps failing after the initial discovery. Every time I run a tabula rasa on the iLO host, all services are discovered properly and everything is fine; then, 5 minutes later, it starts failing again.

I don’t know what’s wrong. I then found some updates on Andreas’ GitHub, which I installed. Now the plugin fails completely.

Any help would be appreciated. This used to work flawlessly, but it has really struggled since the update.

Output below:

Output of “cmk --debug -vvn hostname”:

value store: synchronizing
Trying to acquire lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
Got lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
value store: loading from disk
Releasing lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
Released lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
Checkmk version 2.3.0p6
+ FETCHING DATA
  Source: SourceInfo(hostname='ukccd-p-esx01r', ipaddress='10.44.1.131', ident='special_redfish', fetcher_type=<FetcherType.SPECIAL_AGENT: 6>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f25ade933e0]
Read from cache: AgentFileCache(ukccd-p-esx01r, path_template=/omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/{hostname}, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=6)
Not using cache (Too old. Age is 744 sec, allowed is 0 sec)
Calling: /omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/libexec/agent_redfish -u checkmk --password-id uuid2dba00aa-ae49-43ce-b209-89b3f61058d3:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https -m Memory,Power,Processors,Thermal,FirmwareInventory,NetworkAdapters,NetworkInterfaces,EthernetInterfaces,Storage,ArrayControllers,SmartStorage,HostBusAdapters,PhysicalDrives,LogicalDrives,Drives,Volumes,SimpleStorage 10.44.1.131
Get data from program
[cpu_tracking] Stop [7f25ade933e0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.02, children_system=0.01, elapsed=0.05000000074505806))]
[cpu_tracking] Start [7f25ad414260]
+ PARSE FETCHER RESULTS
Check_MK Agent       PEND Check plug-in received no monitoring data
[cpu_tracking] Stop [7f25ad414260 - Snapshot(process=posix.times_result(user=0.010000000000000231, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.009999997913837433))]
[special_redfish] ModuleNotFoundError: No module named 'cmk.plugins.redfish.special_agents.agent_redfish'(!!), Missing monitoring data for all plugins(!), execution time 0.1 sec | execution_time=0.060 user_time=0.010 system_time=0.000 children_user_time=0.020 children_system_time=0.010 cmk_time_ds=0.020
Agent exited with code 1: Traceback (most recent call last):
  File "/omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/libexec/agent_redfish", line 10, in <module>
    from cmk.plugins.redfish.special_agents.agent_redfish import main
ModuleNotFoundError: No module named 'cmk.plugins.redfish.special_agents.agent_redfish'(!!)
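The traceback shows the wrapper script under `cmk_addons/plugins/redfish/libexec/` trying to import `cmk.plugins.redfish.special_agents.agent_redfish`, a package path that does not exist in this installation. A minimal sketch for checking which of the module paths seen in this thread can actually be imported, assuming it is run as the site user with the site's own `python3` so the site's `sys.path` applies. The candidate names are copied from the tracebacks in this thread, not from any official list:

```python
import importlib.util

# Module paths taken from the tracebacks in this thread (assumption: your
# MKP version may use different package names).
CANDIDATES = [
    "cmk.plugins.redfish.special_agents.agent_redfish",
    "cmk_addons.plugins.redfish.special_agents.agent_redfish",
]


def module_available(dotted_name: str) -> bool:
    """Return True if dotted_name can be located on the current sys.path.

    find_spec() imports the parent packages first, so a missing parent
    raises ModuleNotFoundError instead of returning None; both cases are
    reported as "missing".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        return False


if __name__ == "__main__":
    for name in CANDIDATES:
        print(f"{name}: {'found' if module_available(name) else 'MISSING'}")
```

If only the `cmk_addons` path is found, the wrapper script and the installed package disagree about where the agent lives, which fits Andreas's later explanation that some paths were not rewritten.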

I downgraded the Redfish plugin back to v2.3.38 and now get the following debug output:

value store: synchronizing
Trying to acquire lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
Got lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
value store: loading from disk
Releasing lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
Released lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx01r
Checkmk version 2.3.0p6
Updating IPv4 DNS cache for ukccd-p-esx01r: 10.44.1.131
Trying to acquire lock on /omd/sites/ctshirts/var/check_mk/ipaddresses.cache
Got lock on /omd/sites/ctshirts/var/check_mk/ipaddresses.cache
Releasing lock on /omd/sites/ctshirts/var/check_mk/ipaddresses.cache
Released lock on /omd/sites/ctshirts/var/check_mk/ipaddresses.cache
+ FETCHING DATA
  Source: SourceInfo(hostname='ukccd-p-esx01r', ipaddress='10.44.1.131', ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7fc666462f60]
Read from cache: NoCache(ukccd-p-esx01r, path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
No piggyback files for 'ukccd-p-esx01r'. Skip processing.
No piggyback files for '10.44.1.131'. Skip processing.
Get piggybacked data
[cpu_tracking] Stop [7fc666462f60 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
[cpu_tracking] Start [7fc666ec21b0]
+ PARSE FETCHER RESULTS
  HostKey(hostname='ukccd-p-esx01r', source_type=<SourceType.HOST: 1>)  -> Add sections: []
Received no piggyback data
No piggyback files for 'ukccd-p-esx01r'. Skip processing.
No piggyback files for '10.44.1.131'. Skip processing.
[cpu_tracking] Stop [7fc666ec21b0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
[piggyback] Success (but no data found for this host), execution time 0.0 sec | execution_time=0.000 user_time=0.000 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.000

This made discovery work again after a tabula rasa, but it still fails constantly and at random. It always works when I run a discovery manually, but fails afterwards.

This message is strange. I also upgraded a system to 2.3 today.
Please first check the output of “cmk -D hostname”, specifically the line starting with

Type of agent:
  Program: /omd/sites/....

If you run this complete special agent call manually, you can add “--debug” and “-vvv” to get more detailed output.
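A small sketch that pulls the program call out of the “Type of agent:” line of `cmk -D` output and checks whether the executable exists and is executable for the current user. The function names are hypothetical; the parsing relies only on the `Program:` prefix shown in this thread:

```python
import os
import shlex


def agent_program_argv(cmk_dump: str) -> list[str]:
    """Return the special agent call from `cmk -D` output as an argv list.

    Looks for the first line containing 'Program:' (as in
    'Type of agent:   Program: /omd/sites/...') and splits the rest of
    that line the way a POSIX shell would.
    """
    for line in cmk_dump.splitlines():
        if "Program:" in line:
            return shlex.split(line.split("Program:", 1)[1])
    return []


def check_program(argv: list[str]) -> str:
    """Describe whether the agent executable can be run by this user."""
    if not argv:
        return "no 'Program:' line found; is a special agent assigned to this host?"
    path = argv[0]
    if not os.path.isfile(path):
        return f"missing: {path}"
    if not os.access(path, os.X_OK):
        return f"not executable for this user: {path}"
    return f"ok: {path}"
```

Capture the text of `cmk -D <host>` as the site user and pass it to `agent_program_argv()`, then hand the result to `check_program()`. A “missing:” result points at an outdated command line or an incomplete MKP installation rather than at the iLO itself.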

From my point of view, it looks like some old parts were left behind or the MKP was not installed correctly.

Your second output looks like no agent at all is assigned to the host object.
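One quick way to see which data sources a host actually has is to pull the `ident=` values out of the `Source: SourceInfo(...)` lines of `cmk --debug -vvn` output. In the first output above, the source is `special_redfish`; the second lists only `piggyback`, which matches the observation that no agent is assigned. A minimal sketch, assuming the log format shown in this thread:

```python
import re

# Matches e.g. "Source: SourceInfo(hostname='h', ipaddress='1.2.3.4', ident='special_redfish', ..."
_SOURCE_RE = re.compile(r"Source: SourceInfo\(.*?ident='([^']+)'")


def data_sources(cmk_output: str) -> list[str]:
    """Return the data source idents listed in `cmk -vvn` output, in order."""
    return _SOURCE_RE.findall(cmk_output)


def has_special_agent(cmk_output: str) -> bool:
    """True if any source ident looks like a special agent ('special_...')."""
    return any(ident.startswith("special_") for ident in data_sources(cmk_output))
```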

Output of the cmk -D hostname command:

Type of agent:          Program: /omd/sites/ctshirts/local/lib/python3/cmk/plugins/redfish/libexec/agent_redfish -u checkmk --password-id uuid2dba00aa-ae49-43ce-b209-89b3f61058d3:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https -m Memory,Power,Processors,Thermal,FirmwareInventory,NetworkAdapters,NetworkInterfaces,EthernetInterfaces,Storage,ArrayControllers,SmartStorage,HostBusAdapters,PhysicalDrives,LogicalDrives,Drives,Volumes,SimpleStorage 10.44.1.139

Output of the full verbose debug:

value store: synchronizing
Trying to acquire lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
Got lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
value store: loading from disk
Releasing lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
Released lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
Checkmk version 2.3.0p6
+ FETCHING DATA
  Source: SourceInfo(hostname='ukccd-p-esx05r', ipaddress='10.44.1.139', ident='special_redfish', fetcher_type=<FetcherType.SPECIAL_AGENT: 6>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f7eae101940]
Read from cache: AgentFileCache(ukccd-p-esx05r, path_template=/omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/{hostname}, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=6)
Not using cache (Too old. Age is 5 sec, allowed is 0 sec)
Calling: /omd/sites/ctshirts/local/lib/python3/cmk/plugins/redfish/libexec/agent_redfish -u checkmk --password-id uuid2dba00aa-ae49-43ce-b209-89b3f61058d3:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https -m Memory,Power,Processors,Thermal,FirmwareInventory,NetworkAdapters,NetworkInterfaces,EthernetInterfaces,Storage,ArrayControllers,SmartStorage,HostBusAdapters,PhysicalDrives,LogicalDrives,Drives,Volumes,SimpleStorage 10.44.1.139
Get data from program
Write data to cache file /omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/ukccd-p-esx05r
Trying to acquire lock on /omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/ukccd-p-esx05r
Got lock on /omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/ukccd-p-esx05r
Releasing lock on /omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/ukccd-p-esx05r
Released lock on /omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/ukccd-p-esx05r
[cpu_tracking] Stop [7f7eae101940 - Snapshot(process=posix.times_result(user=0.010000000000000231, system=0.0, children_user=0.54, children_system=0.1, elapsed=2.8599999994039536))]
[cpu_tracking] Start [7f7ead4a7a70]
+ PARSE FETCHER RESULTS
<<<check_mk:sep(32)>>> / Transition NOOPParser -> HostSectionParser
<<<redfish_manager:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_system:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_storage:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_ethernetinterfaces:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_networkinterfaces:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_memory:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_processors:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_arraycontrollers:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_hostbusadapters:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_ethernetinterfaces:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_networkadapters:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_chassis:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_power:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_thermal:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<redfish_networkadapters:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
  HostKey(hostname='ukccd-p-esx05r', source_type=<SourceType.HOST: 1>)  -> Add sections: ['check_mk', 'redfish_arraycontrollers', 'redfish_chassis', 'redfish_ethernetinterfaces', 'redfish_hostbusadapters', 'redfish_manager', 'redfish_memory', 'redfish_networkadapters', 'redfish_networkinterfaces', 'redfish_power', 'redfish_processors', 'redfish_storage', 'redfish_system', 'redfish_thermal']
Received no piggyback data
Perfdata(name='01-Inlet Ambient', value=15.0, levels_upper=('fixed', (42.0, 42.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='02-CPU 1', value=40.0, levels_upper=('fixed', (70.0, 70.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='03-CPU 2', value=40.0, levels_upper=('fixed', (70.0, 70.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='04-P1 DIMM 1-6', value=30.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='06-P1 DIMM 7-12', value=30.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='08-P2 DIMM 1-6', value=30.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='10-P2 DIMM 7-12', value=31.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='14-Stor Batt 1', value=15.0, levels_upper=('fixed', (60.0, 60.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='15-Front Ambient', value=18.0, levels_upper=('fixed', (60.0, 60.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='16-VR P1', value=29.0, levels_upper=('fixed', (115.0, 115.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='17-VR P2', value=31.0, levels_upper=('fixed', (115.0, 115.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='18-VR P1 Mem 1', value=23.0, levels_upper=('fixed', (115.0, 115.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='19-VR P1 Mem 2', value=23.0, levels_upper=('fixed', (115.0, 115.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='20-VR P2 Mem 1', value=24.0, levels_upper=('fixed', (115.0, 115.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='21-VR P2 Mem 2', value=25.0, levels_upper=('fixed', (115.0, 115.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='22-Chipset', value=42.0, levels_upper=('fixed', (100.0, 100.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='23-BMC', value=68.0, levels_upper=('fixed', (110.0, 110.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='24-BMC Zone', value=39.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='25-HD Controller', value=52.0, levels_upper=('fixed', (100.0, 100.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='26-HD Cntlr Zone', value=29.0, levels_upper=('fixed', (85.0, 85.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='29-I/O Zone', value=29.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='30.1-PCI 1-I/O module', value=51.0, levels_upper=('fixed', (100.0, 105.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='31-PCI 1 Zone', value=36.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='33-PCI 2 Zone', value=36.0, levels_upper=('fixed', (90.0, 90.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='38-Battery Zone', value=33.0, levels_upper=('fixed', (75.0, 75.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='39-P/S 1 Inlet', value=22.0, levels_upper=None, levels_lower=None, boundaries=(None, None))
Perfdata(name='40-P/S 2 Inlet', value=35.0, levels_upper=None, levels_lower=None, boundaries=(None, None))
Perfdata(name='41-P/S 1', value=40.0, levels_upper=None, levels_lower=None, boundaries=(None, None))
Perfdata(name='42-P/S 2', value=40.0, levels_upper=None, levels_lower=None, boundaries=(None, None))
Perfdata(name='43-E-Fuse', value=28.0, levels_upper=('fixed', (100.0, 100.0)), levels_lower=None, boundaries=(None, None))
Perfdata(name='44-P/S 2 Zone', value=35.0, levels_upper=('fixed', (75.0, 75.0)), levels_lower=None, boundaries=(None, None))
###REDACTED OUTPUT###
[cpu_tracking] Stop [7f7ead4a7a70 - Snapshot(process=posix.times_result(user=0.04999999999999982, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.05000000074505806))]
value store: synchronizing
Trying to acquire lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
Got lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
value store: already loaded
Releasing lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
Released lock on /omd/sites/ctshirts/tmp/check_mk/counters/ukccd-p-esx05r
[special_redfish] Success, execution time 2.9 sec | execution_time=2.910 user_time=0.060 system_time=0.000 children_user_time=0.540 children_system_time=0.100 cmk_time_ds=2.210

When I update to your latest version (2.3.44), it breaks completely, and it does indeed look as if the entire plugin is missing.

Very odd behaviour. Not really sure where to go next.

Despite the debug output showing successful discovery data, it keeps showing the following:

First, I would do the following steps:
Disable my latest version of the MKP and remove it.
Enable the included MKP.
Your output from this version looked good.
What happens if you run “cmk --debug -vvII ukccd-p-esx05r”? Only the last lines matter, i.e. whether it finds any services.
After that, run “cmk --debug -vvR”.
If this runs correctly, what happens to the host in the monitoring?

What I requested in my last post was not a debug run of the “cmk” command. I wanted to know what happens if you execute the agent program call manually.
In your case:

/omd/sites/ctshirts/local/lib/python3/cmk/plugins/redfish/libexec/agent_redfish --debug -vvv -u checkmk --password-id uuid2dba00aa-ae49-43ce-b209-89b3f61058d3:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https 10.44.1.139

The module selection (“-m”) is not needed if you do not exclude anything. I only added “--debug” and “-vvv”.
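The change Andreas describes (dropping the module selection and adding the debug switches right after the program path) can be sketched as a small argv transformation. This is a hypothetical helper, not part of Checkmk or the MKP, and it assumes `-m` is always followed by one comma-separated argument, as in the calls shown in this thread:

```python
import shlex


def manual_debug_argv(argv: list[str]) -> list[str]:
    """Turn a special agent call into a manual debug call.

    - drops "-m <modules>" (not needed when nothing is excluded)
    - inserts "--debug -vvv" directly after the program path
    """
    if not argv:
        return []
    rest: list[str] = []
    skip_next = False
    for arg in argv[1:]:
        if skip_next:
            skip_next = False
            continue
        if arg == "-m":
            skip_next = True  # also skip the module list that follows
            continue
        rest.append(arg)
    return [argv[0], "--debug", "-vvv", *rest]


# Module list shortened for readability; otherwise taken from the "Calling:" line.
calling = (
    "/omd/sites/ctshirts/local/lib/python3/cmk/plugins/redfish/libexec/agent_redfish "
    "-u checkmk --password-id uuid2dba00aa-ae49-43ce-b209-89b3f61058d3:"
    "/omd/sites/ctshirts/var/check_mk/passwords_merged -P https "
    "-m Memory,Power,Processors,Thermal 10.44.1.139"
)
print(shlex.join(manual_debug_argv(shlex.split(calling))))
```

With this sample, the printed command is the same one Andreas posted. Run the result as the site user, not as root.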

In 2.3.44, it looks like I forgot to rewrite all the paths.
2.3.43 from GitHub should work, but I will fix this problem.

The fixed version 2.3.45 is on GitHub and under review for the Exchange.

I have updated to 2.3.45, and it is working better than 2.3.44 did, but I am seeing the same issues as with the shipped version (2.3.38).

The issue is that it works fine for a while, but every few minutes the Check_MK service goes critical and says that the plugin files cannot be found.

I tried to run this command:
/omd/sites/ctshirts/local/lib/python3/cmk/plugins/redfish/libexec/agent_redfish --debug -vvv -u checkmk --password-id uuid2dba00aa-ae49-43ce-b209-89b3f61058d3:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https 10.44.1.139

However, this folder appears to be completely empty! I get “command not found”. This would seem to indicate that some of the agent files are missing, which I suspect means that the MKP did not install correctly. It is still bizarre that it works at all if this is the case!

How would I go about resolving the missing files?

After installing 2.3.45, you need to get the new command line with “cmk -D hostname”.
Please don’t run this agent as root or with sudo. Do all these checks only as the site user.
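A minimal guard you could run before any manual test, to catch the root/sudo case Andreas warns about. It assumes the site user's login environment exports `OMD_SITE`, as OMD site environments normally do; treat that variable as an assumption and check it on your own system. The helper name is hypothetical:

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def site_user_problem(euid: int, env: Mapping[str, str]) -> str | None:
    """Return a warning if this does not look like an OMD site user shell.

    Assumption: the site environment sets OMD_SITE; a missing value is
    reported rather than treated as fatal.
    """
    if euid == 0:
        return "running as root: switch to the site user first (su - <site>)"
    if not env.get("OMD_SITE"):
        return "OMD_SITE is not set: this does not look like a site user shell"
    return None


if __name__ == "__main__":
    problem = site_user_problem(os.geteuid(), os.environ)
    print(problem or f"ok: site {os.environ['OMD_SITE']}")
```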

Aaaah, I see what you mean now! Thanks, and sorry, I keep forgetting that it should all be run as the OMD site user.

This command now works and retrieves all of the correct information when logged on as site user:

/omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/libexec/agent_redfish -u checkmk --password-id uuid6779d0f0-9ceb-46f2-8d7a-e4cb1158e99d:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https 10.44.1.139

Output of “cmk --debug -vvII ukccd-p-esx05r” is blank.

After running “cmk --debug -vvR”, nothing really changed; I am still seeing this service flapping on all of my iLO hosts:

I didn’t realise the I’s in “-vvII” were uppercase! The forum font makes them look like lowercase l’s!! :slight_smile:

This outputs the following:

<TRUNCATED>
SUCCESS - Found 2 host labels
+ ANALYSE DISCOVERED SERVICES
+ EXECUTING DISCOVERY PLUGINS (14)
  Trying discovery with: redfish_fans, redfish_ethernetinterfaces, redfish_arraycontrollers_hpe, checkmk_agent, redfish_temperatures, redfish_networkadapters, redfish_system, redfish_processors, redfish_memory_summary, redfish_storage, redfish_psu, redfish_memory, redfish_arraycontrollers_generic, redfish_voltage
  1 checkmk_agent
  1 redfish_arraycontrollers_hpe
  4 redfish_ethernetinterfaces
  7 redfish_fans
 16 redfish_memory
  1 redfish_memory_summary
  4 redfish_networkadapters
  2 redfish_processors
  2 redfish_psu
  1 redfish_storage
  1 redfish_system
 31 redfish_temperatures
SUCCESS - Found 71 services

I tried again on another iLO host that had gone stale, and this was the output of the command:

OMD[ctshirts]:~$ cmk --debug -vvII ukccd-p-esx04r
Discovering services and host labels on: ukccd-p-esx04r
ukccd-p-esx04r:
+ FETCHING DATA
  Source: SourceInfo(hostname='ukccd-p-esx04r', ipaddress='10.44.1.137', ident='special_redfish', fetcher_type=<FetcherType.SPECIAL_AGENT: 6>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7feaf9e9f560]
Read from cache: AgentFileCache(ukccd-p-esx04r, path_template=/omd/sites/ctshirts/tmp/check_mk/data_source_cache/special_redfish/{hostname}, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=1)
Calling: /omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/libexec/agent_redfish -u checkmk --password-id uuid6779d0f0-9ceb-46f2-8d7a-e4cb1158e99d:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https 10.44.1.137
Get data from program
[cpu_tracking] Stop [7feaf9e9f560 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.55, children_system=0.05, elapsed=2.5900000035762787))]
+ PARSE FETCHER RESULTS
+ ANALYSE DISCOVERED HOST LABELS
Trying host label discovery with:
Trying host label discovery with:
SUCCESS - Found no host labels
+ ANALYSE DISCOVERED SERVICES
+ EXECUTING DISCOVERY PLUGINS (0)
  Trying discovery with:
SUCCESS - Found no services

It seems to flap between discovering services and not, and it happens constantly. If I run this command multiple times, it will find the services, then it won’t.
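The good and bad `-vvII` runs differ mainly in their summary lines: `EXECUTING DISCOVERY PLUGINS (14)` and `Found 71 services` versus `(0)` and `Found no services`. One way to document the flapping (for example, for support) is to save the output of each run and reduce it to those two numbers. A minimal parsing sketch, assuming the output wording shown in this thread; the function name is hypothetical:

```python
from __future__ import annotations

import re

_PLUGINS_RE = re.compile(r"EXECUTING DISCOVERY PLUGINS \((\d+)\)")
_SERVICES_RE = re.compile(r"SUCCESS - Found (no|\d+) services")


def discovery_summary(output: str) -> tuple[int | None, int | None]:
    """Return (discovery plugins tried, services found) from `cmk -vvII` output.

    Either value is None if the corresponding line is missing.
    """
    plugins = _PLUGINS_RE.search(output)
    services = _SERVICES_RE.search(output)
    n_plugins = int(plugins.group(1)) if plugins else None
    if services is None:
        n_services = None
    elif services.group(1) == "no":
        n_services = 0
    else:
        n_services = int(services.group(1))
    return n_plugins, n_services
```

A run with 0 discovery plugins, where even `checkmk_agent` is not tried, suggests the fetched agent output contained no sections at all. The bad run above also has no `Add sections:` line after `PARSE FETCHER RESULTS`, so the problem likely sits in the fetch, not in the check plugins.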

That would be strange.
Your last post looked OK: you found 71 services. After this, I would run “cmk --debug -vvR” to activate these newly found services.

Yes, that’s the weird thing: it looks like it’s working fine, but all of the services keep randomly going stale, and it can’t find them. Then it goes back to normal. I just can’t figure out why they keep going stale.

After running “cmk --debug -vvR”, nothing has changed. I currently see this:

I checked the host “ukccd-p-esx02r”, which showed all services as vanished, and ran a manual service discovery; all the services reappeared.

It then works for a few minutes before showing vanished or stale services again.
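Each fetch writes the raw agent output to the data source cache (see the `Write data to cache file .../data_source_cache/special_redfish/<host>` line in the debug output above). A sketch for checking, right after services go stale, how old that file is and which sections it contains. The path layout is copied from the debug output in this thread; the helper names are hypothetical, and the sketch assumes the cache holds the raw agent output as plain text:

```python
from __future__ import annotations

import re
import time
from pathlib import Path

# Section headers look like "<<<redfish_memory:sep(0)>>>"; capture only the name.
_SECTION_RE = re.compile(r"^<<<([^:>]+)(?::[^>]*)?>>>$", re.MULTILINE)


def cache_file(site: str, host: str) -> Path:
    """Cache location as printed in the debug output of this thread."""
    return Path(f"/omd/sites/{site}/tmp/check_mk/data_source_cache/special_redfish/{host}")


def sections_in(agent_output: str) -> list[str]:
    """Sorted, de-duplicated section names found in raw agent output."""
    return sorted(set(_SECTION_RE.findall(agent_output)))


def inspect_cache(site: str, host: str) -> tuple[float, list[str]]:
    """Return (age in seconds, section names) of the last cached fetch."""
    path = cache_file(site, host)
    age = time.time() - path.stat().st_mtime
    return age, sections_in(path.read_text(errors="replace"))
```

For example, `inspect_cache("ctshirts", "ukccd-p-esx02r")`, run as the site user while the services are stale, shows whether the last fetch produced the usual `redfish_*` sections or an empty result.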

Is this worth logging with the support team? We do have full enterprise support; I just thought that, since you wrote the plugin, you’d be the best person to speak to anyway!

Really appreciate your help by the way!

I did some more debugging. I ran the agent directly against one of the flapping iLO hosts, and now this appears:

OMD[ctshirts]:~$ /omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/libexec/agent_redfish -u checkmk --password-id uuid6779d0f0-9ceb-46f2-8d7a-e4cb1158e99d:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https 10.44.1.139
Agent failed - please submit a crash report! (Crash-ID: 9980126a-2d73-11ef-b0c6-13a800439a43)

It looks like it’s now crashing; I cannot get this to run anymore. The actual hosts are still flapping between all OK and Redfish failing.

I did log this with support, but they say they won’t support this plugin since it’s still experimental. Looks like I will have to go back to SNMP monitoring until we can fix this! :frowning:

Of course I would prefer to get this Redfish plugin to work, and will happily assist in all troubleshooting, so any help you can still offer is really appreciated!

If you run the agent manually, I would always include the “--debug” and “-vv” switches.

From your description, I would say that you have a real problem inside your site.

My test to get a clean result would be:

  • install a clean 2.3.0p6 site
  • only install the latest Redfish MKP, without activating the included ones
  • add only one of the hosts at first

This is why I really didn’t want to upgrade :frowning:

That’s a lot of work! I’ve already spent 3 days fixing issues after the upgrade lol. Do you know what could be wrong with the site, so I can get support to help me? They won’t help if I mention this plugin.

Ah yes it would help if I used the debug switches!

Apparently I’m getting some odd password issue:

OMD[ctshirts]:~$ /omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/libexec/agent_redfish --debug -vvv -u checkmk --password-id uuid6779d0f0-9ceb-46f2-8d7a-e4cb1158e99d:/omd/sites/ctshirts/var/check_mk/passwords_merged -P https 10.44.1.139
INFO 2024-06-18 16:03:30 root: running file /omd/sites/ctshirts/lib/python3/cmk/special_agents/v0_unstable/agent_common.py
INFO 2024-06-18 16:03:30 root: using Python interpreter v3.12.3.final.0 at /omd/sites/ctshirts/bin/python3
DEBUG 2024-06-18 16:03:30 root: args: {'debug': True, 'verbose': 3, 'vcrtrace': False, 'user': 'checkmk', 'password': None, 'password_id': 'uuid6779d0f0-9ceb-46f2-8d7a-e4cb1158e99d:/omd/sites/ctshirts/var/check_mk/passwords_merged', 'proto': 'https', 'port': 443, 'sections': 'Power,Thermal,Memory,NetworkAdapters,NetworkInterfaces,Processors,Storage,EthernetInterfaces,FirmwareInventory,SmartStorage,ArrayControllers,HostBusAdapters,LogicalDrives,PhysicalDrives,SimpleStorage,Drives,Volumes', 'verify_ssl': False, 'timeout': 3, 'retries': 2, 'host': '10.44.1.139'}
INFO 2024-06-18 16:03:30 redfish: Redfish API
Traceback (most recent call last):
  File "/omd/sites/ctshirts/lib/python3/cmk/utils/password_store/__init__.py", line 181, in lookup
    return load(pw_file)[pw_id]
           ~~~~~~~~~~~~~^^^^^^^
KeyError: 'uuid6779d0f0-9ceb-46f2-8d7a-e4cb1158e99d'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/libexec/agent_redfish", line 13, in <module>
    sys.exit(main())
             ^^^^^^
  File "/omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py", line 638, in main
    return special_agent_main(parse_arguments, agent_redfish_main)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/ctshirts/lib/python3/cmk/special_agents/v0_unstable/agent_common.py", line 171, in special_agent_main
    return _special_agent_main_core(parse_arguments, main_fn, argv or sys.argv[1:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/ctshirts/lib/python3/cmk/special_agents/v0_unstable/agent_common.py", line 148, in _special_agent_main_core
    return main_fn(args)
           ^^^^^^^^^^^^^
  File "/omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py", line 627, in agent_redfish_main
    redfishobj = get_session(args)
                 ^^^^^^^^^^^^^^^^^
  File "/omd/sites/ctshirts/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py", line 577, in get_session
    else password_store.lookup(Path(pw_path), pw_id)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/ctshirts/lib/python3/cmk/utils/password_store/__init__.py", line 185, in lookup
    raise ValueError(f"Password '{pw_id}' not found in {pw_file}")
ValueError: Password 'uuid6779d0f0-9ceb-46f2-8d7a-e4cb1158e99d' not found in /omd/sites/ctshirts/var/check_mk/passwords_merged
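The traceback shows the agent splitting `--password-id` into an ID and a store file path, then failing because that ID is not in `passwords_merged`. The ID in this call (`uuid6779d0f0-...`) also differs from the one in the earlier calls (`uuid2dba00aa-...`), so it is worth checking whether the ID referenced by the current rule actually exists in the store file the agent reads. The sketch below only mirrors the behaviour visible in the traceback. The real `cmk.utils.password_store.load()` and its file format are not reproduced, so the loaded store is modelled as a plain dict (an assumption), and both helper names are hypothetical:

```python
from __future__ import annotations

from collections.abc import Mapping
from pathlib import Path


def split_password_id(value: str) -> tuple[str, Path]:
    """Split an 'ID:PATH' --password-id value at the first colon."""
    pw_id, sep, pw_path = value.partition(":")
    if not sep or not pw_id or not pw_path:
        raise ValueError(f"expected 'ID:PATH', got {value!r}")
    return pw_id, Path(pw_path)


def lookup(store: Mapping[str, str], pw_id: str, pw_file: Path) -> str:
    """Raise the same kind of error as in the traceback when an ID is missing.

    `store` stands in for the loaded password file (assumption: the real
    loader and file format are not shown here).
    """
    try:
        return store[pw_id]
    except KeyError:
        raise ValueError(f"Password '{pw_id}' not found in {pw_file}") from None
```

For example, a store that contains only the older `uuid2dba00aa-...` entry reproduces the message above when asked for `uuid6779d0f0-...`.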