Redfish timeout monitoring iLO4

CMK version: 2.3p3
Redfish Version: 2.3.60

First of all, thanks a lot for your hard work, Andreas. It’s a great plugin to monitor iLO systems.

Unfortunately, we have some timeout problems we can’t explain. CheckMK is running on a dedicated HP G9 server with plenty of resources. No other check is having problems, nor did the SNMP checks on iLO run into timeouts before. It happens every now and then. We monitor around 30 iLOs, and Redfish crashes with a timeout every 10-15 minutes: always on a different iLO, always just once. The next check is OK again. Any idea how to fix the problem?

Crash report says, in short:

‘allow_redirects’: True,
‘args’: None,
‘attempts’: 3,
‘body’: None,
‘cause_exception’: ReadTimeout(ReadTimeoutError(“HTTPSConnectionPool(host=‘192.168.3.161’, port=443): Read timed out. (read timeout=3)”)),

I have no general idea what the reason for your problem is.
Is it happening on all systems?
You can play around with the timeout settings for the special agent, but I don’t think this will fix it in general.
It would be interesting to see a complete crash report.
I hope there is a little bit more in it, like the stack trace.

Thanks for the reply. I noticed that when I try to log in to the iLO, I always get a certificate warning first, but after I accept the warning I sometimes get an empty page. A refresh fixes the problem and I get the login page. It looks a bit like the agent runs into the same behavior.

Traceback:
File “/omd/sites/site/lib/python3/cmk/special_agents/v0_unstable/agent_common.py”, line 148, in _special_agent_main_core
return main_fn(args)
File “/omd/sites/site/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py”, line 737, in agent_redfish_main
get_information(redfishobj, sections)
File “/omd/sites/site/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py”, line 624, in get_information
fetch_extra_data(redfishobj, data_model, extra_links, sections, system)
File “/omd/sites/site/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py”, line 259, in fetch_extra_data
result = fetch_sections(redfishobj, extra_links, sections, link_list)
File “/omd/sites/site/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py”, line 213, in fetch_sections
result = fetch_collection(redfishobj, section_data, section)
File “/omd/sites/site/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py”, line 163, in fetch_collection
element_data = fetch_data(redfishobj, element.get(“@odata.id”), component)
File “/omd/sites/site/local/lib/python3/cmk_addons/plugins/redfish/special_agents/agent_redfish.py”, line 144, in fetch_data
response_url = redfishobj.get(url, None)
File “/omd/sites/site/lib/python3.12/site-packages/redfish/rest/v1.py”, line 628, in get
return self._rest_request(path, method=‘GET’, args=args,
File “/omd/sites/site/lib/python3.12/site-packages/redfish/rest/v1.py”, line 1110, in _rest_request
return super(HttpClient, self)._rest_request(path=path, method=method,
File “/omd/sites/site/lib/python3.12/site-packages/redfish/rest/v1.py”, line 954, in _rest_request
raise RetriesExhaustedError() from cause_exception

Local Variables:
{‘allow_redirects’: True,
‘args’: None,
‘attempts’: 3,
‘body’: None,
‘cause_exception’: ReadTimeout(ReadTimeoutError(“HTTPSConnectionPool(host=‘192.168.3.228’, port=443): Read timed out. (read timeout=3)”)),
‘endtime’: 3441456.687761962,
‘headers’: {‘Accept’: ‘*/*’,
‘OData-Version’: ‘4.0’,
‘inittime’: 3441455.043479192,
‘max_retry’: 2,
‘method’: ‘GET’,
‘path’: ‘/redfish/v1/Systems/1/Memory/proc1dimm1/’,
‘query_str’: None,
‘reqpath’: ‘/redfish/v1/Systems/1/Memory/proc1dimm1/’,
‘resp’: <Response [200]>,
‘restreq’: <redfish.rest.v1.RestRequest object at 0x7f8592ba75c0>,
‘restresp’: <redfish.rest.v1.RestResponse object at 0x7f8592bf7ce0>,
‘self’: <redfish.rest.v1.HttpClient object at 0x7f859441c8f0>,
‘timeout’: 3,
‘verify’: False}
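
The last traceback frame (`raise RetriesExhaustedError() from cause_exception`) together with the locals `'attempts': 3`, `'max_retry': 2` and `'timeout': 3` suggests a retry loop that gives up after repeated read timeouts. The following is only a minimal sketch of that pattern, not the redfish library's actual code: `call_with_retry`, `RetriesExhausted` and `flaky_fetch` are hypothetical names, and the built-in `TimeoutError` stands in for the HTTP read timeout.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Hypothetical stand-in for the library's RetriesExhaustedError."""


def call_with_retry(fetch: Callable[[], T], max_retry: int = 2) -> T:
    """Call fetch() up to max_retry + 1 times, retrying only on TimeoutError."""
    cause = None
    for _ in range(max_retry + 1):
        try:
            return fetch()
        except TimeoutError as exc:
            cause = exc  # remember the most recent timeout
    # Every attempt timed out: chain the last timeout as the cause.
    raise RetriesExhausted(f"gave up after {max_retry + 1} attempts") from cause


# Simulated endpoint that times out once and then answers (assumption for the demo).
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("Read timed out. (read timeout=3)")
    return "ok"


result = call_with_retry(flaky_fetch)
print(result, "after", calls["count"], "calls")
```

In this sketch a single slow response is absorbed by the retry. The crash only occurs when every attempt for the same URL times out, which would match a crash report that shows the retries as used up.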

This could be relevant if it is the memory modules every time. You can disable the memory section, as you get an overall memory state from the system section. That means if a memory module is faulty, you will still get an error message, but you must then look inside the iLO to see which module is faulty.^^

Unfortunately, it’s not that easy. Since you picked out the path variable, here are the paths from the last 10 crashes, which happened on different systems. One pattern I saw: it mainly affects G9 DL380, barely G9 DL360. The systems also have different iLO versions, 2.77 and 2.82.

‘path’: ‘/redfish/v1/Systems/1/FirmwareInventory/’,
‘path’: ‘/redfish/v1/Managers/1/’,
‘path’: ‘/redfish/v1/Systems/1/NetworkAdapters/1/’,
‘path’: ‘/redfish/v1/Systems/1/EthernetInterfaces/2/’,
‘path’: ‘/redfish/v1/Systems/1/EthernetInterfaces/’,
‘path’: ‘/redfish/v1/Systems/1/SmartStorage/ArrayControllers/’,
‘path’: ‘/redfish/v1/Systems/1/Processors/’,
‘path’: ‘/redfish/v1/Systems/1/SmartStorage/ArrayControllers/4/DiskDrives/’,
‘path’: ‘/redfish/v1/ResourceDirectory/’,
‘path’: ‘/redfish/v1/Managers/1/’,
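
To check whether the failing requests cluster on one resource, the ten paths above can be tallied. This is just a quick sketch using the standard library; `resource_group` is a hypothetical helper that takes the first path segment after `/redfish/v1/`.

```python
from collections import Counter

# The ten 'path' values quoted from the crash reports above.
paths = [
    "/redfish/v1/Systems/1/FirmwareInventory/",
    "/redfish/v1/Managers/1/",
    "/redfish/v1/Systems/1/NetworkAdapters/1/",
    "/redfish/v1/Systems/1/EthernetInterfaces/2/",
    "/redfish/v1/Systems/1/EthernetInterfaces/",
    "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/",
    "/redfish/v1/Systems/1/Processors/",
    "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/4/DiskDrives/",
    "/redfish/v1/ResourceDirectory/",
    "/redfish/v1/Managers/1/",
]


def resource_group(path: str) -> str:
    """Return the first segment after /redfish/v1/, e.g. 'Systems'."""
    parts = path.strip("/").split("/")
    return parts[2] if len(parts) > 2 else "(root)"


by_group = Counter(resource_group(p) for p in paths)
by_path = Counter(paths)

for group, count in by_group.most_common():
    print(f"{group}: {count}")
print("distinct paths:", len(by_path))
```

Nine of the ten paths are distinct and only `/redfish/v1/Managers/1/` shows up twice, so the timeouts are not tied to the memory section or to any single endpoint.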

Short update: I tested a lot of different things, and nothing helped. When I downgraded to 2.3.59, the original crashes went away, but the plugin still failed, this time with timeout errors. Increasing the timeout finally did the job. With version 2.3.59 and a timeout of 20 seconds, I have not seen a timeout or crash since.
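
For anyone tuning the same setting, it may help to see what a larger timeout means for the worst case. The sketch below assumes that each attempt waits for the full read timeout and that the agent still makes 3 attempts per URL, as in the crash report (`'attempts': 3`). Both are assumptions, and `worst_case_wait` is a hypothetical helper.

```python
def worst_case_wait(timeout_s: float, attempts: int) -> float:
    """Upper bound on the wait for one URL if every attempt times out."""
    return timeout_s * attempts


ATTEMPTS = 3  # taken from 'attempts': 3 in the crash report (assumed unchanged)

for timeout_s in (3, 20):
    total = worst_case_wait(timeout_s, ATTEMPTS)
    print(f"timeout={timeout_s}s -> up to {total}s per request")
```

With 3 seconds, a single URL can block for up to 9 seconds before the agent gives up. With 20 seconds it can block for up to 60 seconds, so it is worth comparing that figure against the overall check timeout configured for the host.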
