Call for Redfish beta testers

andreas-doehler · September 3, 2024, 7:13am

Small information about the latest versions.

Firmware inventory of the device is active if found.

Problem here - in iDRAC it can lead to an timeout as it takes too much time to fetch the table with the information.
→ Solution

image1123×747 51.2 KB

Disable the inventory in the special agent config
Bug fixed if you want to test on CLI with explicit password
Temperature sensors without reading are ignored at discovery time

Bernt · September 13, 2024, 8:43am

Very often I am getting:

Output: [special_redfish] redfish.rest.v1.RetriesExhaustedError(!!), [piggyback] Success (but no data found for this host), Missing monitoring data for all plugins(!), execution time 18.0 sec

5 minutes later it works fine again. Already have set the timeout to 30 seconds but the error keeps coming back often.
ps. have disabled the firmware versions.

Any posible ideas?

The node in question as a new HPE Alletra 4140, ILO 6 with latest ilo firmware.
When the plugin does get data, it gets all data correct(disk/cpu etcetc)

andreas-doehler · November 4, 2024, 3:35pm

Sorry for the late response. I don’t saw this post.

If all the data is missing, it means already the login is not working. That is strange.
I don’t think that it will be fixed with my newer version i have completed today.
You can only check on the command line if you execute the special agent manually with “–debug” option if you see some more information.

andreas-doehler · November 4, 2024, 3:44pm

New version available - ATTENTION - please first test this version in a test site.
The plugin has some new functions/features that i cannot test in a good way with the Redfish simulator.

New features

labels from the agent itself are now like they are from normal Linux and Windows agent.

image1233×45 5.33 KB

or

image1167×40 4.56 KB
caching for single sections can be activated inside special agent setup
the FirmwareInventory section has, as the only section, a default cache time of 9600 if nothing is defined
FirmwareInventory has an default timeout of 10 seconds beside the normal timeout definition - next step will be the possibility to define timeouts on section basis
cache files are now pickle files

I don’t think that i will backport the caching for 2.2 at the moment.

and upload to the exchange is also done

martin.hirschvogel · November 4, 2024, 4:52pm

Good that we haven’t started mainlining this
Planned for December by the way, but let’s talk on Wednesday Andreas

andreas-doehler · November 5, 2024, 6:35am

Found already one bug in processing the cache files.
If no cache file exists the agent crashes.

Bernt · November 5, 2024, 7:03am

Hi Andreas, It is a very sporadic issue but it looks like the ILO does not respond in time, times out sometimes. Testing further. If i have any more info i’ll let you know.

andreas-doehler · November 5, 2024, 7:12am

I found one error more. Fix is on the way for the 2.3.64.

With the new version? Or some older one?

Bernt · November 5, 2024, 7:23am

I will test with the new release in our test site first for a week or so and see if it helps any.

andreas-doehler · November 5, 2024, 7:28am

Next bug fixed for the cache data.

philrandal · November 29, 2024, 11:08am

After 2.3.60, I get random

[special_redfish] redfish.rest.v1.RetriesExhaustedError

errors.

I’ve edited datasource_program.py to increase the maximum permitted timeout from 20 seconds to 30, but even with a 30 second timeout I still get the error.

Any ideas?

Edit: Trying to disable sections to narrow down the issue gives this error:

ValueError: list.remove(x): x not in list

The user interface with both ‘enabled sections’ and ‘disabled sections’ is a bit confusing.

andreas-doehler · November 29, 2024, 11:53am

don’t need to be a timeout problem

You can also set the timeout from the special agent configuration.

What i have done to find problematic sections is the following.
Don’t define any enabled sections. That’s why i wrote there, that enabled sections is a legacy setting.

Where do you get this error?

philrandal · November 29, 2024, 11:56am

The special agent config limited timeout to 20 seconds. As it turns out, that wasn’t the issue.

The value error is what shows in the agent output in checkmk.

Any hints as to what has been added since 2.3.60 which could cause my issue?

andreas-doehler · November 29, 2024, 12:21pm

at the moment we are at version 2.3.66

These are the latest changes

2.3.60 - special agent fixed if user and password is used on CLI
2.3.61 - small naming change in metric translation
2.3.62 - implemented caching for sections - Attention if you monitor iLO4 please stay at 2.3.60 for the moment
2.3.63 - fixed some iLO specific HW/SW inventory problems - this release should also work with iLO4 now
2.3.64 - bug fixed in processing cache files
2.3.65 - temp folder creating bug fixed
2.3.66 - added discovery option for physical ports

What exactly have you done to get this error?
To get a better error message, you can start the special agent on the command line with addition of “–debug” as parameter.

philrandal · November 29, 2024, 2:54pm

OK, got past that list.remove error. I’ll try to debug further.

Edit:

Disabling the firmware check seems to resolve the issue. Why it worked previously but now doesn’t is the next thing to figure out.

philrandal · November 29, 2024, 3:40pm

Caching the Firmware check for 15 minutes now results in normal agent timeouts on the misbehaving hosts, which is more like what I would expect.

A subtle bug in the caching logic, maybe?

Whatever, I’m happy with disabling the Firmware check as a solution / workaround.

The affected hosts were all Dell R630s, R640s, and R760s so the issue correlates with an earlier post on this thread mentioning Dell firmware inventory.

andreas-doehler · November 29, 2024, 4:43pm

If it only happens at the firmware inventory then - the hardcoded timeout for firmware inventory is not enough on you systems.

starting with line 616

        if redfishobj.vendor_data.expand_string:
            firmwares = fetch_data(
                redfishobj,
                "".join(firmware_url),
                "FirmwareDirectory",
                timeout=10,
            )

you find the hardcoded value.
I will think about a solution for this problem.

Caching for Firmware you can set to 80000 or 90000 seconds that it will only be made one time per day.

philrandal · December 2, 2024, 4:54pm

Increasing the firmware timeout to 15 seconds made no difference to the results when caching is enabled.

The moment I enable caching, I get the original retries exhausted error.

The “ValueError: list.remove(x): x not in list” error, which I get after removing an entry from the disabled list, is resolved by restarting my monitoring server. Even an omd restart sitename doesn’t seem to clear it. Being more tolerant of the response from attempting to add or remove an item from the list would be the best workaround. If you can’t remove an item because it isn’t there, don’t error. Or when you add an item which is already in the list.

andreas-doehler · December 2, 2024, 7:37pm

Until now I had no luck to reproduce the error message.
For the timeout, I will add an debug option. With this option enabled (you can do this for one specific host) a debug log is written at every call of the special agent.
If you get such an retries exhausted error it would be nice if you can sent me such a debug log.

andreas-doehler · December 2, 2024, 8:06pm

Version 2.3.67 is available with debug option inside setup.