Service Discovery via REST API not working as expected

brandons · May 12, 2023, 9:13pm

Hi Checkmk users,

It seems I’m encountering an issue similar to what some others have reported on 2.1.0p26: service discovery via the REST API does not reliably move services to the monitored state.

Reference to another case: Service Discovery via API not working as expected

We have been using the older Web API for the last couple of years, with no issues, to automate adding our new servers.

I have been working this week on a rewrite for the new REST API, as we recently upgraded from 2.0.0p21 to 2.1.0p26.

Our new REST API script for adding servers is basically working, except that the discovery step does not reliably move the services to a monitored state.

I’m using a simple bash script with curl, and so far discovery is the only part giving me trouble.

I am familiar with the process, and by now also with basic interaction with the new REST API.

I am following the same workflow as before, and I’ve reviewed the built-in and online API docs and examples:

create_host
discover_services
activate_changes
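A minimal sketch of the three-step workflow above, rendered as curl commands since the script in question uses curl. The endpoint paths and request bodies are assumptions based on the Checkmk REST API as I understand it and should be checked against the REST API documentation built into your site. The server name, site name, user, secret, and the helper names `build_workflow` and `to_curl` are hypothetical placeholders.

```python
import json
import shlex

# Hypothetical base URL: replace "myserver" and "mysite" with your own values.
BASE_URL = "https://myserver/mysite/check_mk/api/1.0"


def build_workflow(host_name, ip_address, mode, folder="/", site="mysite"):
    """Return the three calls, in order, as (method, url, body) tuples.

    The paths and body fields are assumptions; verify them for your version.
    """
    return [
        # 1. create_host
        ("POST", f"{BASE_URL}/domain-types/host_config/collections/all",
         {"folder": folder, "host_name": host_name,
          "attributes": {"ipaddress": ip_address}}),
        # 2. discover_services (mode is passed through unchanged)
        ("POST",
         f"{BASE_URL}/domain-types/service_discovery_run/actions/start/invoke",
         {"host_name": host_name, "mode": mode}),
        # 3. activate_changes
        ("POST",
         f"{BASE_URL}/domain-types/activation_run/actions/activate-changes/invoke",
         {"redirect": False, "sites": [site], "force_foreign_changes": False}),
    ]


def to_curl(method, url, body, user="automation", secret="SECRET"):
    """Render one call as a shell-safe curl command line."""
    return " ".join([
        "curl", "-sS", "-X", method,
        "-H", shlex.quote(f"Authorization: Bearer {user} {secret}"),
        "-H", shlex.quote("Content-Type: application/json"),
        "-H", shlex.quote("Accept: application/json"),
        "-d", shlex.quote(json.dumps(body)),
        shlex.quote(url),
    ])
```

For example, `for step in build_workflow("web01", "192.0.2.10", "new"): print(to_curl(*step))` prints the three commands in the order they would run. Nothing here waits for the discovery background job, so a real script would still need a delay or a status check between steps 2 and 3.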

I’ve tried the refresh, new, and fix_all modes for the discovery, both separately and, as I mention below, back-to-back.

I have tested putting a 1-2 minute sleep right after the discovery call in my script, to be sure the background discovery completes for the new host.

I then activate, and all the services end up unmonitored.

I have even scripted discovery and activation to run back-to-back twice (first with 'new' and the second time with 'fix_all'), with a delay each time between discovery and activation.

The Web API was pretty reliable for this compared to how the REST API is behaving.

Here is an example run of my script where the host does not yet exist in Checkmk at all, and I try discover/activate twice (even though twice should not be required).

200 OK - host created
200 OK - host discovery
Waiting 60 seconds for checkmk background discovery process to complete (mode=new)
200 OK - activate changes
200 OK - host discovery
Waiting 60 seconds for checkmk background discovery process to complete (mode=fix_all)
200 OK - activate changes

I noticed that if I run the entire script a second time, it errors on the host already existing but continues, runs discovery again, and this time succeeds in moving the unmonitored services to monitored. From that result, it seems like some larger time delay is needed?

We are not in any rush, but we want to switch to the REST API for our server adds so we are not stuck on 2.1, since I think 2.2 removes the old Web API.

Might anyone have any advice for this issue?

CMK version:
2.1.0p26

OS version:
AlmaLinux 8
4.18.0-425.19.2.el8_7.x86_64

Error message:
43 unmonitored services (postfix_mailq_status:1, local:13, nfsmounts:2, diskstat:1, md:2, ipmi:1, checkmk_agent:1, lnx_thermal:2, logwatch:3, cpu_threads:1, mem_linux:1, cpu_loads:1, ps:1, tcp_conn_stats:1, uptime:1, kernel_performance:1, systemd_units_services_summary:1, postfix_mailq:1, df:2, chrony:1, kernel_util:1, lnx_if:2, mounts:2)WARN, no vanished services found, 1 new host labelsWARN

Anders · May 14, 2023, 5:26pm

Not sure about the error message; it does seem to be a valid message.
If you do the process manually in the GUI does it work then?

We use the API call that checks the status of the service discovery and just use a for loop until the discovery is complete before we activate changes.
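A polling loop of this kind could look roughly like the sketch below. It assumes the status call is `GET /objects/service_discovery_run/{host_name}` and that the JSON response carries a `state` field, either at the top level or under `extensions`. Both are assumptions to verify against your site's REST API documentation. The function names, timeout, and interval are hypothetical.

```python
import json
import time
import urllib.request


def fetch_job_http(host_name, base_url, user, secret):
    """Fetch the last discovery background job for a host (assumed endpoint)."""
    req = urllib.request.Request(
        f"{base_url}/objects/service_discovery_run/{host_name}",
        headers={
            "Authorization": f"Bearer {user} {secret}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def job_state(job):
    """Read 'state' from the top level or from 'extensions' (assumed layout)."""
    return job.get("extensions", job).get("state")


def wait_for_discovery(fetch_job, host_name, timeout=120.0, interval=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job(host_name) until the job state is 'finished'.

    Raises TimeoutError if the job has not finished within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job(host_name)
        if job_state(job) == "finished":
            return job
        if clock() >= deadline:
            raise TimeoutError(f"discovery for {host_name} did not finish")
        sleep(interval)
```

A caller would pass something like `lambda h: fetch_job_http(h, base_url, user, secret)` as `fetch_job` and only activate changes after `wait_for_discovery` returns. Taking the fetch function as a parameter keeps the loop testable without a live site.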

brandons · May 15, 2023, 6:51pm

Hi Anders and Checkmk users,

Thanks for your reply. Yes, adding hosts (including this test host) through the GUI works fine and discovers all services in 1-2 seconds; I’ve tested this repeatedly. The test host is an empty Linux node on the same LAN (physically close). After choosing accept-all and activating manually in the GUI, all the services reliably go to a monitored state, as expected.

To clarify, are you referring to the REST API option ‘Show the last service discovery background job on a host’?

If so, yes, I had already thought of using the same kind of loop to monitor the job state. I had already written a separate script just to check it, and when I ran it during the delay before activation, it already showed a completed result, so the loop did not seem necessary, at least in my initial testing.

Example response I was seeing from ‘Show the last service discovery background job on a host’:

…"state": "finished", "logs": {"result": [], "progress": ["Starting job…", "Completed."

This failure seems to occur only when scripting these steps through the REST API. As I mentioned, I have tested an intentionally long delay of 60 seconds between discovery and activation, and it still fails to move the discovered services to the monitored state. The Web API works as expected by comparison.

slehner · May 22, 2023, 7:08am

Hi there

You seem to have the same issue with the REST API we are experiencing. I did not get it to run as intended yet.

Maybe a heads-up to get it running: we have implemented a workaround where we check for undiscovered services via Livestatus and then run a script that does cmk -Iv for all hosts that have undiscovered services. This might be interesting for you.
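A rough sketch of what such a workaround might look like, not the actual script described above. It assumes that hosts with undiscovered services can be found by querying Livestatus for "Check_MK Discovery" services that are not OK, which may also match hosts with vanished services. The socket path (typically `tmp/run/live` under the site directory) and all function names are assumptions.

```python
import json
import socket

# Assumed query: hosts whose "Check_MK Discovery" service is not OK.
DISCOVERY_QUERY = (
    "GET services\n"
    "Columns: host_name\n"
    "Filter: description = Check_MK Discovery\n"
    "Filter: state != 0\n"
    "OutputFormat: json\n"
    "\n"
)


def query_livestatus(query, socket_path):
    """Send one query to the Livestatus UNIX socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of query
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def hosts_from_response(raw):
    """Extract a sorted, de-duplicated host list from a JSON reply."""
    rows = json.loads(raw) if raw.strip() else []
    return sorted({row[0] for row in rows})


def discovery_commands(hosts):
    """Build one 'cmk -Iv <host>' argument list per host."""
    return [["cmk", "-Iv", host] for host in hosts]
```

Run as the site user, each list from `discovery_commands(...)` could be passed to `subprocess.run`. A core reload (for example `cmk -O`) is probably still needed afterwards for the new services to show up in monitoring.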

We would be really glad if the API worked as described. At least for us, this is not the only case where it does not work as expected.

Cheers Sebastian

robin.gierse · May 22, 2023, 11:17am

Hi guys!
I experienced something that sounds similar to this with the discovery module of our Ansible Collection. What might help you is to look at the possible options for the domain-types/service_discovery_run/actions/start/invoke endpoint. fix_all uses cached data, which is intentional, so it can lead to your situation, where you end up with a successful discovery and no services added. With refresh you should be able to avoid your issue. In Checkmk 2.2 we improved the endpoint further, so that you can trigger the collection of fresh monitoring data from the data source.
We are, however, looking into this further and will give you an update if we find something relevant.

brandons · May 29, 2023, 6:48pm

Hi Sebastian, I may consider the Livestatus suggestion as a workaround.

Hi Robin, I will look into your suggestion too once I get back to this.

For the moment I’ve been re-tasked, and I will watch for further updates that may be relevant to this issue.

I appreciate your suggestions.

Brandon