Service Discovery via API not working as expected

CMK version:
Check_MK version 2.1.0p23 CRE

OS version:
Ubuntu 20.04.5 LTS

Hi all
We recently switched from 1.6 to 2.1. In our infrastructure we use the API to automate the creation of hosts. We perform the following steps:

create_host
discover_services
activate_changes

Unfortunately, in the new version the discovery of services does not work reliably. Sometimes the services are discovered and sometimes not. We call the endpoints from a Python script via the requests library. We also tried different modes like fix_all or refresh.

If we issue the commands via python one by one by hand it all works fine. But as soon as we issue them from a script it stops working.

I saw that there have been problems with service discovery before. Is there any known issue with this endpoint that might explain our case?

Thanks and best regards
Sebastian

No ideas or similar observations on this one?

Do you have an example or a repro of the problem ?

We use a self-written Python API client to call the API endpoints.

We have a workflow where some virtual machines are set up automatically and added to CheckMK. All actions that need to be done in CheckMK are put into a task queue, which is then processed in batches.

Below is a test replica of the original logic to give you a picture of how the functions are called. The calls list in this case stands in for the task queue.

calls = [
    {
        'method': api.create_host,
        'params': {
            'agent_type': 'cmk-agent',
            'host_type': '<type>',
            'hostname': '<hostname>',
        }
    },
    {
        'method': api.discover_services,
        'params': {
            'hostname': '<hostname>',
            'mode': 'fix_all'
        }
    }
]

for method in calls:
    method['method'](**method['params'])

The method we use to create a host (this always works without problems):

def create_host(.....):
    [...]
    data = {
        'host_name': hostname,
        'folder': self.api_folder + '/' + host_type,
        'attributes': {
            'tag_<company-name>_host_type': host_type,
            'tag_agent': agent_type
        }
    }
    r = self._post_data("domain-types/host_config/collections/all", data)
    [...]
    # logic for setting parents, error handling etc.

The method to discover services for the host created before:

    def discover_services(self, hostname, mode='fix_all'):

        data = {
            "host_name": hostname,
            "mode": mode
        }
        r = self._post_data('domain-types/service_discovery_run/actions/start/invoke', data)

When used like this, the service discovery does not work most of the time. During our testing and debugging the discovery sometimes worked and sometimes it didn’t.

We saw that waiting a little after host creation before starting the service discovery helped in some tries, but not always. The results were a bit better when we used our API client directly from the CLI, but even there it did not work every time.

We also see all the API calls in the Apache log and could see that each call waits until the previous one has finished. The return codes are also always as expected (but the check_table field in the discovery response is empty).
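Since the empty check_table is the visible symptom, one way to make the queue more tolerant is to repeat the discovery call until services actually show up. The following is only a rough sketch: the helper name discover_with_retry is made up for illustration, and it assumes that the parsed discovery response carries check_table under an 'extensions' key, which should be verified against the real response.

```python
import time
from typing import Callable


def discover_with_retry(discover: Callable[[], dict], attempts: int = 5,
                        delay: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call discover() until its result has a non-empty check_table."""
    for attempt in range(1, attempts + 1):
        result = discover()
        # Assumption: the parsed response exposes 'check_table' under 'extensions'.
        if result.get("extensions", {}).get("check_table"):
            return result
        if attempt < attempts:
            # Linear backoff: wait a little longer after every failed try.
            sleep(delay * attempt)
    raise RuntimeError(f"no services discovered after {attempts} attempts")
```

In the client above, discover would be something like lambda: api.discover_services(hostname), provided that method returns the parsed JSON body.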

The only way service discovery works reliably is with cmk -Iv <hostname> on the monitoring server directly. This works every time, and we have now implemented a workaround that uses this fact.
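For illustration, a workaround along these lines could look roughly like the sketch below. It is not our actual implementation: the function names are invented here, and it assumes the script runs as the site user on the monitoring server so that cmk is on the PATH.

```python
import subprocess
from typing import List


def build_discovery_command(hostname: str) -> List[str]:
    """Return the argv for a verbose discovery of new services on one host."""
    if not hostname or hostname.startswith("-"):
        # Guard against a host name being parsed as a command-line option.
        raise ValueError(f"invalid host name: {hostname!r}")
    return ["cmk", "-Iv", hostname]


def run_discovery_cli(hostname: str, timeout: float = 300.0) -> str:
    """Run the discovery (assumed: as the site user) and return its output."""
    completed = subprocess.run(
        build_discovery_command(hostname),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems, and check=True turns a non-zero exit code into an exception, so a failed discovery does not go unnoticed in the queue.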

In version 1.6 we were using the WebAPI which always worked perfectly fine.

I would really appreciate some insight or hints on this issue.

Thank you and best regards

Any updates on this?

I am not able to reproduce your problem on 2.1.0p24. In my case, I just use the Python requests example available in the documentation. Can you share the output of your API calls that you make ?

Yeah, if you use the code from the API documentation on its own, it works. The problem seems to occur when we try to automate it and several requests are made in a row.

I agree with you that it does not make sense. That’s why I asked here, if there are any known problems with timing between creating a host and discovering its services…

Hi slehner, Hi community,

late reply on this, i know, but:

From my perspective we see here the following:
If you call the service_discovery_run endpoint with “fix_all” it’s nothing else than “Accept All” in the GUI.
But if there are no services discovered already, there’s nothing to Accept at all.

So i would suggest to call the endpoint with “refresh” first and afterwards with “fix_all”.
For that you have to be aware of two things:

  • “refresh” is a rescan in 2.2, but in 2.1 and before it’s tabula_rasa. With 2.2 there’s a new option to call tabula_rasa explicitly.
  • With “refresh” you have to handle the redirect to the “Wait for service discovery completion” endpoint properly.
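The sequence suggested above (first “refresh”, then “fix_all”, following the redirect in between) could be sketched roughly as follows. This is not taken from the official client or the Ansible collection. The send callable is a hypothetical transport that returns the HTTP status and the Location header, and the sketch assumes that a 302 means "still running, ask again at Location" and that 200 or 204 means finished. Please check both against the REST API docs of your version.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical transport: send(method, path, json_body) -> (status, location or None).
Send = Callable[[str, str, Optional[dict]], Tuple[int, Optional[str]]]

DISCOVERY_START = "domain-types/service_discovery_run/actions/start/invoke"


def run_discovery(send: Send, host_name: str, mode: str,
                  poll_interval: float = 1.0, max_polls: int = 60) -> None:
    """Start a discovery and follow redirects until the server stops redirecting."""
    status, location = send("POST", DISCOVERY_START,
                            {"host_name": host_name, "mode": mode})
    polls = 0
    # Assumption: 302 means the background job is still running.
    while status == 302 and location:
        if polls >= max_polls:
            raise TimeoutError(f"discovery ({mode}) for {host_name} did not finish")
        time.sleep(poll_interval)
        status, location = send("GET", location, None)
        polls += 1
    if status not in (200, 204):
        raise RuntimeError(f"discovery ({mode}) for {host_name} failed: HTTP {status}")


def discover_and_accept(send: Send, host_name: str, **kwargs) -> None:
    """Rescan the host first, then accept everything that was found."""
    run_discovery(send, host_name, "refresh", **kwargs)
    run_discovery(send, host_name, "fix_all", **kwargs)
```

With the requests library, a send wrapper would need to pass allow_redirects=False so that the redirect is visible to the polling loop instead of being followed automatically.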

At this point i have to do some advertising for our Ansible-Collection.
We just released a new version with the discovery-module handling the redirect and also with the bulk discovery feature implemented. :wink:

KR,
Max

edit: refresh in 2.1 and before is, like in 2.2, just a rescan. :wink:

This is not correct: in 2.1 “refresh” is also only a rescan and you need to issue a “fix_all” after the “refresh”.
In 2.2 the “tabula_rasa” is a big help as it combines both.

Not sure about this.
This is the text in the REST-API Docs:

`refresh` - Update information from host, then remove all existing and add all just found services and host labels

At least sounds like more than just a refresh. :wink:
Did you see that it just does a rescan?
That would be interesting.

It sounds like, but it is not doing this.

Yes, today i had a big system migrating from 2.0 to 2.1 and then 2.2.
I tested the behavior of the service discovery on all three versions.
2.0 & 2.1 → “refresh” and “fix_all” is needed
2.2 → “tabula_rasa” is enough
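A client that has to talk to several versions could encode this observation in a small helper. The sketch below is only an illustration of the finding above. The function name discovery_modes is made up, and the version parsing assumes strings of the form "2.1.0p23".

```python
from typing import List


def discovery_modes(version: str) -> List[str]:
    """Return the discovery modes to call, in order, for a Checkmk version string."""
    # Only major and minor matter here, e.g. "2.1.0p23" -> (2, 1).
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (2, 2):
        return ["tabula_rasa"]
    return ["refresh", "fix_all"]
```

For example, discovery_modes("2.1.0p23") yields the two-step sequence, while discovery_modes("2.2.0") yields the single tabula_rasa call.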

Hi Andreas,

thanks for the hint!
We found out that the description should have changed in 2.0 and 2.1 already. But it just went into 2.2. :wink:

KR,
Max

Hi all,

we are trying to do a service discovery in 2.1.0p34.

The refresh option of the service discovery gives us the information that the Service discovery background job is initialized.
When we try to get the result of the job via the Endpoint /objects/service_discovery_run/{host_name} it still keeps saying that the job is initialized:

I also can’t see the Background job in “Background jobs overview”.

It seems that the Refresh option does not start the Discovery for some reason.

Have you seen a similar behaviour in 2.1 or a solution for this?

Best Regards
Thomas

We have found out that this is related to the distributed monitoring.

When the agent is on the central site everything is working fine.

When the agent is on a remote site, the Endpoint /objects/service_discovery_run/{host_name} only tells us that the job is initialized, BUT it is running in the background and working. You just can’t get the state…

like i wrote here

I think it would be also good to wait some seconds between the two commands.

Hi Thomas,

we are aware of this problem and will fix it with one of the next versions.
Problem is that with the options “refresh” and “tabula_rasa” you will be redirected to the service completion endpoint. This endpoint can’t find the background job if it’s running on a remote site.

KR,
Max

Dear all,

sorry for the delay!
For this fix we had to work a little more in the background.
But with the next version the redirect should work:

KR,
Max
