Autodiscovery does not work as expected

Hello

I'm running CRE 2.0.0p7.
I define a new host, and all of its services should be added automatically after a while.
I defined:
Bulk discovery

Current setting: Mode “Add unmonitored services and new host labels”
Selection: x Include all subfolders
Performance options: x Do a full service scan
Number of hosts to handle at once: 10
Error handling: x Ignore errors in single check plugins

Factory setting Mode: Add unmonitored services and new host labels
Selection: on, off, off, off
Performance options: on, 10
Error handling: on
Current state: This variable is at factory settings.

What I’m doing:

  1. I create a new host via the REST API. The host will be contacted via a DNS name.
  2. The DNS name is added dynamically, and it takes about 20 minutes until the new name becomes active.
  3. Discovery takes place every 2 hours.
  4. I expect the new services to show up in CMK within the next 4 hours - but they don't.
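For reference, step 1 can be sketched against the Checkmk 2.0 REST API roughly like this (a minimal sketch - server name, site, automation user, and secret are placeholders, and only the request is assembled here, with no network I/O):

```python
import json

def build_create_host_request(server, site, user, secret, host_name, folder="/"):
    """Assemble URL, headers, and JSON body for the Checkmk REST API
    host-creation call (no network I/O happens here)."""
    url = (f"https://{server}/{site}/check_mk/api/1.0/"
           "domain-types/host_config/collections/all")
    headers = {
        "Authorization": f"Bearer {user} {secret}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "folder": folder,
        "host_name": host_name,
        # host tags, IP address, etc. would go into "attributes" as needed
        "attributes": {},
    })
    return url, headers, body

# Sending it could then be done with urllib (or Ansible's uri module):
# import urllib.request
# url, headers, body = build_create_host_request("cmkserver", "inseo",
#                                                "automation", "SECRET",
#                                                "ted-g0007")
# req = urllib.request.Request(url, data=body.encode(), headers=headers)
# urllib.request.urlopen(req)
```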


CMK recognizes that there are new services, but they are not registered to this host. The host itself is shown as DOWN. Downloading the agent output works and shows the full content.

Why are the services not automatically registered? I would expect this with the above settings.
Regards Robert

If your host is reachable and the Check_MK service reports no error, then I would look for a problem with the discovery. But these two points should be solved first.

Hello

As you see, the host is reachable by CMK and I receive a valid agent output.


With WATO I can also run a discovery, and all the services are shown as currently not monitored. But I don't want to do that now; I want these services to be autodiscovered.

This kind of host is currently a test machine on which we are developing a special roll-out procedure. In the near future we will roll out nearly 100 machines of this kind with an automated process. One step will be this autodiscovery.

The first picture shows that there are 24 new services. The CRIT state shown above is not accurate. I ran the "download agent output" function from that red item.

How is this host configured? The host state itself is shown as Offline, and the Check_MK service also has a problem.

Hi
This situation is very strange.
I compared the definitions with a working one → found no difference
I did the following test:

  • I removed the definition
  • I defined the same host again (ted-g0007) with Ansible using the REST API
  • I defined the host a 2nd time (ted-g0007_ans2) with Ansible using the REST API. This is now a brand-new host which was never defined before.
  • I defined the same host using WATO (ted-g0007_wato)
  • I defined this host again by copying a working one (ted-g0007_copy)

All hosts are shown as down. Hmmm?
(screenshot)

On the command line this host is responding:

OMD[inseo]:~$ cmk TED-g0007
[agent] Version: 1.5.0p7, OS: busybox, execution time 2.4 sec | execution_time=2.370 user_time=0.030 system_time=0.010 children_user_time=0.020 children_system_time=0.010 cmk_time_ds=2.290 cmk_time_agent=0.000


OMD[inseo]:~$ cmk -d TED-g0007
<<<check_mk>>>
Version: 1.5.0p7
AgentOS: busybox
Hostname: RAK7249
AgentDirectory: /etc/check_mk
DataDirectory: /var/lib/check_mk_agent
SpoolDirectory: /var/lib/check_mk_agent/spool
PluginsDirectory: /usr/lib/check_mk_agent/plugins
LocalDirectory: /usr/lib/check_mk_agent/local
<<<df>>>
rootfs                    4672       444      4228  10% /
/dev/root                 9984      9984         0 100% /rom


OMD[inseo]:~$ cmk --check-discovery TED-g0007
24 unmonitored services (cpu_threads:1, tcp_conn_stats:1, mounts:3, df:4, kernel_performance:1, mem_linux:1, local:2, uptime:1, cpu_loads:1, kernel_util:1, systemtime:1, lnx_if:7)(!), no vanished services found, no new host labels, rediscovery scheduled
unmonitored: cpu_threads: Number of threads
unmonitored: tcp_conn_stats: TCP Connections
unmonitored: mounts: Mount options of /rom
unmonitored: mounts: Mount options of /overlay
unmonitored: mounts: Mount options of /mnt/mmcblk0p1
unmonitored: df: Filesystem /tmp
unmonitored: df: Filesystem /overlay

OMD[inseo]:~$ cmk -D TED-g0007

TED-g0007                                                                      
Addresses:              TED-g0007.inseo.ddns
Tags:                   [Betriebssystem:other], [Kunde:cted], [SSH_Agent:yes], [Systemtyp:IoT], [Website_String_Check:Kontakt], [address_family:ip-v4-only], [agent:cmk-agent], [criticality:prod], [ip-v4:ip-v4], [networking:lan], [piggyback:auto-piggyback], [site:inseo], [snmp_ds:no-snmp], [tcp:tcp]
Labels:                 [cmk/os_family:busybox]
Host groups:            check_mk
Contact groups:         check-mk-notify
Agent mode:             Normal Checkmk agent, or special agent if configured
Type of agent:          
  Program: ssh -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no checkmk@TED-g0007.inseo.ddns
  Process piggyback data from /omd/sites/inseo/tmp/check_mk/piggyback/TED-g0007
Services:
  checktype item params description groups
  --------- ---- ------ ----------- ------

I don’t see a reason why it is shown as down. If I go into WATO, the services can be discovered.
Next I ran cmk -I TED-g0007, and some time later the services were shown for TED-g0007.
(screenshot)

The 3 remaining hosts are still reported as down. Discovery recognizes that there are unmonitored services, but they never get monitored automatically.

I upgraded to 2.0.0p9, but the result is the same.
Do you have any ideas why they are not monitored automatically?

If you create a host with the API, then it is good to also trigger a service discovery over the API. If the host is created inside the GUI, then likewise: why not trigger a service discovery directly from the GUI?
From the command line, the correct command for a discovery is "cmk -II hostname" or "cmk -I hostname".
Looking at the screenshots of your hosts, I would say again: these are not normally configured hosts. What is the host check command for these hosts? Why is every host without services shown as down?
If a host is down, then no automatic service discovery will run either.
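Since the dynamic DNS name is the blocker in this scenario, one option is to have the roll-out process wait until the name resolves and only then trigger the discovery. A sketch of that idea (the host and DNS names are placeholders; the commented-out `cmk` calls are the commands mentioned above):

```python
import socket
import time

def wait_for_dns(name, timeout=1800, interval=60):
    """Poll until `name` resolves; return True once DNS is ready,
    False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            socket.gethostbyname(name)
            return True
        except socket.gaierror:
            time.sleep(interval)
    return False

# Once the dynamic DNS entry exists, run the discovery from the site user:
# import subprocess
# if wait_for_dns("ted-g0007.inseo.ddns"):
#     subprocess.run(["cmk", "-II", "ted-g0007"], check=True)
```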

Hello
Thanks for your reply. But I have further questions.
What do you mean by "these are not normally configured hosts"? As I said, I configured the host via Ansible with the REST API as well as via WATO. Yes, I know that if the host is down, there will be no discovery. I'm still wondering why it is down. As you see, the one with discovered services is UP while the other ones are down at the same time.
I cannot force a service discovery, because the machine is configured by an Ansible playbook. The host will be checked through a VPN tunnel and a dynamic DNS name. First the new machine has to reboot and establish the VPN tunnel - no big deal. Next the new machine needs to register its DNS name. In our environment it takes quite some time until the DNS name is known to CMK. So at the time when Ansible registers this new host, there is no DNS resolution yet, and a service discovery would fail. This is the reason why I want to discover the services automatically at a later time.
I use SSH as an individual agent (as you see), and I defined "use state of CMK agent" as the host check command. Is this the right configuration? Or do I need some further setting to get the host state?

That's the reason why I said that these are not normal hosts. I think the host check command has been changed for these hosts. They don't have ping or smart ping as the host check command.

That's the problem. You cannot use the state of a non-existent service. There is a loop in your configuration:
no Check_MK service until the first discovery → no automatic discovery because the host is down → the host stays down until the Check_MK service appears → start at the beginning
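The cycle can be illustrated with a toy model (plain Python, not Checkmk code): the host state mirrors the Check_MK service, and discovery only runs on hosts that are up.

```python
def cycle(host_up, has_checkmk_service):
    """One scheduling round of the loop described above (simplified model)."""
    if host_up:                        # automatic discovery only runs when up
        has_checkmk_service = True
    host_up = has_checkmk_service      # host check = state of Check_MK service
    return host_up, has_checkmk_service

state = (False, False)                 # freshly created host, never discovered
for _ in range(10):
    state = cycle(*state)
print(state)   # still (False, False): the host never leaves DOWN on its own
```

A single manual discovery, or any host check that does not depend on the service, flips both flags and breaks the loop.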


Nice explanation from @andreas-doehler

To break that loop, perhaps you could implement an additional manual "ssh" check and use this service for the host check. Just connect to port 22 and perhaps compare the version string received.

Since your Check_MK service will use ssh later on anyway, this does not really give any additional monitoring insight. But it might help prevent the deadlock you are experiencing.
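The underlying idea - connect to port 22 and read the version string - looks roughly like this in Python. This is a sketch only; in Checkmk itself you would typically configure an active check rule (e.g. check_ssh) instead. The dummy server below exists only so the example runs without a real SSH daemon:

```python
import socket
import threading

def read_ssh_banner(host, port=22, timeout=5.0):
    """Connect and return the server's SSH identification string,
    e.g. "SSH-2.0-OpenSSH_8.4"; raise on an unexpected reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        banner = sock.recv(255).decode("ascii", "replace").strip()
    if not banner.startswith("SSH-"):
        raise ValueError(f"unexpected banner: {banner!r}")
    return banner

# Dummy stand-in for sshd so the sketch is self-contained:
def _dummy_ssh_server():
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        conn.sendall(b"SSH-2.0-OpenSSH_8.4\r\n")
        conn.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]

port = _dummy_ssh_server()
banner = read_ssh_banner("127.0.0.1", port)
print(banner)   # SSH-2.0-OpenSSH_8.4
```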

Hello
Thanks for the reply. The situation is solved now, and it works as expected.
Same here - I didn't think about that loop either.
We implemented ping as the host check, which required some effort in the firewalls. I considered a second SSH check, but we decided to use ping because it gives slightly better monitoring.
Again, many thanks