Agent registration hangs

CMK version: 2.1.0p19 CRE
OS version: Microsoft Windows 10 Pro 64-bit

It works fine on dozens of other clients I have tried, but on this specific laptop it gives:

PS C:\Program Files (x86)\checkmk\service> .\cmk-agent-ctl.exe register --hostname laptop.example.com --server cmk.example.com --site example --user automation --password ********
.\cmk-agent-ctl.exe : Attempting to register at cmk.example.com:8000/example. Server certificate details:
At line:1 char:1
+ .\cmk-agent-ctl.exe register --hostname laptop.example.com --s ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Attempting to r...ficate details::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

PEM-encoded certificate:
-----BEGIN CERTIFICATE-----
******************************************
**********CERTIFICATE BLOCK HERE**********
******************************************
-----END CERTIFICATE-----
Issued by:
    Site 'example' local CA
Issued to:
    example
Validity:
    From Mon, 23 Jan 2023 19:53:53 +0000
    To   Sat, 26 May 3021 19:53:53 +0000
Do you want to establish this connection? [Y/n]

I get to the prompt, but no > appears,
and then it just seems to hang.

After an (apparently smooth) upgrade of both the server and the client to 2.2.0p25 CRE,
I’m no longer seeing the errors at the top, and the prompt appears.
Then this happens:


Do you want to establish this connection? [Y/n]
> y
[2024-04-25 19:17:00.245355 +01:00] ERROR [cmk_agent_ctl] src\main.rs:29: Error registering existing host at https://cmk.example.com:8000/example

Caused by:
    Request failed with code 500 Internal Server Error: Internal Server Error
PS C:\Program Files (x86)\checkmk\service>

Even if the host is already registered, that shouldn’t result in a 500 error, correct?

Does the “automation” user exist in your site? This user also needs to be an automation account.

Yes, I’ve registered dozens of hosts with this user/password combination in the past.
Following the upgrade to 2.2.0p25 CRE, I’ve checked the user again. It still exists and has admin privileges.
I’ve changed the password and used it to log into WATO without a problem, but unfortunately the 500 error on registration persists.

I’ve tried looking at the logs but there is nothing under /opt/omd/sites/mysite/var/log/ with today’s timestamp.
That’s weird and might be related to the permission change prompts during the upgrade:

 Permission conflict at etc
  The  proposed permissions of etc  have changed from 0750  to 0751 in the new
  version,  but you have set 0755. May I use the new default permissions or do
  you want to keep yours?

  keep        Keep permissions at 0755
  default     Set permission to 0751
  shell       Open a shell for looking around
  abort       Stop here and abort update!

  k/d/s/a ==>  d

Same prompt and same answers for:
etc/omd
etc/omd/allocated_ports
etc/init.d/nsca
etc/nsca
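
For anyone comparing the three modes in that prompt, here is a minimal Python sketch that renders them as ls-style strings. The helper name describe_dir_mode is only for illustration; it uses nothing beyond the standard stat module.

```python
import stat


def describe_dir_mode(mode: int) -> str:
    """Render permission bits as an ls-style string for a directory."""
    # S_IFDIR marks the entry as a directory so the string starts with "d".
    return stat.filemode(stat.S_IFDIR | mode)


# The three modes named in the prompt: old default, new default, current setting.
for mode in (0o750, 0o751, 0o755):
    print(f"{mode:04o} -> {describe_dir_mode(mode)}")
```

Choosing "default" therefore drops read access for "others" on those paths but keeps the execute (traverse) bit.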

Looking at the upgrade log, this doesn’t look right:

 * Vanished       etc/ssl/certs
 ! Permission:    cannot change 0000 -> 0640 etc/ssl/certs: [Errno 2] No such file or directory: '/omd/sites/mscience/etc/ssl/certs'
 * Vanished       etc/ssl/private
 ! Permission:    cannot change 0000 -> 0640 etc/ssl/private: [Errno 2] No such file or directory: '/omd/sites/mscience/etc/ssl/private'

That’s the certificate directory used by agent registration, right?
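
To pull every such entry out of a longer upgrade log, something like the following Python sketch could be used. The two patterns are modeled only on the four lines quoted above, so the overall log format is an assumption, and scan_update_log is a hypothetical helper name.

```python
import re

# Patterns modeled on the quoted log lines; other line types are ignored.
VANISHED = re.compile(r"^\s*\*\s+Vanished\s+(?P<path>\S+)\s*$")
PERM_FAIL = re.compile(
    r"^\s*!\s+Permission:\s+cannot change\s+"
    r"(?P<old>\d{4})\s+->\s+(?P<new>\d{4})\s+(?P<path>[^:]+):"
)


def scan_update_log(lines):
    """Return (vanished_paths, failed_permission_changes) found in lines."""
    vanished, failed = [], []
    for line in lines:
        match = VANISHED.match(line)
        if match:
            vanished.append(match.group("path"))
            continue
        match = PERM_FAIL.match(line)
        if match:
            failed.append((match.group("path"), match.group("old"), match.group("new")))
    return vanished, failed
```

Passing it the open log file (or a list of lines) gives two lists that can be compared: a path that shows up in both was removed first and then failed the permission change for that reason.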

In /var/log/daemon.log I see:

Apr 26 10:17:01 cmk systemd[1]: Starting User Login Management...
Apr 26 10:17:01 cmk systemd[420288]: systemd-logind.service: Failed to set up mount namespacing: /run/systemd/unit-root/proc: Permission denied
Apr 26 10:17:01 cmk systemd[420288]: systemd-logind.service: Failed at step NAMESPACE spawning /lib/systemd/systemd-logind: Permission denied
Apr 26 10:17:01 cmk systemd[1]: systemd-logind.service: Main process exited, code=exited, status=226/NAMESPACE
Apr 26 10:17:01 cmk systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
Apr 26 10:17:01 cmk systemd[1]: Failed to start User Login Management.
Apr 26 10:17:01 cmk systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 5.
Apr 26 10:17:01 cmk systemd[1]: Stopped User Login Management.
Apr 26 10:17:01 cmk systemd[1]: modprobe@drm.service: Start request repeated too quickly.
Apr 26 10:17:01 cmk systemd[1]: modprobe@drm.service: Failed with result 'start-limit-hit'.
Apr 26 10:17:01 cmk systemd[1]: Failed to start Load Kernel Module drm.
Apr 26 10:17:01 cmk systemd[1]: systemd-logind.service: Start request repeated too quickly.
Apr 26 10:17:01 cmk systemd[1]: systemd-logind.service: Failed with result 'exit-code'.
Apr 26 10:17:01 cmk systemd[1]: Failed to start User Login Management.
Apr 26 10:17:06 cmk ntpd[124]: local_clock: ntp_loopfilter.c line 817: ntp_adjtime: Operation not permitted

My server runs in a Debian 11.6 LXC container.
I’ve found so many potential issues that I’m not sure where to start.

Automation password and permissions are definitely correct.

The 500 error only appears on Windows (tried 10 and 11 Pro). Registration runs fine on Linux clients.

Failures log the following (missing automation.secret file):

[2024-04-29 20:10:08 +0100] [300] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/omd/sites/mysite/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 487, in handle
    await self.app(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/omd/sites/mysite/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/omd/sites/mysite/lib/python3.11/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/site-packages/agent_receiver/endpoints.py", line 110, in register_existing
    _sign_agent_csr(
  File "/omd/sites/mysite/lib/python3.11/site-packages/agent_receiver/endpoints.py", line 88, in _sign_agent_csr
    internal_credentials(),
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/site-packages/agent_receiver/utils.py", line 76, in internal_credentials
    secret = (users_dir() / INTERNAL_REST_API_USER / "automation.secret").read_text().strip()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/pathlib.py", line 1058, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/omd/sites/mysite/var/check_mk/web/automation/automation.secret'
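
The missing file named in the last line can be checked for directly on the server. Below is a minimal Python sketch. The directory layout is inferred from that single path in the traceback, and the helper names and the omd_root parameter are assumptions for illustration, not part of Checkmk.

```python
from pathlib import Path


def automation_secret_path(site: str, user: str, omd_root: str = "/omd/sites") -> Path:
    """Build the secret path using the layout seen in the traceback (assumed)."""
    return Path(omd_root) / site / "var" / "check_mk" / "web" / user / "automation.secret"


def has_automation_secret(site: str, user: str, omd_root: str = "/omd/sites") -> bool:
    """Return True if the user's automation.secret file exists."""
    return automation_secret_path(site, user, omd_root).is_file()


if __name__ == "__main__":
    # Site and user names taken from the traceback above.
    print(automation_secret_path("mysite", "automation"))
    print(has_automation_secret("mysite", "automation"))
```

If the second line prints False for the user the agent receiver relies on, the FileNotFoundError above is the expected result.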

What’s the best way to fix it?

That is exactly what I mean. Your user named “automation” is not a real automation account. Setting a password is not allowed for automation accounts. Please use only the automation secret field.

It worked, thanks!

Questions:

  1. Why did it work on Linux and not on Windows before? Something to do with REST API changes?

  2. What’s the point of the “Agent registration user” role? If that’s the only one assigned to the “automation” user, then registrations fail with:

Request failed with code 403 Forbidden: Unauthorized - Details: Unauthorized to read the global settings

It goes away once I promote “automation” to Administrator.

The problem with the automation role is that it only works for TLS registration. For agent updater registration you need to grant 2 or 3 additional rights.
At the moment I cannot look up which rights are needed.

Ensure the role has the following permissions:

"Use the GUI at all"
"Register Host & download monitoring agents of your hosts"
"Register all hosts & download all monitoring agents"

I can find “Use the GUI at all” but not the other two.

Here is my full list of role permissions:

I always thought that the agent updater is only part of the bakery in the Enterprise / Cloud Edition?

I’m only interested in getting agent registrations to work, and Administrator level feels far too excessive for that.
So, given my current role permissions, which ones do I need to apply?

I misread the registration part. The extra rights that @aeckstein and I mentioned are only relevant for the Enterprise Edition and agent updater registration.

For TLS registration, the built-in role is sufficient for the account used at registration time. The internal automation user should not be used, as this user needs administrator rights. It is also not easy to see why your system had problems, as there were many bug fixes and changes to this function over time.