Agent-receiver worker crashing during registration

Hello,

I have an issue registering a host with a server: the agent-receiver on the server side crashes after about 15 to 20 seconds and then restarts.
The server and the host are on the same network, the same virtualization cluster.
The host can telnet to port 8000 of the server, and a curl request works (there is an issue with the certificate, but I don't think that is the cause, since I'm passing --trust-cert with the registration command).
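
For anyone who wants to repeat the telnet/curl check in a scriptable way, here is a minimal sketch in Python. It only times a TCP connect and a TLS handshake against the agent-receiver port; certificate verification is switched off on purpose, which is my assumption about how to mirror the certificate issue mentioned above. The function names `probe_tcp` and `probe_tls` are made up for this example, and the address is the one from the registration command.

```python
import socket
import ssl
import time


def probe_tcp(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe_tls(host, port, timeout=5.0):
    """Return (TLS version, seconds) after a handshake, or (None, seconds) on failure."""
    context = ssl.create_default_context()
    # Assumption: the certificate is not trusted by this host, so skip verification.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version(), time.monotonic() - start
    except OSError:  # ssl.SSLError and socket timeouts are OSError subclasses
        return None, time.monotonic() - start


if __name__ == "__main__":
    host, port = "10.44.251.1", 8000  # server address from the register command
    print("tcp:", probe_tcp(host, port))
    print("tls:", probe_tls(host, port))
```

If the TCP connect succeeds quickly but the TLS step times out, that would suggest the worker accepts the connection and then hangs, rather than a firewall problem.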
We use distributed monitoring and this host and this server are on site2 (site1 being the main one).

Error message:

Host side:

cmk-agent-ctl register --hostname webserver --server 10.44.251.1:8000 --site site2 --user automation --password <password> --trust-cert -v
INFO [cmk_agent_ctl] starting
INFO [cmk_agent_ctl] Loaded config from '"/etc/check_mk/cmk-agent-ctl.toml"', legacy pull 'LegacyPullMarker("/var/lib/cmk-agent/allow-legacy-pull")' exists
ERROR [cmk_agent_ctl] Error pairing with 10.44.251.1:8000/site2

Caused by:
    0: error sending request for url (https://10.44.251.1:8000/site2/agent-receiver/pairing): connection closed before message completed
    1: connection closed before message completed

Error on the server side in agent-receiver/error.log:

[2022-06-13 17:30:48 +0200] [3787746] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['0.0.0.0:8000']
  backlog: 2048
  workers: 1
  worker_class: uvicorn.workers.UvicornWorker
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /opt/omd/sites/site2
  daemon: True
  raw_env: []
  pidfile: /omd/sites/site2/tmp/run/agent-receiver.pid
  worker_tmp_dir: None
  user: 996
  group: 1000
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: /omd/sites/site2/var/log/agent-receiver/access.log
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: /omd/sites/site2/var/log/agent-receiver/error.log
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags:
  statsd_prefix:
  proc_name: None
  default_proc_name: agent_receiver.apps:main_app()
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7fe0c7f55550>
  on_reload: <function OnReload.on_reload at 0x7fe0c7f55670>
  when_ready: <function WhenReady.when_ready at 0x7fe0c7f55790>
  pre_fork: <function Prefork.pre_fork at 0x7fe0c7f558b0>
  post_fork: <function Postfork.post_fork at 0x7fe0c7f559d0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7fe0c7f55af0>
  worker_int: <function WorkerInt.worker_int at 0x7fe0c7f55c10>
  worker_abort: <function WorkerAbort.worker_abort at 0x7fe0c7f55d30>
  pre_exec: <function PreExec.pre_exec at 0x7fe0c7f55e50>
  pre_request: <function PreRequest.pre_request at 0x7fe0c7f55f70>
  post_request: <function PostRequest.post_request at 0x7fe0c7f62040>
  child_exit: <function ChildExit.child_exit at 0x7fe0c7f62160>
  worker_exit: <function WorkerExit.worker_exit at 0x7fe0c7f62280>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7fe0c7f623a0>
  on_exit: <function OnExit.on_exit at 0x7fe0c7f624c0>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: /omd/sites/site2/etc/ssl/agent_receiver_cert.pem
  certfile: /omd/sites/site2/etc/ssl/agent_receiver_cert.pem
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2022-06-13 17:30:48 +0200] [3787746] [INFO] Starting gunicorn 20.1.0
[2022-06-13 17:30:48 +0200] [3787746] [DEBUG] Arbiter booted
[2022-06-13 17:30:48 +0200] [3787746] [INFO] Listening at: https://0.0.0.0:8000 (3787746)
[2022-06-13 17:30:48 +0200] [3787746] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2022-06-13 17:30:48 +0200] [3787751] [INFO] Booting worker with pid: 3787751
[2022-06-13 17:30:48 +0200] [3787746] [DEBUG] 1 workers
[2022-06-13 17:30:49 +0200] [3787751] [INFO] Started server process [3787751]
[2022-06-13 17:30:49 +0200] [3787751] [INFO] Waiting for application startup.
[2022-06-13 17:30:49 +0200] [3787751] [INFO] Application startup complete.
[2022-06-13 17:31:35 +0200] [3787746] [CRITICAL] WORKER TIMEOUT (pid:3787751)
[2022-06-13 17:31:36 +0200] [3787746] [WARNING] Worker with pid 3787751 was terminated due to signal 9

There is nothing logged in agent-receiver.log or access.log, which seems strange to me, as if the process is crashing before it even accepts the HTTP request.
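
To put a number on how long each worker survives, the error.log can be scanned for the "Booting worker" and "WORKER TIMEOUT" lines shown above. The following is a rough sketch, not a Checkmk tool: the function name `worker_lifetimes` is made up, the regular expressions are derived only from the log excerpt in this post, and the timezone offset is ignored on the assumption that it is the same on every line.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2022-06-13 17:30:48 +0200] [3787746] [INFO] Starting gunicorn 20.1.0
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}\] "
    r"\[(?P<pid>\d+)\] \[(?P<level>\w+)\] (?P<msg>.*)$"
)
BOOT_RE = re.compile(r"Booting worker with pid: (\d+)")
TIMEOUT_RE = re.compile(r"WORKER TIMEOUT \(pid:(\d+)\)")


def worker_lifetimes(lines):
    """Map worker pid -> seconds between its boot line and its timeout line."""
    booted = {}
    lifetimes = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip the configuration dump and other non-timestamped lines
        when = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        boot = BOOT_RE.search(match["msg"])
        if boot:
            booted[boot.group(1)] = when
            continue
        timeout = TIMEOUT_RE.search(match["msg"])
        if timeout and timeout.group(1) in booted:
            pid = timeout.group(1)
            lifetimes[pid] = (when - booted[pid]).total_seconds()
    return lifetimes


if __name__ == "__main__":
    sample = [
        "[2022-06-13 17:30:48 +0200] [3787751] [INFO] Booting worker with pid: 3787751",
        "[2022-06-13 17:31:35 +0200] [3787746] [CRITICAL] WORKER TIMEOUT (pid:3787751)",
    ]
    print(worker_lifetimes(sample))  # {'3787751': 47.0}
```

Running it over the whole file (for example `worker_lifetimes(open(path))`) shows whether every worker dies after roughly the same interval, which would point to a fixed timeout rather than a random crash.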

CMK version: 2.1.0p2.cre
OS version: Ubuntu 20.04.4 LTS

This behavior doesn't happen on the main site, and I believe I opened the right network ports on both sides.
I did not find a way to put the agent-receiver into debug mode to get more logs to investigate; if anyone knows how to do this, please let me know.

It would be great if anybody has helpful tips, debugging tricks, or even the solution to this issue (I searched here and did not find it).

Thanks

Hi,

In case it helps, I see these log entries whenever the process crashes:

ERROR: apport (pid 2182321) Mon Jun 27 14:40:52 2022: called for pid 2178736, signal 6, core limit 0, dump mode 1
ERROR: apport (pid 2182321) Mon Jun 27 14:40:52 2022: script: /opt/omd/versions/2.1.0p3.cre/bin/gunicorn, interpreted by /opt/omd/versions/2.1.0p3.cre/bin/python3.9 (command line "python3 /omd/sites/site2/bin/gunicorn -D -p /omd/sites/site2/tmp/run/agent-receiver.pid --error-logfile /omd/sites/site2/var/log/agent-receiver/error.log --access-logfile /omd/sites/site2/var/log/agent-receiver/access.log --keyfile /omd/sites/site2/etc/ssl/agent_receiver_cert.pem --certfile /omd/sites/site2/etc/ssl/agent_receiver_cert.pem -b 0.0.0.0:8000 --log-level debug -k uvicorn.workers.UvicornWorker agent_receiver.apps:main_app()")
ERROR: apport (pid 2182321) Mon Jun 27 14:40:52 2022: is_closing_session(): no DBUS_SESSION_BUS_ADDRESS in environment
ERROR: apport (pid 2182321) Mon Jun 27 14:40:53 2022: wrote report /var/crash/_opt_omd_versions_2.1.0p3.cre_bin_gunicorn.996.crash

I checked, and DBUS_SESSION_BUS_ADDRESS is set to unix:path=/run/user/0/bus, which exists.

It turns out the problem was DNS-related: I had wrong entries in my hosts file.
I figured it out using strace.
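
For anyone landing here with the same symptom, a quicker first check than strace might be to time name resolution on the server. This is only a sketch under assumptions: I am not stating which lookup actually hung in my case, the helper names `timed_lookup` and `check_resolution` are made up, and the target list is just an example. On a typical Linux setup both calls go through the system resolver, so entries in the hosts file are taken into account.

```python
import socket
import time


def timed_lookup(label, func, *args):
    """Run one resolver call; return (label, seconds taken, result or error text)."""
    start = time.monotonic()
    try:
        result = func(*args)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result = f"lookup failed: {exc}"
    return label, time.monotonic() - start, result


def check_resolution(targets):
    """Time a forward and a reverse lookup for every hostname or IP in targets."""
    rows = []
    for target in targets:
        rows.append(timed_lookup(f"forward {target}", socket.getaddrinfo, target, None))
        rows.append(timed_lookup(f"reverse {target}", socket.gethostbyaddr, target))
    return rows


if __name__ == "__main__":
    # Example targets (assumption): the local hostname, localhost, and the server IP.
    targets = [socket.gethostname(), "localhost", "10.44.251.1"]
    for label, seconds, result in check_resolution(targets):
        print(f"{label:<30} {seconds:6.2f}s  {result}")
```

A lookup that takes many seconds, or that returns an address you do not expect, is a hint to review the hosts file and the resolver configuration.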