Ctl agent crashes every day same time on multiple hosts

CMK version: - 2.2.0p17/22
OS version: Ubuntu 22.04
Hi, since last week every day at 15:30 I get communication failed error [111] on multiple random hosts, after a few minutes most of the errors goes away, some of the hosts stays with it.

Last week I also upgraded the checkmk server from 2.2.0p17 to 2.2.0p22 and the agents to 2.2.0p22 from 2.1.0p13.
At first I tried downgrading the agents to 2.2.0p17 and then I also downgraded the checkmk server to 2.2.0p17, didn’t chagne anything.

By looking at the agent logs on 1 of the affected hosts I can see there is some kind of ctl shutdown and internal reinstalltion of internal components which causes issue with the TLS registration.
So after every time I have to remove and rergister the TLS.

Log snippet:

[Warn ] Object 'Win32_PerfRawData_MSExchangeISStore_MSExchangeISStore' in 50ms sends NO DATA
2024-02-26 15:30:36.670 [srv 6116] [Warn ] Sub Section 'msexch_isstore' has no data to provide, status = [2]
2024-02-26 15:30:36.691 [srv 6116] [Warn ] Object 'Win32_PerfRawData_MSExchangeRpcClientAccess_MSExchangeRpcClientAccess' in 20ms sends NO DATA
2024-02-26 15:30:36.692 [srv 6116] [Warn ] Sub Section 'msexch_rpcclientaccess' has no data to provide, status = [2]
2024-02-26 15:30:36.693 [srv 6116] [Warn ] Section 'msexch' cannot provide data
2024-02-26 15:30:36.693 [srv 6116] [Trace] Sending data 'msexch' id is [3063337078600] length [0]
2024-02-26 15:30:36.694 [srv 6116] perf: Section 'msexch' took [655] milliseconds
2024-02-26 15:30:36.695 [srv 6116] Received [342] bytes from 'agent_plugins'
2024-02-26 15:30:36.710 [srv 6116] [Trace] Sending data 'ps' id is [3063337078600] length [13750]
2024-02-26 15:30:36.711 [srv 6116] perf: Section 'ps' took [707] milliseconds
2024-02-26 15:30:36.721 [srv 6116] Received [867] bytes from 'logwatch'
2024-02-26 15:30:36.761 [srv 6116] Received [128] bytes from 'skype'
2024-02-26 15:30:36.762 [srv 6116] Object 'Win32_PerfRawData_W3SVC_WebService' in 733ms sends [2731] bytes
2024-02-26 15:30:36.763 [srv 6116] [Trace] Sending data 'wmi_webservices' id is [3063337078600] length [2777]
2024-02-26 15:30:36.764 [srv 6116] perf: Section 'wmi_webservices' took [735] milliseconds
2024-02-26 15:30:36.794 [srv 6116] Received [4765] bytes from 'dotnet_clrmemory'
2024-02-26 15:30:36.830 [srv 6116] Received [128] bytes from 'msexch'
2024-02-26 15:30:36.860 [srv 6116] Received [13878] bytes from 'ps'
2024-02-26 15:30:36.902 [srv 6116] Received [2905] bytes from 'wmi_webservices'
2024-02-26 15:30:37.649 [srv 6116] Initiating stop routine...
2024-02-26 15:30:37.650 [srv 6116] Stop Service called
2024-02-26 15:30:37.650 [srv 6116] [Trace] Stop request is set
2024-02-26 15:30:37.651 [srv 6116] [Trace] main Wait Loop END
2024-02-26 15:30:37.652 [srv 6116] Shutting down IO...
2024-02-26 15:30:37.652 [srv 6116] [Trace] Stopping execution
2024-02-26 15:30:37.657 [srv 6116] perf:  In [1504] milliseconds process 'powershell.exe -NoLogo -NoProfile -ExecutionPolicy Bypass -File "C:\ProgramData\checkmk\agent\plugins\iis_app_pool_state.ps1"' pid:[6284] SUCCEDED - generated [34] bytes of data in [1] blocks
2024-02-26 15:30:37.658 [srv 6116] [Trace] Provider 'plugins' is about to be started, id '3063337078600' port [mail:\\.\mailslot\Global\WinAgent_0]
2024-02-26 15:30:37.659 [srv 6116] [Trace] Sending data 'plugins' id is [3063337078600] length [47]
2024-02-26 15:30:37.660 [srv 6116] perf: Section 'plugins' took [0] milliseconds
2024-02-26 15:30:37.661 [srv 6116] Breaking Answer on stop
2024-02-26 15:30:37.661 [srv 6116] perf: Answer is ready in [1685] milliseconds
2024-02-26 15:30:37.688 [srv 6116] Received [175] bytes from 'plugins'
2024-02-26 15:30:46.795 [srv 6116] Send [38717] bytes of data to [Global\WinAgentCtl_5348_5524_0] - OK
2024-02-26 15:30:46.888 [ctl:5348] [cmk_agent_ctl::modes::pull][DEBUG] processed task!
2024-02-26 15:30:46.919 [ctl:5348] [cmk_agent_ctl::modes::pull][WARN] [::ffff:172.16.11.50]:53492: Request failed. (unexpected error: no server certificate chain resolved)
2024-02-26 15:30:47.310 [srv 6116] Exiting process queue
2024-02-26 15:30:47.312 [srv 6116] IO ends...
2024-02-26 15:30:47.313 [srv 6116] Thread is stopped
2024-02-26 15:30:47.313 [srv 6116] Processing dir 'C:\ProgramData\checkmk\agent\bin'
2024-02-26 15:30:47.343 [srv 6116] Killing process 'C:\ProgramData\checkmk\agent\bin\cmk-agent-ctl.exe'
2024-02-26 15:30:47.369 [srv 6116] [Warn ] error deleting controller
2024-02-26 15:30:47.577 [srv 6116] Controller is deleted
2024-02-26 15:30:47.591 [srv 6116] Cleanup called by service
2024-02-26 15:30:47.592 [srv 6116] Processing dir 'C:\ProgramData\checkmk\agent'
2024-02-26 15:30:47.649 [srv 6116] Killed [0] processes from the user folder
2024-02-26 15:30:47.650 [srv 6116] Clean on exit was not requested, not uninstall sequence
2024-02-26 15:30:47.651 [srv 6116] Service is stopped normally
2024-02-26 15:30:47.653 [srv 6116] COM closed
2024-02-26 15:30:52.777 [srv 7596] [Trace] Enabled Base
2024-02-26 15:30:52.791 [srv 7596] [Trace] Setting root. service: 'CheckMkService', preset: ''
2024-02-26 15:30:52.791 [srv 7596] [Trace] Try service: 'CheckMkService'
2024-02-26 15:30:52.792 [srv 7596] [Trace] Try registry 'CheckMkService'
2024-02-26 15:30:52.793 [srv 7596] [Trace] Service is found 'C:\Program Files (x86)\checkmk\service\check_mk_agent.exe'
2024-02-26 15:30:52.794 [srv 7596] Set root 'C:\Program Files (x86)\checkmk\service' from registry 'CheckMkService'
2024-02-26 15:30:52.794 [srv 7596] [Trace] Try registry 'CheckMkService'
2024-02-26 15:30:52.795 [srv 7596] Protect file from User write 'C:\Program Files (x86)\checkmk\service\check_mk_agent.exe'
2024-02-26 15:30:52.795 [srv 7596] Protect path from User access 'C:\Program Files (x86)\checkmk\service'
2024-02-26 15:30:52.804 [srv 7596] [Trace] Using root = 'C:\Program Files (x86)\checkmk\service' and data = 'C:\ProgramData\checkmk\agent' folders 
2024-02-26 15:30:52.817 [srv 7596] COM Initialized
2024-02-26 15:30:52.817 [srv 7596] COM initialized
2024-02-26 15:30:52.818 [srv 7596] Found root config on path C:\Program Files (x86)\checkmk\service\check_mk.yml
2024-02-26 15:30:52.839 [srv 7596] [Trace] Enabled Debug
2024-02-26 15:30:52.839 [srv 7596] Loaded start config 'C:\Program Files (x86)\checkmk\service\check_mk.yml','C:\ProgramData\checkmk\agent\bakery','C:\ProgramData\checkmk\agent\check_mk.user.yml'
2024-02-26 15:30:52.840 [srv 7596] service to run
2024-02-26 15:30:52.861 [srv 7596] Service Main
2024-02-26 15:30:52.862 [srv 7596] Service handlers registered
2024-02-26 15:30:52.863 [srv 7596] [Trace] Installing cap file 'C:\Program Files (x86)\checkmk\service\install\plugins.cap'
2024-02-26 15:30:52.863 [srv 7596] Timestamp OK, checking file content...
2024-02-26 15:30:52.864 [srv 7596] [Trace] Installing of CAP file is not required
2024-02-26 15:30:52.865 [srv 7596] [Trace] Installing yml file 'C:\Program Files (x86)\checkmk\service\install\check_mk.install.yml'
2024-02-26 15:30:52.865 [srv 7596] Target File 'C:\ProgramData\checkmk\agent\install\check_mk.install.yml' is absent, reinstall is mandatory
2024-02-26 15:30:52.866 [srv 7596] Reinstalling 'C:\ProgramData\checkmk\agent\install\check_mk.install.yml' with 'C:\Program Files (x86)\checkmk\service\install\check_mk.install.yml'
2024-02-26 15:30:52.866 [srv 7596] This Option/YML installation form MSI is ENABLED
2024-02-26 15:30:52.867 [srv 7596] Remove 'C:\ProgramData\checkmk\agent\install\check_mk.install.yml' [OK]
2024-02-26 15:30:52.868 [srv 7596] Supplied yaml 'C:\Program Files (x86)\checkmk\service\install\check_mk.install.yml' will not be installed
2024-02-26 15:30:52.869 [srv 7596] [Trace] Copy file 'C:\Program Files (x86)\checkmk\service\install\checkmk.dat' to 'C:\ProgramData\checkmk\agent\install\checkmk.dat'
2024-02-26 15:30:52.869 [srv 7596] Timestamp OK, checking file content...
2024-02-26 15:30:52.870 [srv 7596] [Trace] Copy is not required, the file is already exists
2024-02-26 15:30:52.871 [srv 7596] Skip installing user yml file
2024-02-26 15:30:52.871 [srv 7596] Timestamp OK, checking file content...
2024-02-26 15:30:52.872 [srv 7596] Starting upgrade(migration) process...
2024-02-26 15:30:52.873 [srv 7596] [Trace] Legacy Agent not found Upgrade is not possible
2024-02-26 15:30:52.874 [srv 7596] [Trace] trying path C:\Program Files (x86)\checkmk\service
2024-02-26 15:30:52.874 [srv 7596] Found root config on path C:\Program Files (x86)\checkmk\service\check_mk.yml
2024-02-26 15:30:52.875 [srv 7596] [Trace] Loading 'C:\Program Files (x86)\checkmk\service\check_mk.yml'
2024-02-26 15:30:52.877 [srv 7596] [Trace] Loading 'C:\ProgramData\checkmk\agent\bakery\check_mk.bakery.yml'
2024-02-26 15:30:52.878 [srv 7596] [Trace] C:\ProgramData\checkmk\agent\bakery\check_mk.bakery.yml is absent, return
2024-02-26 15:30:52.878 [srv 7596] [Trace] Loading 'C:\ProgramData\checkmk\agent\check_mk.user.yml'
2024-02-26 15:30:52.883 [srv 7596] [Trace] Target 'folders' is empty, overriding with source
2024-02-26 15:30:52.884 [srv 7596] Loaded Config Files by Agent [2.2.0p17,64bit,release,Dec 11 2023,19:29:57] @ 'Win10-64 server'
    root:   'C:\Program Files (x86)\checkmk\service\check_mk.yml' size=12750 [OK]
    bakery: 'C:\ProgramData\checkmk\agent\bakery\check_mk.bakery.yml' size=0 [OK]
    user:   'C:\ProgramData\checkmk\agent\check_mk.user.yml' size=17375 [OK]
2024-02-26 15:30:52.887 [srv 7596] [Trace] Enabled Debug
2024-02-26 15:30:52.888 [srv 7596] Loaded start config 'C:\Program Files (x86)\checkmk\service\check_mk.yml','C:\ProgramData\checkmk\agent\bakery','C:\ProgramData\checkmk\agent\check_mk.user.yml'
2024-02-26 15:30:52.889 [srv 7596] [Trace] Successful start of thread
2024-02-26 15:30:52.904 [srv 7596] The network is available
2024-02-26 15:30:52.905 [srv 7596] starting controller
2024-02-26 15:30:52.906 [srv 7596] try to kill
2024-02-26 15:30:52.907 [srv 7596] Processing dir 'C:\ProgramData\checkmk\agent\bin'
2024-02-26 15:30:52.966 [srv 7596] killed 0 processes in 'C:\ProgramData\checkmk\agent\bin'
2024-02-26 15:30:53.122 [srv 7596] Agent controller 'C:\ProgramData\checkmk\agent\bin\cmk-agent-ctl.exe -vv daemon --agent-channel ms/Global\WinAgent_0' started pid [6636]
2024-02-26 15:30:53.172 [ctl:6636] [cmk_agent_ctl][INFO] starting
2024-02-26 15:30:53.231 [ctl:6636] [cmk_agent_ctl][INFO] Loaded config from '"C:\\ProgramData\\checkmk\\agent\\cmk-agent-ctl.toml"', connection registry from '"C:\\ProgramData\\checkmk\\agent\\registered_connections.json"'
2024-02-26 15:30:53.281 [ctl:6636] [cmk_agent_ctl::modes::daemon][INFO] Could not load pre-configured connections from "C:\\ProgramData\\checkmk\\agent\\pre_configured_connections.json": The system cannot find the file specified. (os error 2)
2024-02-26 15:30:53.309 [ctl:6636] [cmk_agent_ctl::misc][DEBUG] Sleeping 20s to avoid DDOSing of sites
2024-02-26 15:30:53.349 [ctl:6636] [cmk_agent_ctl::modes::pull][INFO] Start listening for incoming pull requests
2024-02-26 15:30:53.376 [ctl:6636] [cmk_agent_ctl::modes::pull][INFO] Listening on [::]:6556 for incoming pull connections (IPv6 & IPv4 if activated)
2024-02-26 15:30:53.421 [ctl:6636] [cmk_agent_ctl::misc][DEBUG] Sleeping 40s to avoid DDOSing of sites
2024-02-26 15:30:54.142 [srv 7596] Controller has started: firewall to controller
2024-02-26 15:30:54.142 [srv 7596] Firewall mode is set to configure, adding rule...
2024-02-26 15:30:54.143 [srv 7596] Removing all 'Checkmk Agent' app: 'C:\ProgramData\checkmk\agent\bin\cmk-agent-ctl.exe'
2024-02-26 15:30:54.514 [srv 7596] Removed 1 old rules.
2024-02-26 15:30:54.615 [srv 7596] Firewall rule 'Checkmk Agent' had been added successfully for ports [6556]
2024-02-26 15:30:54.616 [srv 7596] Reading module config normal
2024-02-26 15:30:54.616 [srv 7596] Processed [1] module(s)
2024-02-26 15:30:54.617 [srv 7596] Finding modules
2024-02-26 15:30:54.617 [srv 7596] Module 'C:\Program Files (x86)\checkmk\service\install\python-3.cab' is added to the list
2024-02-26 15:30:54.618 [srv 7596] Installing modules
2024-02-26 15:30:54.619 [srv 7596] Install module python-3
2024-02-26 15:30:54.619 [srv 7596] 'C:\ProgramData\checkmk\agent\install\modules\python-3.cab' is absent, no need to uninstall
2024-02-26 15:30:54.620 [srv 7596] Installation of the module 'python-3' is not required, module file 'C:\Program Files (x86)\checkmk\service\install\python-3.cab'is absent or too short. Backup will be uninstalled
2024-02-26 15:30:54.621 [srv 7596] Module 'python-3' has no package installed, this is normal
2024-02-26 15:30:54.621 [srv 7596] [Warn ] Module 'python-3' has no package in backup dir 'C:\ProgramData\checkmk\agent\install\modules'
2024-02-26 15:30:54.622 [srv 7596] Pre Start actions
2024-02-26 15:30:54.623 [srv 7596] [Warn ] OHM file 'C:\ProgramData\checkmk\agent\bin\OpenHardwareMonitorCLI.exe' is not found
2024-02-26 15:30:54.623 [srv 7596] Allowed Extensions: [checkmk.py,py,exe,bat,vbs,cmd,ps1]
2024-02-26 15:30:54.624 [srv 7596] [Trace] Left [37] files to execute
2024-02-26 15:30:54.626 [srv 7596] Plugin 'C:\ProgramData\checkmk\agent\plugins\iis_app_pool_state.ps1'  is  sync with age:0 timeout:60 retry:0
2024-02-26 15:30:54.627 [srv 7596] [Trace] Left [1] files to execute in 'plugins'
2024-02-26 15:30:54.628 [srv 7596] Allowed Extensions: [checkmk.py,py,exe,bat,vbs,cmd,ps1]
2024-02-26 15:30:54.628 [srv 7596] [Trace] Left [37] files to execute
2024-02-26 15:30:54.631 [srv 7596] Plugin 'C:\ProgramData\checkmk\agent\plugins\iis_app_pool_state.ps1'  is  sync with age:0 timeout:60 retry:0
2024-02-26 15:30:54.631 [srv 7596] [Trace] Left [1] files to execute in 'plugins'
2024-02-26 15:30:54.632 [srv 7596] Allowed Extensions: [exe,bat,vbs,cmd,ps1]
2024-02-26 15:30:54.632 [srv 7596] [Trace] Left [0] files to execute
2024-02-26 15:30:54.633 [srv 7596] [Trace] Left [0] files to execute in 'local'
2024-02-26 15:30:54.633 [srv 7596] Pre Start actions ended
2024-02-26 15:30:55.144 [srv 7596] Waiting for async threads [0]
2024-02-26 15:30:55.145 [srv 7596] Left async threads [0] after waiting 0ms
2024-02-26 15:30:55.146 [srv 7596] Starting io with 0 6636
2024-02-26 15:30:55.146 [srv 7596] main Wait Loop
2024-02-26 15:30:55.147 [srv 7596] Applying config auto restart_on_crash:true error_mode: log
2024-02-26 15:30:55.150 [srv 7596] cma::world::ExternalPort::mailslotThreadProc:  started
2024-02-26 15:31:33.198 [ctl:6636] [cmk_agent_ctl::modes::renew_certificate][DEBUG] Checking registered connections for certificate expiry.
2024-02-26 15:31:36.339 [ctl:6636] [cmk_agent_ctl::modes::pull][INFO] [::ffff:172.16.11.50]:42194: Handling pull request.
2024-02-26 15:31:36.370 [ctl:6636] [cmk_agent_ctl::modes::pull][DEBUG] [::ffff:172.16.11.50]:42194: Handling pull request DONE (Task detached).
2024-02-26 15:31:36.449 [ctl:6636] [cmk_agent_ctl::modes::pull][DEBUG] handle_request starts
2024-02-26 15:31:36.483 [srv 7596] Request is '::ffff:172.16.11.50 Global\WinAgentCtl_6636_1940_0'
2024-02-26 15:31:36.484 [srv 7596] [Warn ] OHM file 'C:\ProgramData\checkmk\agent\bin\OpenHardwareMonitorCLI.exe' is not found
2024-02-26 15:31:36.485 [srv 7596] Allowed Extensions: [checkmk.py,py,exe,bat,vbs,cmd,ps1]
2024-02-26 15:31:36.486 [srv 7596] [Trace] Left [37] files to execute
2024-02-26 15:31:36.489 [srv 7596] Plugin 'C:\ProgramData\checkmk\agent\plugins\iis_app_pool_state.ps1'  is  sync with age:0 timeout:60 retry:0
2024-02-26 15:31:36.489 [srv 7596] [Trace] Left [1] files to execute in 'plugins'
2024-02-26 15:31:36.490 [srv 7596] Generating answer with id [3080202446000]
2024-02-26 15:31:36.491 [srv 7596] Id is [3123851746300] for ip [::ffff:172.16.11.50]
2024-02-26 15:31:36.492 [srv 7596] [Trace] Provider 'wmi_cpuload' is direct called, id '3123851746300' port [mail:\\.\mailslot\Global\WinAgent_0]
2024-02-26 15:31:36.531 [ctl:6636] [rustls::conn][WARN] Sending fatal alert AccessDenied
2024-02-26 15:31:36.539 [srv 7596] [Trace] Found Name at 29
2024-02-26 15:31:36.540 [srv 7596] [Trace] Found NumberOfLogicalProcessors at 32
2024-02-26 15:31:36.540 [srv 7596] [Trace] Found NumberOfProcessors at 33
2024-02-26 15:31:36.987 [srv 7596] [Trace] Sending data 'wmi_cpuload' id is [3123851746300] length [238]
2024-02-26 15:31:36.988 [srv 7596] perf: Section 'wmi_cpuload' took [496] milliseconds
2024-02-26 15:31:36.989 [srv 7596] [Trace] Provider 'uptime' is about to be started, id '3123851746300' port [mail:\\.\mailslot\Global\WinAgent_0]
2024-02-26 15:31:36.989 [srv 7596] Using file 'C:\Program Files (x86)\checkmk\service\check_mk_agent.exe' for winperf
2024-02-26 15:31:36.990 [srv 7596] [Trace] Sending data 'uptime' id is [3123851746300] length [17]
2024-02-26 15:31:36.991 [srv 7596] [Warn ] No forking to get winperf data: may lead to handle leak.
2024-02-26 15:31:36.991 [srv 7596] perf: Section 'uptime' took [1] milliseconds
2024-02-26 15:31:36.992 [srv 7596] [Trace] Loaded [3] entries in LogWatch

Found the issue old automation that went nuts and tried registering already registered clients

1 Like