[agent] Communication failed: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:2546)

Version: CEE 2.2.0p7
OS: Debian Linux 11.7

Error message: [agent] Communication failed: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:2546)

Hi all,

We recently updated our distributed monitoring from 2.0.0p37 to 2.2.0p7, going via the latest 2.1.0 and 2.2.0p5.
I built new agents using the agent bakery, deployed them and enabled the TLS connection between all slaves and the master successfully.

I then tried to change the agent communication for several of our monitored hosts (Linux and Windows). The behavior was the same on both platforms: registration against the corresponding slave works fine, and cmk-agent-ctl status shows that the host is registered and in pull mode:

But the monitoring is not able to query the host anymore:

I reinstalled the agent manually and re-registered it multiple times, but nothing helped.

From the log in C:\ProgramData\checkmk\agent\log\check_mk.log:

2023-07-31 18:00:32.394 [ctl:4016] [cmk_agent_ctl::modes::pull][INFO] [::ffff:10.*.*.*]:48286: Handling pull request.
2023-07-31 18:00:32.418 [ctl:4016] [cmk_agent_ctl::modes::pull][DEBUG] [::ffff:10.*.*.*]:48286: Handling pull request DONE (Task detached).
2023-07-31 18:00:32.449 [ctl:4016] [cmk_agent_ctl::modes::pull][DEBUG] handle_request starts
2023-07-31 18:00:32.488 [srv 7448] Request is '::ffff:10.*.*.* Global\WinAgentCtl_4016_3324_0'
2023-07-31 18:00:32.515 [srv 7448] [Warn ] OHM file 'C:\ProgramData\checkmk\agent\bin\OpenHardwareMonitorCLI.exe' is not found
2023-07-31 18:00:32.517 [ctl:4016] [rustls::server::hs][DEBUG] decided upon suite TLS13_AES_256_GCM_SHA384
2023-07-31 18:00:32.543 [ctl:4016] [rustls::conn][WARN] Sending fatal alert HandshakeFailure
[...]
2023-07-31 18:00:36.731 [ctl:4016] [rustls::server::hs][DEBUG] decided upon suite TLS13_AES_256_GCM_SHA384
[...]
2023-07-31 18:00:42.059 [ctl:4016] [cmk_agent_ctl::modes::pull][DEBUG] processed task!
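
In a longer check_mk.log the handshake messages are easy to miss between the other entries. The following is a minimal sketch, assuming the log lines keep the layout shown in the excerpt above; the names `CTL_LINE`, `TlsEvent` and `tls_events` are made up for illustration and are not part of Checkmk.

```python
import re
from typing import Iterable, NamedTuple

# Matches agent controller lines such as:
# 2023-07-31 18:00:32.543 [ctl:4016] [rustls::conn][WARN] Sending fatal alert HandshakeFailure
# The layout is an assumption taken from the excerpt above.
CTL_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[ctl:\d+\] "
    r"\[(?P<module>[^\]]+)\]\[(?P<level>[A-Z]+)\] "
    r"(?P<msg>.*)$"
)


class TlsEvent(NamedTuple):
    timestamp: str
    level: str
    message: str


def tls_events(lines: Iterable[str]) -> list[TlsEvent]:
    """Return only the entries logged by a rustls module."""
    events = []
    for line in lines:
        match = CTL_LINE.match(line.strip())
        # Lines from other components (for example "[srv 7448] ...") do not match.
        if match and match["module"].startswith("rustls"):
            events.append(TlsEvent(match["ts"], match["level"], match["msg"]))
    return events
```

Passing the lines of the log file to `tls_events` and printing the result leaves only the cipher suite decisions and alerts, which makes it easier to line them up with the timestamps of the failing checks on the site.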

Then I tried to register the host to a second CMK site (teststg1), which is running on the same VM as the first site “stg1”. This instance is a copy of “stg1” on which we tested the upgrade steps before we upgraded our PROD system. That also worked fine, so the host “pathfinder” is now registered to two different sites:

Now the fun part: I query the host from both sites (left: site stg1; right: site teststg1):

So the query works fine for the site “teststg1”, but fails for site “stg1” with the TLS handshake error.

Full output of “cmk --debug -vvn pathfinder” on site “stg1”:

OMD[stg1]:~$ cmk --debug -vvn pathfinder
Checkmk version 2.2.0p7
+ FETCHING DATA
  Source: SourceInfo(hostname='pathfinder', ipaddress='10.*.*.*', ident='agent', fetcher_type=<FetcherType.TCP: 8>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f39f7571b90]
Read from cache: AgentFileCache(pathfinder, path_template=/omd/sites/stg1/tmp/check_mk/cache/{hostname}, max_age=MaxAge(checking=0, discovery=120, inventory=120), simulation=False, use_only_cache=False, file_cache_mode=6)
Not using cache (does not exist)
[TCPFetcher] Execute data source
Connecting via TCP to 10.*.*.*:6556 (5.0s timeout)
Detected transport protocol: TransportProtocol.TLS (b'16')
Reading data from agent via TLS socket
Reading data from agent
Closing TCP connection to 10.*.*.*:6556
[cpu_tracking] Stop [7f39f7571b90 - Snapshot(process=posix.times_result(user=0.010000000000000009, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.019999999552965164))]
  Source: SourceInfo(hostname='pathfinder', ipaddress='10.*.*.*', ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f39f6601ed0]
Read from cache: NoCache(pathfinder, path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
[PiggybackFetcher] Execute data source
Piggyback file '/omd/sites/stg1/tmp/check_mk/piggyback/pathfinder/HSTG26': Successfully processed from source 'HSTG26'
No piggyback files for '10.*.*.*'. Skip processing.
[cpu_tracking] Stop [7f39f6601ed0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
<<<hyperv_vmstatus:cached(1690820202,90)>>> / Transition NOOPParser -> HostSectionParser
<<<hyperv_checkpoints:cached(1690820202,90)>>> / Transition HostSectionParser -> HostSectionParser
<<<labels:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
No persisted sections
  HostKey(hostname='pathfinder', source_type=<SourceType.HOST: 1>)  -> Not adding sections: SSLError(1, '[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:2546)')
  HostKey(hostname='pathfinder', source_type=<SourceType.HOST: 1>)  -> Add sections: ['hyperv_checkpoints', 'hyperv_vmstatus', 'labels']
Received no piggyback data
[cpu_tracking] Start [7f39f620fb90]
value store: synchronizing
Trying to acquire lock on /omd/sites/stg1/tmp/check_mk/counters/pathfinder
Got lock on /omd/sites/stg1/tmp/check_mk/counters/pathfinder
value store: loading from disk
Releasing lock on /omd/sites/stg1/tmp/check_mk/counters/pathfinder
Released lock on /omd/sites/stg1/tmp/check_mk/counters/pathfinder
CPU utilization      PEND Check plugin received no monitoring data
Check_MK Agent       PEND Check plugin received no monitoring data
Disk IO SUMMARY      PEND Check plugin received no monitoring data
HyperV Checkpoints   No Checkpoints found
HyperV Status        Integration Service State: Ok
Interface Microsoft Hyper-V Network Adapter PEND Check plugin received no monitoring data
LOG Application      PEND Check plugin received no monitoring data
LOG System           PEND Check plugin received no monitoring data
Memory and pagefile  PEND Check plugin received no monitoring data
Processor Queue      PEND Check plugin received no monitoring data
Service Summary      PEND Check plugin received no monitoring data
Sessions             PEND Check plugin received no monitoring data
System Time          PEND Check plugin received no monitoring data
TCP Connections      PEND Check plugin received no monitoring data
Uptime               PEND Check plugin received no monitoring data
Windows License      PEND Check plugin received no monitoring data
fs_C:/               PEND Check plugin received no monitoring data
service_Netlogon     PEND Check plugin received no monitoring data
service_TermService  PEND Check plugin received no monitoring data
service_WinRM        PEND Check plugin received no monitoring data
Piggyback file '/omd/sites/stg1/tmp/check_mk/piggyback/pathfinder/HSTG26': Successfully processed from source 'HSTG26'
No piggyback files for '10.*.*.*'. Skip processing.
[cpu_tracking] Stop [7f39f620fb90 - Snapshot(process=posix.times_result(user=0.010000000000000009, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.010000001639127731))]
[agent] SSLError(1, '[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:2546)')(!!), [piggyback] Successfully processed from source 'HSTG26', Missing monitoring data for plugins: checkmk_agent, df, logwatch, mem_win, services, services_summary, systemtime, tcp_conn_stats, uptime, win_license, winperf_if, winperf_phydisk, winperf_processor_util, winperf_ts_sessions, wmi_cpuload(!), execution time 0.0 sec | execution_time=0.030 user_time=0.020 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.010

Full output of “cmk --debug -vvn pathfinder” on site “teststg1”:

OMD[teststg1]:~$ cmk --debug -vvn pathfinder
Checkmk version 2.2.0p7
+ FETCHING DATA
  Source: SourceInfo(hostname='pathfinder', ipaddress='10.*.*.*', ident='agent', fetcher_type=<FetcherType.TCP: 8>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f995e226190]
Read from cache: AgentFileCache(pathfinder, path_template=/omd/sites/teststg1/tmp/check_mk/cache/{hostname}, max_age=MaxAge(checking=0, discovery=120, inventory=120), simulation=False, use_only_cache=False, file_cache_mode=6)
Not using cache (Too old. Age is 49 sec, allowed is 0 sec)
[TCPFetcher] Execute data source
Connecting via TCP to 10.*.*.*:6556 (5.0s timeout)
Detected transport protocol: TransportProtocol.TLS (b'16')
Reading data from agent via TLS socket
Reading data from agent
Detected transport protocol: TransportProtocol.PLAIN (b'<<')
Closing TCP connection to 10.*.*.*:6556
Write data to cache file /omd/sites/teststg1/tmp/check_mk/cache/pathfinder
Trying to acquire lock on /omd/sites/teststg1/tmp/check_mk/cache/pathfinder
Got lock on /omd/sites/teststg1/tmp/check_mk/cache/pathfinder
Releasing lock on /omd/sites/teststg1/tmp/check_mk/cache/pathfinder
Released lock on /omd/sites/teststg1/tmp/check_mk/cache/pathfinder
[cpu_tracking] Stop [7f995e226190 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=5.300000000745058))]
  Source: SourceInfo(hostname='pathfinder', ipaddress='10.*.*.*', ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f995c70ddd0]
Read from cache: NoCache(pathfinder, path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
[PiggybackFetcher] Execute data source
No piggyback files for 'pathfinder'. Skip processing.
No piggyback files for '10.*.*.*'. Skip processing.
[cpu_tracking] Stop [7f995c70ddd0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
<<<check_mk>>> / Transition NOOPParser -> HostSectionParser
<<<cmk_agent_ctl_status:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<wmi_cpuload:sep(124)>>> / Transition HostSectionParser -> HostSectionParser
<<<df:sep(9)>>> / Transition HostSectionParser -> HostSectionParser
<<<uptime>>> / Transition HostSectionParser -> HostSectionParser
<<<fileinfo:sep(124)>>> / Transition HostSectionParser -> HostSectionParser
<<<mem>>> / Transition HostSectionParser -> HostSectionParser
<<<services>>> / Transition HostSectionParser -> HostSectionParser
<<<logwatch>>> / Transition HostSectionParser -> HostSectionParser
<<<winperf_phydisk>>> / Transition HostSectionParser -> HostSectionParser
<<<winperf_if>>> / Transition HostSectionParser -> HostSectionParser
<<<winperf_processor>>> / Transition HostSectionParser -> HostSectionParser
<<<winperf_tcp_conn>>> / Transition HostSectionParser -> HostSectionParser
<<<winperf_ts_sessions>>> / Transition HostSectionParser -> HostSectionParser
<<<checkmk_agent_plugins_win:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<dotnet_clrmemory:sep(124)>>> / Transition HostSectionParser -> HostSectionParser
<<<ps:sep(9)>>> / Transition HostSectionParser -> HostSectionParser
Transition HostSectionParser -> NOOPParser
<<<winperf_if_teaming:sep(9)>>> / Transition NOOPParser -> HostSectionParser
<<<winperf_if_get_netadapter:sep(124)>>> / Transition HostSectionParser -> HostSectionParser
<<<win_license>>> / Transition HostSectionParser -> HostSectionParser
<<<cmk_update_agent_status:cached(1690816779,3600):sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<win_netstat>>> / Transition HostSectionParser -> HostSectionParser
Transition HostSectionParser -> NOOPParser
<<<systemtime>>> / Transition NOOPParser -> HostSectionParser
No persisted sections
No persisted sections
  HostKey(hostname='pathfinder', source_type=<SourceType.HOST: 1>)  -> Add sections: ['check_mk', 'checkmk_agent_plugins_win', 'cmk_agent_ctl_status', 'cmk_update_agent_status', 'df', 'dotnet_clrmemory', 'fileinfo', 'logwatch', 'mem', 'ps', 'services', 'systemtime', 'uptime', 'win_license', 'win_netstat', 'winperf_if', 'winperf_if_get_netadapter', 'winperf_if_teaming', 'winperf_phydisk', 'winperf_processor', 'winperf_tcp_conn', 'winperf_ts_sessions', 'wmi_cpuload']
  HostKey(hostname='pathfinder', source_type=<SourceType.HOST: 1>)  -> Add sections: []
Received no piggyback data
[cpu_tracking] Start [7f995c74fed0]
value store: synchronizing
Trying to acquire lock on /omd/sites/teststg1/tmp/check_mk/counters/pathfinder
Got lock on /omd/sites/teststg1/tmp/check_mk/counters/pathfinder
value store: loading from disk
Releasing lock on /omd/sites/teststg1/tmp/check_mk/counters/pathfinder
Released lock on /omd/sites/teststg1/tmp/check_mk/counters/pathfinder
CPU utilization      Total CPU: 84.52%
Check_MK Agent       Version: 2.2.0p7, OS: windows, Update error: The agent updater is not registered at the deployment server(!), No successful connect to server yet(!), Agent plugins: 5, Local checks: 0
Disk IO SUMMARY      Read: 164 kB/s, Write: 1.13 MB/s, Latency: 6 milliseconds
Interface Microsoft Hyper-V Network Adapter [Ethernet], (Up), MAC: 00:15:5D:08:E2:08, Speed: 2 GBit/s, In: 892 kBit/s (0.04%), Out: 3.25 MBit/s (0.16%)
LOG Application      No error messages
LOG System           No error messages
Memory and pagefile  RAM: 84.32% - 3.37 GiB of 4.00 GiB, 2 min average: 84.16% (3.37 GiB), Commit charge: 76.18% - 3.62 GiB of 4.75 GiB, 2 min average: 75.56% (3.59 GiB)
Processor Queue      15 min load: 2.92, 15 min load per core: 1.46 (2 logical cores)
Service Summary      Autostart services: 60, Stopped services: 4
Sessions             1 Active, 2 Inactive
System Time          Offset: -38 milliseconds
TCP Connections      Established: 7
Uptime               Up since Jul 14 2023 14:22:39, Uptime: 17 days 3 hours
Windows License      Software is Licensed, License will expire in 174 days 19 hours
fs_C:/               Used: 59.97% - 47.1 GiB of 78.6 GiB, trend per 1 day 0 hours: +288 MiB, trend per 1 day 0 hours: +0.36%, Time left until disk full: 111 days 19 hours
service_Netlogon     Netlogon: running (start type is auto)
service_TermService  Remote Desktop Services: running (start type is demand)
service_WinRM        Windows Remote Management (WS-Management): running (start type is auto)
No piggyback files for 'pathfinder'. Skip processing.
No piggyback files for '10.*.*.*'. Skip processing.
[cpu_tracking] Stop [7f995c74fed0 - Snapshot(process=posix.times_result(user=0.010000000000000009, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.019999999552965164))]
[agent] Success, [piggyback] Success (but no data found for this host), execution time 5.3 sec | execution_time=5.320 user_time=0.010 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=5.300

To sum up: of two almost identical sites on one VM, one can query a registered host via TLS, while the other fails with a TLS handshake error.

Any ideas and clues are more than welcome!

Cheers Manuel

We can also see a RequiredEkuNotFound error in /var/log/messages during a connection attempt:

image


Did you check your firewall rules about port 8000?

Rg, Christian

Hi Christian,

Yes, port 8000 is open and no blocked packets are visible on the firewall. The registration is also working fine. I guess the registration wouldn’t work without port 8000?

Cheers Manuel

You can try to remove the registration from the host properties and try it again.

Rg, Christian

You mean running cmk-agent-ctl delete-all --enable-insecure-connections on the host and removing TLS in the CMK GUI? I tried this multiple times without success.

The really strange thing is that it works for the teststg1 site, but not for the stg1 site.
Do you have any idea which files I could compare between both sites?

Cheers Manuel
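
One way to approach the "which files could I compare" question is a recursive content diff of the same subdirectory in both sites. This is only a sketch: the function name `diff_trees` is made up for illustration, and which directories are worth comparing (for example `etc/ssl` below each site's home directory) is an assumption, not something confirmed in this thread.

```python
import filecmp
from pathlib import Path


def diff_trees(left: Path, right: Path, prefix: Path = Path(".")) -> list[tuple[str, str]]:
    """Recursively compare two directories by file content.

    Returns (kind, relative_path) pairs, where kind is "left_only",
    "right_only" or "differs".
    """
    cmp = filecmp.dircmp(left, right)
    findings = [("left_only", (prefix / n).as_posix()) for n in sorted(cmp.left_only)]
    findings += [("right_only", (prefix / n).as_posix()) for n in sorted(cmp.right_only)]

    # shallow=False forces a byte-by-byte comparison instead of trusting size/mtime.
    # Files that cannot be read on one side land in `errors`; they are reported
    # as "differs" so that they are not silently skipped.
    _, mismatch, errors = filecmp.cmpfiles(left, right, cmp.common_files, shallow=False)
    findings += [("differs", (prefix / n).as_posix()) for n in sorted(mismatch + errors)]

    for sub in sorted(cmp.common_dirs):
        findings += diff_trees(left / sub, right / sub, prefix / sub)
    return findings
```

For the two sites here the call could look like `diff_trees(Path("/omd/sites/stg1/etc/ssl"), Path("/omd/sites/teststg1/etc/ssl"))`. Certificate and key files can be expected to differ byte for byte between two sites in any case, so a "differs" entry is only a pointer to where to look; the interesting detail is who issued each certificate.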

I think I have a solution for this problem.
Is it possible that you used your own certificate for the site? I mean the certificate inside “~/etc/ssl/sites/…”.

Hey Andreas,
The cert was created automatically and is signed by our own CA.
But we have found a workaround now: we created a new site on every slave, added them to the master, and then moved all hosts to the newly created sites.
Registration is working fine now. We suspect that the old sites carry some old config that interferes somehow, as they were updated multiple times (starting from version 1.1 or 1.2, up to v2.2 now).
Cheers Manuel

No, the problem was your certificate inside the folder I mentioned.
Yesterday I had a system with exactly the same error. There, the site certificate had been replaced with one from the company's own CA.

You’re right! I compared the details of both certs: the one that does not work was signed by our company CA, while the working one was signed locally.

Even though it’s solved for us now, I guess someone else might appreciate it if you could share how to replace the cert with a local one?

Hi,

I have had the same error here since I upgraded to version 2.1.0.

After registering the agent for TLS, I no longer have any communication with it.

Could you help me with this?

Thanks