CMK: Not able to use TLS on numerous Ubuntu Hosts, breaks monitoring

CMK version:
2.1.0.p9 (CMK Raw)
OS version:
Ubuntu Server 20.04.4 LTS (Built myself from ISO)
Ubuntu Server 18.04.6 LTS (Virtual Appliance from OVA)
Error message:
TLS agent connections stopped working on all of my Ubuntu hosts after I upgraded their agents from 2.0.something to 2.1.0.p9.
RHEL and Alma hosts are having no issues.
I’m having to remove the TLS registration on the server for my Ubuntu hosts in order to restore monitoring; otherwise only the vCenter host-level checks work and all agent-based monitoring dies.

Several errors can be observed on an agent’s status page.
Service: Check_MK

[agent] Host is registered for TLS but not using it (CRIT), [piggyback] Successfully processed from source 'vCenter.mydomain.local', Missing monitoring data for plugins: checkmk_agent, df, kernel_util, lnx_if, mem_linux, systemd_units_services_summary, tcp_conn_stats, uptime (WARN), execution time 0.1 sec

Some hosts (but not all) will also display:
Service: Systemd Service Summary

 service failed (cmk-agent-ctl-daemon) (CRIT)

If I go to connection tests for that Host
Agent Test:

Host is registered for TLS but not using it
<<<esx_vsphere_vm:cached(1660458089,90)>>>
config.datastoreUrl name NFS_Diskstation02-01|accessible true|capacity 17251337441280|freeSpace 7659080781824|maintenanceMode normal|type NFS41|uncommitted 4115428454400|url ds:///vmfs/volumes/b5b7b7b5-219c4d2f-0000-000000000000/
config.guestFullName Ubuntu Linux (64-bit)
config.hardware.device virtualDeviceType VirtualCdrom|label CD/DVD drive 1|summary Remote ATAPI|startConnected false|allowGuestControl true|connected false|status ok@@virtualDeviceType VirtualVmxnet3|label Network adapter 1|summary DVSwitch: 50 15 fe ea 4d 16 69 13-e8 23 38 09 b0 14 05 cd|startConnected true|allowGuestControl true|connected true|status ok
config.hardware.memoryMB 2048
config.hardware.numCPU 2
config.hardware.numCoresPerSocket 1
config.template false
config.uuid 42150d85-f1ae-71ef-4b6c-43f4baca4ffc
config.version vmx-19
guest.toolsVersion 11360
guest.toolsVersionStatus guestToolsUnmanaged
guestHeartbeatStatus green
name vmansible01
runtime.host arthur.MyDomain.local
runtime.powerState poweredOn
summary.guest.hostName vmansible01
summary.quickStats.balloonedMemory 0
summary.quickStats.compressedMemory 0
summary.quickStats.consumedOverheadMemory 40
summary.quickStats.distributedCpuEntitlement 179
summary.quickStats.distributedMemoryEntitlement 931
summary.quickStats.guestMemoryUsage 163
summary.quickStats.hostMemoryUsage 2084
summary.quickStats.overallCpuDemand 179
summary.quickStats.overallCpuUsage 161
summary.quickStats.privateMemory 2044
summary.quickStats.sharedMemory 0
summary.quickStats.staticCpuEntitlement 1086
summary.quickStats.staticMemoryEntitlement 2318
summary.quickStats.swappedMemory 0
summary.quickStats.uptimeSeconds 112368
<<<labels:sep(0)>>>
{"cmk/piggyback_source_vCenter.MyDomain.local": "yes"}

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

I’m now beginning to see the following error persist for the Systemd Service Summary of multiple hosts after disabling TLS registration.

Perhaps this is the root issue?
Where does the CMK agent store its logs?

Systemd Service Summary: Total: 154, Disabled: 2, Failed: 1, 1 service failed (cmk-agent-ctl-daemon) (CRIT)

Edit: after some digging, I found this post
https://forum.checkmk.com/t/trouble-after-upgrading-to-2-1-agent/31675/15

I’ve run the following commands which seem to fix the issue? Not sure if there are any implications to doing this though…

sudo systemctl stop check_mk.socket
sudo systemctl disable check_mk.socket

I tried to stop the socket, remove check-mk-agent then install 2.1.x but that didn’t do it… had to outright disable it.

Could someone confirm whether check_mk.socket is residual from 2.0.x, and whether there is a cleaner method for removing it than just disabling the service?

Woah, you went full berserk on this. :sweat_smile:

First: Revert all the changes you describe here.
Then take a look at /etc/xinetd.d/; there might be a residual configuration file called checkmk or similar.
You can either delete that file or remove xinetd altogether.
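
A quick way to look for such a residual file, as a rough sketch only: the function name find_xinetd_leftovers and the name pattern are my own assumptions, since the exact file name can differ between installations.

```shell
# Hypothetical helper: list xinetd config files that look like Checkmk agent
# leftovers. The directory defaults to /etc/xinetd.d.
find_xinetd_leftovers() {
    dir="${1:-/etc/xinetd.d}"
    # Nothing to report if xinetd was never installed.
    [ -d "$dir" ] || return 0
    # Assumed pattern: matches names such as check_mk, check-mk-agent or checkmk.
    find "$dir" -maxdepth 1 -type f -iname 'check*mk*' | sort
}

find_xinetd_leftovers /etc/xinetd.d
```

If this prints a path, that file is a candidate for removal; an empty result means xinetd is probably not the culprit.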

If it is not that, double-check your TLS registration process per our official guide: Monitoring Linux - The new agent for Linux in detail

I had purged that file during some of my tests but the issue persisted.
I had registered TLS per the doc you referenced, which usually worked for a few mins until it suddenly breaks. After that I have to disable the check_mk.socket for it to work right.

No issues on AlmaLinux though, just my Ubuntu hosts.

The socket appears to be somehow re-enabling itself on at least one host =/

So, the band-aid stopped working on one host.
How can I go about purging all traces of any binaries, services and configuration files that may be left over from 2.0.x?

I’m thinking a complete purge and then install of 2.1 may be the way to go

In order to get the system linked again I had to do this.

Purge TLS registration for host from WebUI
sudo systemctl stop check_mk.socket
sudo systemctl disable check_mk.socket
sudo apt remove check-mk-agent
sudo apt install ./check-mk-agent_2.1.0p9-1_all.deb
sudo systemctl stop check_mk.socket
sudo systemctl disable check_mk.socket
echo y|sudo cmk-agent-ctl register --hostname vmansible01 \
  --server vmcheckmk01.mydomain.net:8000 --site cmk \
  --user automation --password MYTOKEN
sudo reboot (To verify it continues to work post-reboot)

If there are leftovers from 2.0 then you will find the systemd unit files inside “/etc/systemd/system”
2.1 unit files are not there anymore.

between these two steps you should check if there is an old unit file left.

Here are the contents of that directory,

Anything you see that should NOT be present with 2.1?
check_mk-async.service
check_mk@.service
check_mk.socket
cloud-final.service.wants
cloud-init.target.wants
dbus-org.freedesktop.resolve1.service
dbus-org.freedesktop.thermald.service
dbus-org.freedesktop.timesync1.service
dcservice.service
default.target.wants
emergency.target.wants
final.target.wants
fwupd-refresh.service
getty.target.wants
graphical.target.wants
iscsi.service
krb5-admin-server.service
krb5-kdc.service
mdmonitor.service.wants
multipath-tools.service
multi-user.target.wants
network-online.target.wants
open-vm-tools.service.requires
paths.target.wants
rescue.target.wants
sleep.target.wants
snap-core20-1587.mount
snap-core20-1593.mount
snap-lxd-21835.mount
snap-lxd-22753.mount
snap.lxd.activate.service
snap.lxd.daemon.service
snap.lxd.daemon.unix.socket
snap-snapd-14978.mount
snap-snapd-16292.mount
sockets.target.wants
sshd-keygen@.service.d
sshd.service
sysinit.target.wants
syslog.service
timers.target.wants
vmtoolsd.service

These three unit files (check_mk-async.service, check_mk@.service and check_mk.socket) should not be there.
After removing the old agent you can remove these. I would stop the socket before removing the unit.
If you then install the 2.1 agent you should only have unit files inside “/lib/systemd/system/”.
There should now be three:

  • check-mk-agent-async.service
  • check-mk-agent@.service
  • check-mk-agent.socket
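
To check both locations in one go, something like the following might help. It is only a sketch: the helper name units_present is made up, and the unit names are the ones mentioned in this thread.

```shell
# Unit names as discussed in this thread (2.0 leftovers vs. 2.1 units).
OLD_UNITS="check_mk.socket check_mk@.service check_mk-async.service"
NEW_UNITS="check-mk-agent.socket check-mk-agent@.service check-mk-agent-async.service"

# Hypothetical helper: print every unit name from the list ($2)
# that exists in the directory ($1).
units_present() {
    dir="$1"
    for unit in $2; do
        [ -e "$dir/$unit" ] && echo "$unit"
    done
    return 0
}

echo "Leftover 2.0 units in /etc/systemd/system:"
units_present /etc/systemd/system "$OLD_UNITS"
echo "2.1 units in /lib/systemd/system:"
units_present /lib/systemd/system "$NEW_UNITS"
```

Ideally the first list is empty and the second shows all three 2.1 units.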

The port 6556 should be used by the agent controller (cmk-agent-ctl).
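
To see which process actually holds the port, the output of `ss -tlnp` can be filtered. The function below is a sketch under the assumption that `ss` prints its usual column layout (local address in column 4, process info last); listeners_on_port is a made-up name.

```shell
# Hypothetical helper: read `ss -tlnp` output on stdin and print the name of
# the first process listed for each socket listening on the given TCP port.
listeners_on_port() {
    port="$1"
    # Column 4 is the local address:port; the last column carries users:(("name",...)).
    awk -v p=":$port" '$4 ~ (p "$") { print $NF }' |
        sed -n 's/.*users:(("\([^"]*\)".*/\1/p'
}

# Usage on a live host (root is needed to see process names):
#   sudo ss -tlnp | listeners_on_port 6556
```

If this reports systemd instead of cmk-agent-ctl, the old socket unit is probably still grabbing the port.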

That appears to do it, so apparently the following is required before upgrading from 2.0.x to 2.1.x on Debian or Ubuntu.

sudo systemctl stop check_mk.socket
sudo apt purge -y check-mk-agent
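
For anyone scripting this across many hosts, the whole sequence from this thread could be sketched roughly as below. It only prints the commands so they can be reviewed first. The function name print_upgrade_steps is made up, and the explicit `rm` of the old unit files plus the `systemctl daemon-reload` are my own additions, not something confirmed as required.

```shell
# Hypothetical helper: print the upgrade steps for review instead of running them.
print_upgrade_steps() {
    deb="$1"   # path to the 2.1 agent package
    echo "sudo systemctl stop check_mk.socket"
    echo "sudo apt purge -y check-mk-agent"
    # Assumption: remove any 2.0 unit files that the purge left behind.
    for unit in check_mk.socket check_mk@.service check_mk-async.service; do
        echo "sudo rm -f /etc/systemd/system/$unit"
    done
    # Assumption: make systemd forget the removed units before reinstalling.
    echo "sudo systemctl daemon-reload"
    echo "sudo apt install -y $deb"
}

print_upgrade_steps ./check-mk-agent_2.1.0p9-1_all.deb
```

TLS registration with cmk-agent-ctl register would still have to follow afterwards.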

Could this possibly be added to the Linux agent troubleshooting page, since it doesn’t appear to be a one-off issue? Or maybe even have a check for those old check_mk service files added to future 2.1.x installers?

Today I had a short discussion with Tribe staff and submitted the findings for Ubuntu/Debian.
I think something has to change, as the 2.1 agent cannot be deployed in large environments at the moment. Every deployment needs manual or scripted steps, so updating via the agent bakery does not work right now.


I think this has long been fixed in:

and
