I have a master and 8 slave sites. The master site replicates the configuration to the slaves and the communication between them is encrypted. The setting “Trusted certificate authorities for SSL” is at its factory default (“Use system wide CA”). It worked for many months without issues.
For the past few days, the file ~/var/ssl/ca-certificates.crt on the master has been reduced to 0 bytes whenever I activate changes. It looks like the file gets truncated or overwritten and ends up empty. This causes connections to the slaves to fail, because they cannot be verified without the certificates. If I manually add the appropriate certificates to the empty file and re-run the activation, the activation works again until the next time.
I do not know the root cause of the issue, nor can I see why the file ends up empty. The permissions are OK. An update from 2.0.0p35 to 2.0.0p37 didn’t improve anything.
The partition has lots of free space.
The file ~/var/log/web.log contains many SSL-related errors that are generated when changes are activated, because of the failed certificate verification, but no hint as to why the file ca-certificates.crt ends up empty.
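To narrow down when the bundle is emptied, a small check could be run before and after each activation. This is only a minimal sketch and not part of Checkmk: the helper name `inspect_ca_bundle` is made up for illustration, and it assumes the file is a PEM bundle at the path mentioned above and that the script runs as the site user.

```python
from pathlib import Path

PEM_BEGIN = b"-----BEGIN CERTIFICATE-----"


def inspect_ca_bundle(path):
    """Return (size_in_bytes, number_of_PEM_certificate_blocks) for a CA bundle.

    Hypothetical helper for illustration; not a Checkmk API.
    """
    data = Path(path).read_bytes()
    return len(data), data.count(PEM_BEGIN)


if __name__ == "__main__":
    # Assumed location, taken from the post: ~/var/ssl/ca-certificates.crt
    bundle = Path.home() / "var" / "ssl" / "ca-certificates.crt"
    if bundle.exists():
        size, certs = inspect_ca_bundle(bundle)
        print(f"{bundle}: {size} bytes, {certs} certificate(s)")
    else:
        print(f"{bundle}: not found")
```

A result of 0 bytes and 0 certificates right after an activation would confirm that the activation itself is what empties the file.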
My workaround for now: I removed the write permission for the user of the master site on the directory ~/var/ssl. The activation process displays a warning that the file could not be written, but everything else works fine.
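For reference, the workaround amounts to clearing the owner's write bit on the directory. The sketch below is an illustration only: the function names are hypothetical, and it assumes the site user owns ~/var/ssl (for root the missing write bit would have no effect).

```python
import os
import stat


def drop_owner_write(directory):
    """Clear the owner's write bit on `directory` and return the new mode bits."""
    current = stat.S_IMODE(os.stat(directory).st_mode)
    new_mode = current & ~stat.S_IWUSR
    os.chmod(directory, new_mode)
    return new_mode


def restore_owner_write(directory):
    """Set the owner's write bit on `directory` again and return the new mode bits."""
    current = stat.S_IMODE(os.stat(directory).st_mode)
    new_mode = current | stat.S_IWUSR
    os.chmod(directory, new_mode)
    return new_mode
```

Run as the site user, this would be called as `drop_owner_write(os.path.expanduser("~/var/ssl"))`, and undone with `restore_owner_write` once the root cause is fixed.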
We are facing the same issue. We have moved our Checkmk instances from version 1.5 → 1.6 → 2.0 → 2.1 → 2.2.0p7. In version 2.2.0p7 we enabled the TLS option in our distributed sites configuration. It works initially, but the file ca-certificates.crt ends up empty after activating other changes. So we did the same and removed the write permissions.
Strangely, we are also having issues with the Teams notification:
requests.exceptions.SSLError: HTTPSConnectionPool(host='<companyname>.webhook.office.com', port=443): Max retries exceeded with url: /webhookb2/55af4a75-0c3f-4bfc-88d7-43b93e3d3efb@0e603135-2ea1-4694-89f4-5c1e8703c2d4/IncomingWebhook/aa6f927aa7044d728b8a501c637be493/765ae110-4de5-46ec-83ed-2b5ad45897d3 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)')))
We also have issues submitting the license information:
Last verification failed (Mode: Manual online verification, Date: 2023-09-12 15:25:32):
[Error] Connection with licensing server (https://license.checkmk.com/api/verify) failed. You need to make sure that your Checkmk can reach the license server. Please check your firewall and proxy settings.
Details: HTTPSConnectionPool(host='license.checkmk.com', port=443): Max retries exceeded with url: /api/verify (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)')))
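Both errors report “unable to get local issuer certificate”, which is what a client produces when it has no usable trusted CA for the server's chain. As a rough check, the sketch below counts how many certificates Python's `ssl` module actually loads from a given file. The function name is hypothetical, and it is an assumption (not confirmed in this thread) that these requests use the same bundle in ~/var/ssl.

```python
import ssl


def count_trusted_certs(cafile):
    """Return how many certificates Python's ssl module loads from `cafile`.

    Returns 0 if the file is missing, unreadable, or contains no certificate.
    Hypothetical helper for illustration; not a Checkmk API.
    """
    try:
        # With cafile given, only that file is loaded, not the system defaults.
        context = ssl.create_default_context(cafile=cafile)
    except OSError:
        # ssl.SSLError is a subclass of OSError, so this also covers
        # OpenSSL rejecting an empty or malformed bundle.
        return 0
    return context.cert_store_stats()["x509"]
```

If this returns 0 for the bundle, any HTTPS client that relies on that file alone cannot verify a server certificate, which would be consistent with the errors above.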
We had to disable TLS for the site connections to prevent the disconnections between the sites that occur after some time when write permissions on ~/var/ssl are removed.
We have “Trust system wide configured CAs” set to ON. We have already tested setting it to OFF and enabling the “Checkmk specific” setting instead. No change.
The point is that CMK trusts the CAs, but truncates the file…
We did the same on our end. But no matter what we try to set up, the behavior remains the same: the file is truncated and stays empty afterwards.
We assume that the upgrade from 1.5/1.6 to 2.X messed something up in the Python setup of the master site (I think there was an upgrade from Python 2.x to Python 3.x). Unfortunately, a backup and restore seems to restore the Python libs as well.
Simply creating a new master site is not feasible with ~2000 hosts / ~90000 services. If there is a way to extract just this data (hosts and checks), I would be happy with that solution too. Alternatively, perhaps someone can explain why we are facing this issue and how to solve it.
We will also try to update all sites to 2.2.0p11.cee soon. I will provide an update once that is done…
Is it also possible for you to open a ticket with us and provide the support diagnostics dump?
We need to look at the Checkmk logs (we may have to raise the log level) and the underlying OS logs.
I can’t reproduce this on a fresh 2.1.0p11.cee, nor after updating a site from 2.0.0p38 > 2.1.0p34 > 2.2.0p11.
Does opening a ticket mean sending an email to feedback@checkmk.com? Hopefully not, because the dump contains quite sensitive data…
Or will it be generated automatically when I submit the diagnostics data?