Main Site in distributed Setup unresponsive and mknotifyd Problems

Hello Community,
we have a distributed setup with notifications send to main site. Today our main site was unresponsive and we got timeouts from apache. This has happened before and now the only solution was to completely reboot the VM the site is running. Only stop mknotifyd separately and then restart the main site doesn’t help. I also cleaned the notify spool folder, where some deferred and corrupted spool files.

Restarting mknotifyd only didn`t help as this seems to be hanging in aquiring lock for multisite.mk.

In mknotifyd.log I see several entries with exit code -15:

2024-05-31 09:33:16,101 [20] [cmk.mknotifyd] processing spoolfile f86b3d34-ecf0-4de6-9b46-4c79dc4fe5f5 failed (temporary): 
2024-05-31 09:33:16,101 [20] [cmk.mknotifyd] running cmk --notify --log-to-stdout spoolfile /omd/sites/main/var/check_mk/notify/spool/6a0384c6-b7c8-4847-bc10-6a16f43e791f
2024-05-31 09:34:16,195 [30] [cmk.mknotifyd] process for spoolfile 6a0384c6-b7c8-4847-bc10-6a16f43e791f timed out after 60 seconds
2024-05-31 09:34:16,201 [20] [cmk.mknotifyd] got exit code -15

After restart of mknotifyd before cleaning old Notifications I find this in log:

2024-05-31 10:15:54,002 [40] [cmk.mknotifyd] Corrupted spoolfile /omd/sites/main/var/check_mk/notify/spool/58066740-05a2-4684-b2dc-51488348e2df:
Traceback (most recent call last):
  File "/omd/sites/main/lib/python3/cmk/cee/mknotifyd/spool_handler.py", line 71, in _process_spool_directory
    content: Message = evaluate_file(spoolfile_path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/cee/mknotifyd/utils.py", line 25, in evaluate_file
    return literal_eval(f.read())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/ast.py", line 66, in literal_eval
    node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval')
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/ast.py", line 52, in parse
    return compile(source, filename, mode, flags,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<unknown>", line 0
    
SyntaxError: invalid syntax

Is someone else having something like this? What else could I do to narrow this down?

Hi Andreas,

if you still have the corrupt spool files i would inspect those and see what the content of these files is.
Maybe that leads to the root cause.
There has been a fix regarding corrupt spool files but that is quite old and I assume that you have a newer version.

Hi Andreas,

it would be helpful to know which version you are using.

In our 2.2.0p26.cme we have the same messages

notify.log:2024-05-29 06:30:24,355 [20] [cmk.base.notify] Creating spoolfile: /omd/sites/site/var/check_mk/notify/spool/07d436a4-fddd-4298-8d57-bb601fd3416a

mknotifyd.log:2024-05-29 06:30:27,445 [40] [cmk.mknotifyd] Corrupted spoolfile /omd/sites/site/var/check_mk/notify/spool/07d436a4-fddd-4298-8d57-bb601fd3416a:
Traceback (most recent call last):
  File "/omd/sites/site/lib/python3/cmk/cee/mknotifyd/spool_handler.py", line 71, in _process_spool_directory
    content: Message = evaluate_file(spoolfile_path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/site/lib/python3/cmk/cee/mknotifyd/utils.py", line 25, in evaluate_file
    return literal_eval(f.read())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/site/lib/python3.11/ast.py", line 64, in literal_eval
    node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval')
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/site/lib/python3.11/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<unknown>", line 0

SyntaxError: invalid syntax
mknotifyd.log:2024-05-29 06:30:31,273 [15] [cmk.mknotifyd] processing spoolfile: /omd/sites/site/var/check_mk/notify/spool/07d436a4-fddd-4298-8d57-bb601fd3416a
mknotifyd.log:2024-05-29 06:30:31,278 [20] [cmk.mknotifyd] running cmk --notify --log-to-stdout spoolfile /omd/sites/site/var/check_mk/notify/spool/07d436a4-fd                                                           dd-4298-8d57-bb601fd3416a
mknotifyd.log:2024-05-29 06:30:35,060 [20] [cmk.mknotifyd] processing spoolfile 07d436a4-fddd-4298-8d57-bb601fd3416a successful: Successfully send event to mail       

Surprisingly, mknotify reports then that the message was successfully delivered and in the “corrupted” directory is only an empty file:

OMD[sites]:~/var/check_mk/notify/corrupted$ ls -l 07d436a4-fddd-4298-8d57-bb601fd3416a
-rw------- 1 sites sites 0 2024-05-29 06:30:24 07d436a4-fddd-4298-8d57-bb601fd3416a

I can’t say for sure whether the message actually arrived in the end.

Unfortunately, I don’t know what is causing this problem either. In this particular case the message contained a lot of data (10k) in the check details, maybe that was too much for Checkmk, who knows.

Regards, Lars

2 Likes

Hi Andre, Hi Lars,
we are using 2.3.0p3 in our Setup, sorry I missed that in m original post.
I archived the notifications before deleting them, so I will have a look inside, is there something special I should have a look for?
@Lars: Is your site getting unresponsive when you have this kind of corrupted notifications?

Regard,
Andreas

We’ve only had a handful of these messages about corrupted spool files, and I’ve only seen them because I was explicitly looking for them after your post. Aside from the fact that we’ve been getting more warnings about failed notifications lately, we haven’t noticed any unusual behavior.

@LaSoe: I did some recheck with the corrupted spool files and didn’t find any problems with them, also putting them into the spool directory will do notification of them.

Unfortunately, we are currently affected by a hanging site again and are still trying to figure out the root cause. It seems to be related to the multisite.mk and the licensing, the apache process is waiting to acquire a lock. Probably the notification is a side-effect of this.

One of the last things we found in web.log is this:

024-06-16 16:55:02,388 [40] [cmk.web 2408411] Exception (kd1284, discovered_host_label_sync)
Traceback (most recent call last):
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/host_label_sync.py", line 229, in _execute_site_sync
    raw_result = do_remote_automation(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 349, in do_remote_automation
    serialized_response = _do_remote_automation_serialized(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 297, in _do_remote_automation_serialized
    response = get_url(
               ^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 514, in get_url
    return get_url_raw(url, insecure, auth, data, files, timeout).text
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 387, in get_url_raw
    "x-checkmk-license-state": get_license_state().readable,
                               ^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/utils/licensing/registry.py", line 73, in get_license_state
    return _make_licensing_handler().state
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/utils/licensing/registry.py", line 49, in _make_licensing_handler
    return _get_licensing_handler().make()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/utils/cee/licensing/cce_handler.py", line 882, in make
    state_file_creation_time = read_creation_time(omd_root)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/utils/cee/licensing/init_state_creation.py", line 27, in read_creation_time
    with locked(state_file_path):
  File "/omd/sites/main/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/utils/store/_locks.py", line 144, in locked
    acquired = acquire_lock(path, blocking)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/utils/store/_locks.py", line 171, in acquire_lock
    with _open_lock_file(path) as fd_new:
  File "/omd/sites/main/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/omd/sites/main/lib/python3/cmk/utils/store/_locks.py", line 186, in _open_lock_file
    os.close(fd)
OSError: [Errno 9] Bad file descriptor
2024-06-16 16:55:02,557 [40] [cmk.web 2408411] Exception (kd4426, discovered_host_label_sync)
Traceback (most recent call last):
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
                ^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/ssl.py", line 455, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/ssl.py", line 960, in _create
    if sock.getsockopt(SOL_SOCKET, SO_TYPE) != SOCK_STREAM:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 88] Socket operation on non-socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/omd/sites/main/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
                ^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/ssl.py", line 455, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/ssl.py", line 960, in _create
    if sock.getsockopt(SOL_SOCKET, SO_TYPE) != SOCK_STREAM:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError(88, 'Socket operation on non-socket'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/host_label_sync.py", line 229, in _execute_site_sync
    raw_result = do_remote_automation(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 349, in do_remote_automation
    serialized_response = _do_remote_automation_serialized(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 297, in _do_remote_automation_serialized
    response = get_url(
               ^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 514, in get_url
    return get_url_raw(url, insecure, auth, data, files, timeout).text
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3/cmk/gui/watolib/automations.py", line 377, in get_url_raw
    response = requests.post(
               ^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/main/lib/python3.12/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError(88, 'Socket operation on non-socket'))
...

Is someone else with a MSP Edition also having this or does someone have an idea what to look for? Our impression is it got worse after adding more customers and sites.

Kind Regards
Andreas

Yes, We have the same problem, also MSP Edition
upgraded from ubuntu20.04 with checkmk 2.2.0.x to a host running debian 2.3.0p6

The Same error in the Logs:

2024-06-24 00:24:01,989 [20] [cmk.mknotifyd] processing spoolfile eb5fa6ad-06d2-4b4b-998b-4f55ca0611e7 failed (temporary): Delivery attempt number 3 timed out after 60s
2024-06-24 00:24:01,989 [20] [cmk.mknotifyd] maximum number of retries for eb5fa6ad-06d2-4b4b-998b-4f55ca0611e7 reached, giving up
2024-06-24 00:24:01,989 [20] [cmk.mknotifyd] running cmk --notify --log-to-stdout spoolfile /omd/sites/multi/var/check_mk/notify/deferred/04d85efc-d250-4ed3-8541-9c8b35f373c4
2024-06-24 00:25:02,065 [30] [cmk.mknotifyd] process for spoolfile 04d85efc-d250-4ed3-8541-9c8b35f373c4 timed out after 60 seconds
2024-06-24 00:25:02,070 [20] [cmk.mknotifyd] got exit code -15
2024-06-24 00:25:02,071 [20] [cmk.mknotifyd] processing spoolfile 04d85efc-d250-4ed3-8541-9c8b35f373c4 failed (temporary): Delivery attempt number 3 timed out after 60s
2024-06-24 00:25:02,071 [20] [cmk.mknotifyd] maximum number of retries for 04d85efc-d250-4ed3-8541-9c8b35f373c4 reached, giving up
2024-06-24 00:29:51,955 [20] [cmk.mknotifyd] -----------------------------------------------------------------
2024-06-24 00:29:51,955 [20] [cmk.mknotifyd] Check_MK Notification Spooler version 2.3.0p6 starting
2024-06-24 00:29:51,956 [20] [cmk.mknotifyd] Log verbosity: 0
2024-06-24 00:29:51,959 [20] [cmk.mknotifyd] Daemonized with PID 273637.
2024-06-24 00:29:51,960 [40] [cmk.mknotifyd] ERROR: %s
Traceback (most recent call last):
  File "/omd/sites/multi/lib/python3/cmk/cee/mknotifyd/main.py", line 305, in main
    with (
  File "/omd/sites/multi/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/omd/sites/multi/lib/python3/cmk/utils/daemon.py", line 104, in pid_file_lock
    lock_with_pid_file(path)
  File "/omd/sites/multi/lib/python3/cmk/utils/daemon.py", line 80, in lock_with_pid_file
    raise MKGeneralException(
cmk.utils.exceptions.MKGeneralException: Failed to acquire PID file lock: Another process is already running

After that the Site in unresponsive. Only a forced killl of the apache2 processes help

Hi Andreas, today I’ve noticed this failure in our web.log too. But we have no MSP Edition, we are running Enterprise Edition on redhat servers. Did you solve this issue somehow?
“2024-09-02 14:44:01,684 [40] [cmk.web 1106996] Exception (site2, discovered_host_label_sync)”
“OSError: [Errno 88] Socket operation on non-socket”

regards
Christian

Hi Christian,
on our site this was more or less a side issue, as our main Problem was a completely hanging main site and only killing the site got it back to a normal state.
We noticed this very frequently with the Version 2.3.0p6. We than had some sessions with checkmk support and the main reasons for this were fixed with this: Werk #17082: Fixed another instance of hanging processes.
Updating to 2.3.0p11 fixed the unavailability of the site and I just checked our logs and the issues with discovered_host_label_sync are also gone.
If you are using a version 2.3.0 with a patch level lower than p11 you should probably update.

Kind regards
Andreas

Hi,
thanks for the feedback. In addition to the error mentioned, we also have a lot of errors of the type bad file descriptor. This is apparently fixed in this work Werk #17080: Hanging background jobs/frozen site. This werk is also mentioned in your link. Currently we are on cee 2.2.0p21. So I think we should go to at least 2.2.0p28.
Thank you!