I’m currently playing around with bulk notifications and observed something surprising:
I got a NOTIFY_LONGSERVICEOUTPUT with ~150K characters.
In global settings we have a configured “Maximum long output size” of 6000 Bytes.
Checkmk version is 2.3.0p39 (Enterprise Edition).
What should be the expected behaviour: the full output, or output truncated to whatever is set in global settings?
Also interesting: for regular notifications everything is passed as an environment variable. When the notification plugin is called, all arguments and environment variables should be limited by the operating system. On our Main instance this is currently
OMD[Main]:~$ getconf ARG_MAX
2097152
So this would be an upper limit for all data that is passed to a notification in non-bulk mode.
The limit is calculated from pagesize. This is usually 4K which gives us 64K (65536). Adding “…\nAttention: Removed remaining content because it was too long.” gives us 65601 characters.
I guess the appended text contains a bug: the “\n” for the newline should be escaped like all other newlines in the output.
Also strange is the fallback calculation, which uses 4046 instead of 4096 as the page size. Maybe someone from the Checkmk team can tell us why they use 4046.
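To double-check the 65601 figure, here is a minimal Python sketch of the arithmetic. It is not Checkmk's code: the `truncate` helper and the factor of 16 pages are assumptions inferred from "4K page size gives 64K", and the suffix is assumed to start with three ASCII dots followed by a real newline character.

```python
# Assumption: 16 pages per value, inferred from "4K gives 64K"; not taken from the Checkmk source.
PAGES = 16
# Assumption: three ASCII dots and an unescaped newline, followed by the message quoted above.
SUFFIX = "...\nAttention: Removed remaining content because it was too long."


def truncate(value: str, page_size: int = 4096) -> str:
    """Hypothetical helper: cut value at PAGES * page_size and append SUFFIX."""
    limit = PAGES * page_size
    if len(value) <= limit:
        return value
    return value[:limit] + SUFFIX


if __name__ == "__main__":
    long_output = "x" * 150_000
    print(len(SUFFIX))                 # length of the appended text
    print(len(truncate(long_output)))  # resulting length with a 4096 byte page
```

With these assumptions the suffix is 65 characters long, so 65536 + 65 = 65601. With the fallback page size of 4046 the same calculation would give 64736 + 65 = 64801.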
One question - what was the real problem in your case?
There are multiple answers to this question. So what brought me there:
We suffer from very slow notifications because of our large configuration. Each notification has a 4-5s delay while Checkmk loads the complete site configuration, although it only needs the notification configuration. When there are larger network hiccups it takes hours until all notifications are processed. We were told to wait for 2.5, which should improve notification speed.
Because our regular notification scripts do not support bulk mode, I’m trying to write a generic wrapper that creates the necessary environment and calls the regular notification script (checkmk-extensions/generic_bulk_wrapper/src/notifications/generic_bulk_wrapper.py at main · mayrstefan/checkmk-extensions · GitHub). Its only purpose is to avoid the previously described performance bottleneck. Occasionally I got an OSError 7 exception because I tried to create an environment which was too large. It worked in non-bulk mode but sometimes failed in bulk mode, which showed me that there is a difference.
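For anyone hitting the same OSError 7 (E2BIG), a wrapper could check the environment before starting the child process. The following is only a sketch: the helper names are made up, and it assumes Linux, where as far as I know a single argument or environment string is capped at 32 pages (MAX_ARG_STRLEN) in addition to the overall ARG_MAX limit.

```python
import errno
import os

# Assumption: Linux caps a single argv/envp string at 32 pages (MAX_ARG_STRLEN).
PAGES_PER_STRING = 32


def per_string_limit() -> int:
    """Assumed per-string limit in bytes, derived from the page size."""
    return PAGES_PER_STRING * os.sysconf("SC_PAGESIZE")


def entry_size(name: str, value: str) -> int:
    """Bytes used by one 'NAME=value' entry, including the trailing NUL."""
    return len(f"{name}={value}".encode("utf-8")) + 1


def oversized_vars(env: dict[str, str], limit: int) -> list[str]:
    """Names of variables whose entry alone exceeds the per-string limit."""
    return [name for name, value in env.items() if entry_size(name, value) > limit]


def total_size(env: dict[str, str]) -> int:
    """Approximate size of the whole environment block in bytes."""
    return sum(entry_size(name, value) for name, value in env.items())


if __name__ == "__main__":
    env = {"NOTIFY_HOSTNAME": "h", "NOTIFY_LONGSERVICEOUTPUT": "x" * 150_000}
    print("E2BIG is errno", errno.E2BIG)
    print("too long:", oversized_vars(env, per_string_limit()))
    print("total:", total_size(env), "ARG_MAX:", os.sysconf("SC_ARG_MAX"))
```

If the per-string assumption holds, a 150K value would fail on a 4K page system even though the total is far below the ARG_MAX of 2097152 shown above. Variables reported by `oversized_vars` could then be shortened before calling the regular notification script.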
I’m trying to understand what is specified, what is documented, and how these pieces of information relate to each other (or not). The problem with undocumented behaviour is that you work with assumptions that may be wrong. And sooner or later something will break.
The slow notification issue is an old one. We can find it, for example, in “Spooled Notifications are to slow (only 1 notif per sec)”, and maybe “Check_MK slow sending notifications” was also related to it. Support told us to reduce notifications, but I don’t see how to do this: the more configuration objects you have, the slower the notifications get. The more things you monitor, the higher the probability that something will change state. So a growing site means more notifications, and they will get slower. We’ll see what we get with 2.5.
The 150kByte mail or something else?
The plugin with the 150K output was the Checkmk builtin NetScaler SNMP plugin.
I don’t know what is planned for 2.5, but in 2.3 and 2.4, when I had this problem, I tried to switch to “Enable synchronous delivery via SMTP” if possible. The biggest problem at the moment is the spooler. If you have a huge amount of notifications then it does not work, there you are right.
Also, in very big environments I try to handle the notifications directly on the distributed nodes and don’t use notification forwarding.
Our notifications are not mails. We forward our notifications as a sort of pseudo-XML data with its own protocol to our central event management system. So we need our own notification scripts to get it formatted correctly before we can pass it to a CLI command that sends it to the central system.
Also we already use the spoolers on each site to decouple things.