Good day all,
I am seeking advice on a request from a customer. It’s not a straightforward request, and I am hoping someone here has dealt with this before and can point me in the right direction before I reinvent the wheel.
Scenario:
-2.3 Enterprise single site
-Customer has a large network with 2,300 critical distribution links.
-These links already have service labels assigned to them, which are used to target them for NOC notifications.
-These links need to send error, bandwidth, operational, and more alerts to the network team and just down/up (nothing else) to the NOC.
-The NOC needs to get an alert even if the interface service is already CRIT before the link goes down (e.g., CRIT on error rate, then the interface goes down). This would not generate a notification, since CRIT > CRIT is not a state change to CheckMK.
-I can’t disable the bandwidth, error rate, and other thresholds for these services. The network team needs to get alerts from these thresholds to know if they need to upgrade a link or may have a degraded link.
I thought about using the “Match check plug-in output” filter in the notification rule. Using .*\(down\) as the match would work for OK|WARN > CRIT when the interface goes down, but it would not alert on the CRIT > CRIT case mentioned above. For the clear, I realize that adding .*\(up\) won’t work: when a bandwidth or other alarm clears (ANY > OK), the output also contains the “up” string, so clears for non-operstatus issues would go to the NOC, which the client said cannot happen. The NOC must only get these very targeted alarms and clears, while the networking team gets all alarms and clears.
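To make the matching problem concrete, here is a small Python sketch. The sample outputs are made up for illustration (they are not copied from a real if64 service), and the helper name is hypothetical; only the two patterns come from the rule idea above.

```python
import re

# Hypothetical plug-in outputs, loosely modeled on interface service summaries.
SAMPLES = {
    "link went down": "[Gi0/1], (down), Speed: 1 GBit/s",
    "bandwidth alarm cleared": "[Gi0/1], (up), Speed: 1 GBit/s, In: 120 MBit/s",
}

DOWN = re.compile(r".*\(down\)")
UP = re.compile(r".*\(up\)")


def matches(pattern, output):
    """Return True if the pattern matches from the start of the output."""
    return pattern.match(output) is not None


# The down pattern only fires on the operational-state problem ...
down_hits = [name for name, out in SAMPLES.items() if matches(DOWN, out)]
# ... but the up pattern also fires on a clear that has nothing to do with operstatus.
up_hits = [name for name, out in SAMPLES.items() if matches(UP, out)]
```

With these samples, the up pattern matches the bandwidth clear, which is exactly the kind of clear the NOC must not receive.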
My next thought was a custom notification script. It would need to compare the service output from the last state change against the current output to see if the operstate changed. That means storing up to 2,300 temp files and parsing the appropriate one on each run of the custom plugin. This is possible, but I do worry about system load, file maintenance, and long-term support. It also still lacks the CRIT > CRIT notification: CheckMK doesn’t consider that a state change, so the notification rules would never be evaluated.
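A minimal sketch of the comparison logic, assuming the state is kept in one small SQLite file rather than 2,300 temp files. The function names, table layout, and the choice to stay quiet on a first-seen “up” are my own assumptions; the NOTIFY_HOSTNAME, NOTIFY_SERVICEDESC and NOTIFY_SERVICEOUTPUT keys are the environment variables a notification script receives. This does not solve the CRIT > CRIT gap, since the script only runs when a notification is triggered at all.

```python
import re
import sqlite3

# Assumption: the operational state appears as "(up)" or "(down)" in the output.
OPER_STATE_RE = re.compile(r"\((up|down)\)")


def parse_oper_state(output):
    """Return 'up' or 'down' if the output carries that marker, else None."""
    found = OPER_STATE_RE.search(output)
    return found.group(1) if found else None


def open_store(path):
    """Open (or create) the single state database; ':memory:' works for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS oper_state ("
        "host TEXT, service TEXT, state TEXT, PRIMARY KEY (host, service))"
    )
    return conn


def oper_state_changed(conn, host, service, output):
    """Store the current operstate and report whether it differs from the last one."""
    current = parse_oper_state(output)
    if current is None:
        return False
    row = conn.execute(
        "SELECT state FROM oper_state WHERE host = ? AND service = ?",
        (host, service),
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO oper_state VALUES (?, ?, ?)",
        (host, service, current),
    )
    conn.commit()
    if row is None:
        # First sighting: only a 'down' is worth forwarding to the NOC.
        return current == "down"
    return row[0] != current


def should_notify_noc(env, conn):
    """Decide from the notification context whether the NOC should be alerted."""
    return oper_state_changed(
        conn,
        env["NOTIFY_HOSTNAME"],
        env["NOTIFY_SERVICEDESC"],
        env.get("NOTIFY_SERVICEOUTPUT", ""),
    )
```

In a real script, env would be os.environ and the store would live somewhere under the site directory; a bandwidth clear with an unchanged “up” state would then be dropped before it reaches the NOC.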
My next thought was a “meta” check plugin that creates 2,300 new services, each using a Livestatus query (LQ) to check the service output of its respective interface for the up/down state. This method would not care if the if64-based service is non-OK for something like a bandwidth breach and would only go CRIT if the ‘down’ string is detected in the output. That’s a logistical nightmare, because we would be wasting a large portion of their license on these meta services and adding a lot of load to the server by running 2,300 LQ queries every minute. Additionally, reassigning the service labels would be a large task, but it is technically possible.
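On the load concern, a rough sketch of fetching all interface outputs with a single Livestatus query instead of one query per service. This is only an idea under stated assumptions: the “^Interface” description pattern and the function names are hypothetical, and the socket path would have to be the site’s Livestatus socket.

```python
import json
import socket


def build_query(description_regex="^Interface"):
    """Build one Livestatus query covering all matching services."""
    return (
        "GET services\n"
        "Columns: host_name description plugin_output\n"
        f"Filter: description ~ {description_regex}\n"
        "OutputFormat: json\n"
        "\n"
    )


def query_livestatus(socket_path, query):
    """Send the query over the Livestatus UNIX socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of query
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return json.loads(b"".join(chunks).decode("utf-8"))


def down_links(rows):
    """Return (host, service) pairs whose output contains the '(down)' marker."""
    return [(host, desc) for host, desc, output in rows if "(down)" in output]
```

The meta services would still be needed to carry the state and notifications, so this only addresses the query load, not the license or label problem.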
As I was writing this, I thought of an explicit SNMPGET active check against the ifOperStatus of each interface, but again that is 2,300 new checks with the same load and management issues. I’d also have to figure out how to bulk-add all these pure SNMP active checks. It would not scale either: any change to their network topology would have to be updated manually.
I know this is a tough ask. This client has not had any significant work for us to do in a while, so I want to come through for them. Any thoughts are appreciated!