We have a local check which is a small PowerShell script. It runs very fast when we test it (many times) from PowerShell, and for the most part it seems to work just fine in Checkmk. But occasionally, we get:
UNKN - Item not found in agent output
I’d like to either fix this or work around it so that we still get alerts, just not for the UNKN state caused by this (or for UNKN in general, as our PowerShell script will never return that anyhow).
Edit: Let me add, we know about “service state translation”, and maybe that’s a partial answer. It’s just that I don’t want to map UNKN to some fixed state; I’d rather have it mapped to the last state before the UNKN (so as not to notify erroneously).
Hi,
It looks like your script does not always return a value for a discovered service. That means the service name of the local script has changed or is missing from the agent output. It can be missing when the script runs into a timeout due to system load or something else.
As mentioned, the script cannot return “UNKN”, it’s not in the script. I do not know how feasible it is, but maybe if there was some indication (somewhere) of why the output didn’t come back at all. I’m willing to accept that Windows is completely unreliable… if that’s the answer.
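One way to get at least some indication is to look at the raw agent output on the Checkmk server (for example with `cmk -d HOSTNAME`) and check whether the line for the service is present in the local section at all. The following is a minimal sketch of that check in Python, not anything Checkmk ships. The function name `find_local_line` is made up for illustration, and the sketch assumes service names without spaces and that cached results may carry a `cached(...)` prefix field.

```python
import re

# Matches the local section header, with or without options such as ":sep(0)".
LOCAL_HEADER = re.compile(r"^<<<local(:[^>]*)?>>>$")


def find_local_line(agent_output: str, service_name: str):
    """Return the local-check line for service_name, or None if it is absent."""
    in_local = False
    for raw in agent_output.splitlines():
        line = raw.strip()
        if line.startswith("<<<"):
            # Every section header switches the "inside local section" flag.
            in_local = bool(LOCAL_HEADER.match(line))
            continue
        if not in_local or not line:
            continue
        fields = line.split()
        # Assumption: a cached result is prefixed with a "cached(...)" field.
        if fields[0].startswith("cached("):
            fields = fields[1:]
        # Local check lines are "<status> <service_name> <metrics> <text>".
        if len(fields) >= 2 and fields[1] == service_name:
            return line
    return None
```

If this returns `None` for a dump taken while the service is UNKN, the script produced no line for that run, which points at the agent side (timeout, script error) rather than at the check itself.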
This happens with the current agent if the script itself generates an error. The result is completely empty output. There are two options I configure for my Windows scripts.
All scripts run asynchronously, and where needed I configure a retry in case an error occurs.
The problem with the current 1.6 agent is that you don’t get error messages from the script.
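Since an empty result is what triggers the UNKN, one possible workaround for the "keep the last state" wish is a small wrapper that remembers the last non-empty output and replays it when a run comes back empty. This is only a sketch under assumptions: the name `emit_with_fallback` and the cache file location are hypothetical, and it is written in Python for illustration (the same idea works in PowerShell).

```python
from pathlib import Path


def emit_with_fallback(fresh_output: str, cache_file: Path) -> str:
    """Return fresh_output if it has content, else the last cached output."""
    lines = [ln for ln in fresh_output.splitlines() if ln.strip()]
    if lines:
        # A real result: remember it for later runs and pass it through.
        text = "\n".join(lines) + "\n"
        cache_file.write_text(text, encoding="utf-8")
        return text
    if cache_file.exists():
        # Empty result: replay the last known good output instead.
        return cache_file.read_text(encoding="utf-8")
    # Nothing cached yet, so there is nothing to replay.
    return ""
```

The trade-off is that a script which is permanently broken would keep reporting its old state, so it is probably wise to also limit how old the cached output may be.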
So, I need better help on this one. Let’s say the script has one line… write-host.
And sometimes it works and sometimes Checkmk returns Unknown. Does anybody have a reasonable guess as to why a local check would come back with no output, even in the simple case of a one-liner outputting a constant successful result?
So, some more on this. Sometimes the checkmk “check” on our Windows boxes can take over 2 minutes to complete. While there are some times when the box is under “load” (mind you nothing compared to what our Linux hosts go through), we haven’t isolated a “root cause” where we can say, yes, that’s the time period or event that causes checkmk’s check to take so long.
As I mentioned in a reply, we quickly found out that occasionally the checkmk check is going past timeout. For now, we just increased the timeout for our Windows hosts. We’ll see if that addresses the issue for us.
Oh, that’s a little strange, as the agent should take about the same amount of time on every run.
If all plugins are configured to run asynchronously, there should be no timeout problem for the agent itself.
Notice the screenshot. You can see where occasionally the check took a long time. I think there are too many variables, too many things that can happen client side that could cause the check to take a long time. Anyway, since raising the timeout, we haven’t seen the issue. Not that it happened often anyway, it’s just that when it did happen, it was alerting a lot during the same day.
You should handle long-running tasks by killing them, e.g. with waitmax on Linux. As Andreas mentioned, the system or the resource you call has a problem.
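On a host without waitmax, the same idea can be sketched as a wrapper that runs the check with a hard time limit and kills it when the limit is exceeded. This is a minimal illustration in Python, assuming an interpreter is available on the host. The function names are made up, and reporting status 3 with an explanatory text on timeout is just one possible choice.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout, or None if it exceeded timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    return result.stdout


def check_output_or_timeout_line(cmd, timeout_s, service_name):
    """Return the check's output, or a local-check line that explains the timeout."""
    out = run_with_timeout(cmd, timeout_s)
    if out is None:
        return f"3 {service_name} - check script exceeded {timeout_s}s and was killed\n"
    return out
```

For a PowerShell check, `cmd` could be something like `["powershell", "-NoProfile", "-File", "my_check.ps1"]`, where `my_check.ps1` is a placeholder for the real script. The service then still goes to UNKN on a timeout, but with a text that says why.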
The strange thing is that no pattern is visible. I have had systems with agent execution problems from time to time, but there was always a pattern (snapshot, backup, or something else).