Local Windows check, we sometimes get "UNKN - Item not found in agent output"

Running 1.6.0p16 CEE.

We have a local check that is a small PowerShell script. It runs very fast when we test it (many times) from PowerShell, and for the most part it works just fine in Checkmk. But occasionally we get:

UNKN - Item not found in agent output

I’d like to either fix this or work around it so that we still get alerts, just not for the UNKN state caused by this (or not for UNKN at all, as our PowerShell script will never return that anyhow).

Edit: Let me add, we know about “service state translation”, and maybe that’s a partial answer. It’s just that I don’t want to map UNKN to some fixed state; I’d rather have it mapped to the last state before the UNKN (so as not to notify erroneously).
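One way to approximate "keep the last state" from inside the script, rather than in Checkmk, is to cache the last good result and replay it when the check logic fails. Below is a minimal sketch in Python (the same structure should carry over to PowerShell). The function `hold_last_state`, the cache file, and the service name `My_Windows_Check` are hypothetical names used only for illustration.

```python
import json


def hold_last_state(run_check, cache_path, service="My_Windows_Check"):
    """Return one local check line, replaying the last good result on failure.

    run_check is a hypothetical callable returning (state, summary),
    where state is 0=OK, 1=WARN, 2=CRIT.
    """
    try:
        state, summary = run_check()
    except Exception as exc:
        # The check logic failed: fall back to the cached result if there is one.
        try:
            with open(cache_path, encoding="utf-8") as fh:
                last = json.load(fh)
        except (OSError, ValueError):
            return f"3 {service} - check failed and no previous result: {exc}"
        return (f"{last['state']} {service} - {last['summary']} "
                f"(stale, last run failed: {exc})")
    # The check worked: remember the result for the next failure.
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump({"state": state, "summary": summary}, fh)
    return f"{state} {service} - {summary}"
```

This only covers failures inside the script itself. If the agent never runs the script, or kills it at its timeout, there is no output at all and the service can still go UNKN.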

Hi,
It looks like your script does not always return a result for a discovered service. That means the service name in the local script’s output has changed or is missing. It can be missing when the script runs into a timeout due to system load or something else.
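For reference, a local check reports each service as one line of the form `<state> <service_name> <perfdata> <summary>`, and Checkmk matches the discovered service by that name. Here is a small sketch in Python of a helper that always emits exactly one line with a constant name. `SERVICE_NAME` and `local_check_line` are hypothetical names, not part of Checkmk.

```python
# Hypothetical service name; it must stay identical on every run,
# otherwise the discovered service is "not found in agent output".
SERVICE_NAME = "My_Windows_Check"


def local_check_line(state: int, summary: str, perfdata: str = "-") -> str:
    """Build one local check line; state is 0=OK, 1=WARN, 2=CRIT, 3=UNKNOWN."""
    if state not in (0, 1, 2, 3):
        raise ValueError(f"invalid state: {state}")
    # Collapse newlines and repeated whitespace so the result is a single line.
    one_line = " ".join(summary.split())
    return f"{state} {SERVICE_NAME} {perfdata} {one_line}"


if __name__ == "__main__":
    print(local_check_line(0, "everything fine"))
```

Routing every exit path of the script through one helper like this makes it harder to accidentally print nothing, or to print a differently named service.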

As mentioned, the script cannot return “UNKN”, it’s not in the script. I do not know how feasible it is, but maybe if there was some indication (somewhere) of why the output didn’t come back at all. I’m willing to accept that Windows is completely unreliable… if that’s the answer.

This happens with the current agent if the script itself generates an error. The result is completely empty output. There are two options I configure for my Windows scripts:
All scripts run asynchronously, and if needed I configure a retry in case an error happens.

The problem with the current 1.6 agent is that you don’t get error messages from the script.

Thanks, I’ll try to do some exception/looping (but not forever) on my end and see.
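The "exception handling plus looping, but not forever" idea could look roughly like the sketch below, written in Python for illustration (a PowerShell version would use `try`/`catch` in a bounded loop). Because the agent does not pass on error messages from the script, the sketch puts them into the summary text instead. `checked_line` and `My_Windows_Check` are hypothetical names, and the attempt count and delay are arbitrary assumptions.

```python
import time


def checked_line(check, service="My_Windows_Check", attempts=3, delay=1.0,
                 sleep=time.sleep):
    """Run check() up to `attempts` times and always return one local check line.

    check is a hypothetical callable returning (state, summary).
    """
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            state, summary = check()
            return f"{state} {service} - {summary}"
        except Exception as exc:
            # Keep the error text so it shows up in Checkmk if all attempts fail.
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < attempts:
                sleep(delay)
    return f"3 {service} - gave up after {attempts} attempts; " + "; ".join(errors)
```

The total retry time (attempts times delay, plus the check's own runtime) has to stay well below the timeout the agent applies to the script, otherwise the retries make the empty-output problem worse.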

So, I need better help on this one. Let’s say the script has one line… write-host.

And sometimes it works and sometimes Checkmk returns Unknown. Does anybody have a reasonable guess as to why a local check would come back with no output, even in the simple case of a one-liner outputting a constant successful result?

So, some more on this. Sometimes the checkmk “check” on our Windows boxes can take over 2 minutes to complete. While there are some times when the box is under “load” (mind you nothing compared to what our Linux hosts go through), we haven’t isolated a “root cause” where we can say, yes, that’s the time period or event that causes checkmk’s check to take so long.

That would be strange. If you only have a write-host, this should work every time.
How are your default agent settings configured?

I had only one system with plugin execution problems, and there it was just the agent updater being blocked by the antivirus software :smiley:

I’m adding my default config, which I use when I have no agent bakery, only as a reference in case you want to compare.
This config runs on some thousand hosts in my systems.

global:
    only_from: MON-HOST-IP
    async: yes
    sections: 
        - check_mk 
        - spool 
        - plugins
        - local
        - winperf 
        - uptime 
        - systemtime 
        - df 
        - mem 
        - services 
        - msexch
        - dotnet_clrmemory
        - wmi_webservices
        - wmi_cpuload
        - ps 
        - fileinfo 
        - logwatch 

ps:
    enabled: yes
    use_wmi: yes
    # full_path: yes # only if needed

winperf:
    enabled: yes
    counters:
        - 638: tcp_conn
        - Terminal Services: ts_sessions

logwatch:
    enabled: yes
    logfile:
        - 'Application': warn nocontext
        - 'System': warn nocontext
        - '*': off

plugins:
    enabled: yes
    execution:
        - pattern     : '$BUILTIN_PLUGINS_PATH$\windows_updates.vbs'
          timeout     : 3600
          async       : yes
          cache_age   : 90000
          run         : yes

        - pattern     : '$BUILTIN_PLUGINS_PATH$\mk_inventory.vbs'
          async       : yes
          run         : yes

        - pattern     : '$BUILTIN_PLUGINS_PATH$\windows_if.ps1'
          async       : yes
          run         : yes

        - pattern     : '$BUILTIN_PLUGINS_PATH$\windows_tasks.ps1'
          async       : yes
          run         : yes

        - pattern     : '$CUSTOM_PLUGINS_PATH$\*.*'
          async       : yes
          timeout     : 30
          run         : yes

        - pattern     : '$BUILTIN_PLUGINS_PATH$\*.*'
          timeout     : 30
          run         : no

        - pattern     : '*'
          run         : no

local:
    enabled: yes
    execution:
        - pattern     : '*.*'
          run         : yes

As I mentioned in a reply, we quickly found out that occasionally the checkmk check is going past timeout. For now, we just increased the timeout for our Windows hosts. We’ll see if that addresses the issue for us.

Oh, that’s a little bit strange, as the agent should take about the same amount of time on every run.
If all plugins are configured to run asynchronously, there should be no timeout problem for the agent itself.

Notice the screenshot. You can see where occasionally the check took a long time. I think there are too many variables, too many things that can happen client side that could cause the check to take a long time. Anyway, since raising the timeout, we haven’t seen the issue. Not that it happened often anyway, it’s just that when it did happen, it was alerting a lot during the same day.

There must be some kind of problem on this machine itself. But I have no systems with problems like this.

You should handle long-running tasks by killing them, e.g. with waitmax on Linux. As Andreas mentioned, the system or the resource you call has a problem.
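As an illustration of the "kill it after a deadline" approach, here is a minimal sketch in Python using `subprocess.run` with its `timeout` parameter, which kills the child process when the deadline passes. The helper name `run_with_deadline` and the example command are hypothetical; this is not how waitmax or the Checkmk agent is implemented.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return (finished, stdout); the child is killed on timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return False, ""
    return True, proc.stdout.strip()


if __name__ == "__main__":
    # Hypothetical slow task: sleeps longer than the 1 second deadline.
    finished, out = run_with_deadline(
        [sys.executable, "-c", "import time; time.sleep(5)"], 1)
    state = 0 if finished else 3
    print(f"{state} My_Windows_Check - finished={finished} {out}")
```

Note that this only kills the direct child process. If the command spawns further processes, those may need separate handling.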

Yes, but the problem is Windows under load. Most people don’t put their Windows (Windows 10 in this case) under any “real” load, just my opinion.

The strange thing is that no pattern is visible. I have had systems with agent execution problems from time to time, but there was always a pattern (snapshot, backup, or something else).

Yeah, the variable is (likely) user load in our case.