Special agent live check status

Hi All,
I have just written a new special agent. It is able to discover the services and performs the service check correctly the first time. After that, when my services change, the "Checked" time refreshes frequently but the service status always stays the same; it does not follow the changes. I only see the new status after running cmk -vvII hostname and then cmk -vv hostname again.

Below are my example discovery and check functions:

def discovery_services(section):
    for data in section:
        svc = ast.literal_eval(data)
        yield (svc['Service'], svc)

def check_services(item, params, parsed):
    status = {"OK": 0, "SLOW": 1, "DEGRADED": 1, "UNRESPONSIVE": 2, "ERROR": 2}
    svc_status = status.get(params['Status'])
    state = {"RUNNING": 0, "STOPPED": 2, "DOWN": 2}
    svc_state = state.get(params['State'])
    if all((zero:= x)==0 for x in [params['CMKStatusCode'], svc_status, svc_state]):
        yield [0, "The APP is OK"]
    else:
        if params['Description'] != '':
            yield [params['CMKStatusCode'], params['Description']]
        if isinstance(svc_status,int):
            yield [svc_status, f"Status {params['Status']}"]
        if isinstance(svc_state,int):
            yield [svc_state, f"State {params['State']}"]

check_info["my_services"] = {
    "parse_function": parse_services,
    "service_description": "%s",
    "inventory_function": discovery_services,
    "check_function": check_services,
}

Your check function doesn’t use the agent output (parsed) at all. Instead, it seems to determine the status of the check and its output solely from the parameters (params), i.e. from whatever is configured in a WATO plugin.
But since the registration of the service doesn’t refer to any WATO plugin, I doubt this can work.

Also, the check function doesn’t use any item, but is configured as a multi-item service (indicated by the %s in the service description).

Hi Dirk,

I wondered about that as well, but discovery worked. For example, this is the result from

curl https://silly-debug.monitoring.svc.cluster.local:5000/status

[{"CMKStatusCode":0,"Description":"","RequestDuration":0.09732,"Service":"app1","State":"RUNNING","Status":"OK"},{"CMKStatusCode":1,"Description":"","RequestDuration":0.021372,"Service":"app2","State":"RUNNING","Status":"DEGRADED"},{"CMKStatusCode":0,"Description":"","RequestDuration":0.195038,"Service":"app3","State":"RUNNING","Status":"OK"},{"CMKStatusCode":2,"Description":"check-mk-agent - Returned HTTP Error 503 Server Error: Service Temporarily Unavailable for url: http://localhost/internal/status ","RequestDuration":0.0,"Service":"app4","State":"","Status":"ERROR"}]

And for the special agent, when I ran:

cmk --debug -vvII localhost

Discovering services and host labels on: localhost
localhost:
+ FETCHING DATA
  Source: SourceInfo(hostname='localhost', ipaddress=None, ident='special_my_services', fetcher_type=<FetcherType.SPECIAL_AGENT: 6>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f434de09f50]
Read from cache: AgentFileCache(localhost, path_template=/omd/sites/la/tmp/check_mk/data_source_cache/special_my_services/{hostname}, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=1)
[ProgramFetcher] Execute data source
Calling: /omd/sites/la/share/check_mk/agents/special/agent_my_services --protocol https --user anonymouse --password '******' --services services --port 5000 --instance silly-debug.monitoring.svc.cluster.local/status --environment dev --site la
[cpu_tracking] Stop [7f434de09f50 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.33, children_system=0.13, elapsed=0.8899999996647239))]
  Source: SourceInfo(hostname='localhost', ipaddress=None, ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f434de08090]
Read from cache: NoCache(localhost, path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
[PiggybackFetcher] Execute data source
No piggyback files for 'localhost'. Skip processing.
[cpu_tracking] Stop [7f434de08090 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
<<<my_services:sep(0)>>> / Transition NOOPParser -> HostSectionParser
Transition HostSectionParser -> NOOPParser
  HostKey(hostname='localhost', source_type=<SourceType.HOST: 1>)  -> Add sections: ['my_services']
  HostKey(hostname='localhost', source_type=<SourceType.HOST: 1>)  -> Add sections: []
Received no piggyback data
+ ANALYSE DISCOVERED HOST LABELS
Trying host label discovery with: my_services
Trying host label discovery with: 
SUCCESS - Found no host labels
+ ANALYSE DISCOVERED SERVICES
+ EXECUTING DISCOVERY PLUGINS (1)
  Trying discovery with: my_services
  4 my_services
SUCCESS - Found 4 services

And the check

cmk --debug -vv localhost

Checkmk version 2.2.0p27
+ FETCHING DATA
  Source: SourceInfo(hostname='localhost', ipaddress=None, ident='special_my_services', fetcher_type=<FetcherType.SPECIAL_AGENT: 6>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f8b95dde210]
Read from cache: AgentFileCache(localhost, path_template=/omd/sites/la/tmp/check_mk/data_source_cache/special_my_services/{hostname}, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=6)
Not using cache (Too old. Age is 58 sec, allowed is 0 sec)
[ProgramFetcher] Execute data source
Calling: /omd/sites/la/share/check_mk/agents/special/agent_my_services --protocol https --user anonymouse --services services --port 5000 --instance silly-debug.monitoring.svc.cluster.local/status --environment dev --site la
Write data to cache file /omd/sites/la/tmp/check_mk/data_source_cache/special_my_services/localhost
Trying to acquire lock on /omd/sites/la/tmp/check_mk/data_source_cache/special_my_services/localhost
Got lock on /omd/sites/la/tmp/check_mk/data_source_cache/special_my_services/localhost
Releasing lock on /omd/sites/la/tmp/check_mk/data_source_cache/special_my_services/localhost
Released lock on /omd/sites/la/tmp/check_mk/data_source_cache/special_my_services/localhost
[cpu_tracking] Stop [7f8b95dde210 - Snapshot(process=posix.times_result(user=0.009999999999999787, system=0.0, children_user=0.36, children_system=0.11, elapsed=1.0299999993294477))]
  Source: SourceInfo(hostname='localhost', ipaddress=None, ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f8b9530b390]
Read from cache: NoCache(localhost, path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
[PiggybackFetcher] Execute data source
No piggyback files for 'localhost'. Skip processing.
[cpu_tracking] Stop [7f8b9530b390 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
<<<my_services:sep(0)>>> / Transition NOOPParser -> HostSectionParser
Transition HostSectionParser -> NOOPParser
  HostKey(hostname='localhost', source_type=<SourceType.HOST: 1>)  -> Add sections: ['my_services']
  HostKey(hostname='localhost', source_type=<SourceType.HOST: 1>)  -> Add sections: []
Received no piggyback data
[cpu_tracking] Start [7f8b95bc8090]
value store: synchronizing
Trying to acquire lock on /omd/sites/la/tmp/check_mk/counters/localhost
Got lock on /omd/sites/la/tmp/check_mk/counters/localhost
value store: loading from disk
Releasing lock on /omd/sites/la/tmp/check_mk/counters/localhost
Released lock on /omd/sites/la/tmp/check_mk/counters/localhost
value store: synchronizing
Trying to acquire lock on /omd/sites/la/tmp/check_mk/counters/localhost
Got lock on /omd/sites/la/tmp/check_mk/counters/localhost
value store: already loaded
Releasing lock on /omd/sites/la/tmp/check_mk/counters/localhost
Released lock on /omd/sites/la/tmp/check_mk/counters/localhost
app1                 The APP is OK
app2                 Status DEGRADED(!), State RUNNING
app3                 The APP is OK
app4                 check-mk-agent - Returned HTTP Error 503 Server Error: Service Temporarily Unavailable for url: http://localhost/internal/status(!!), Status ERROR(!!)
No piggyback files for 'localhost'. Skip processing.
[cpu_tracking] Stop [7f8b95bc8090 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.010000000707805157))]
[special_my_services] Success, [piggyback] Success (but no data found for this host), execution time 1.0 sec | execution_time=1.040 user_time=0.010 system_time=0.000 children_user_time=0.360 children_system_time=0.110 cmk_time_ds=0.550 cmk_time_agent=0.000

It looks like the check function “check_services” is never called when I execute the command below. Could that be what causes the issue?

cmk --debug -vv localhost

Here you save the current data inside the parameters for this check.
Normally this should be something like

yield (svc['Service'], None)

Nice find. I didn’t notice that. It means the check function is called with the params that the discovery function set at discovery time, and the current agent output (obtained every minute or so) isn’t used by the check function at all.

The check function should do something like this:

def check_services(item, params, parsed):

    for data in parsed:
        if data["Service"] == item:
            # now do your checks with data and yield, yield, yield
            # them and then return:
            yield (0, "...")
            return
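
A fuller sketch of the same idea, reusing the status and state mappings from the original post. It assumes that parsed is already a list of dicts, one per service, with the keys shown in the curl output; the constant names STATUS_MAP and STATE_MAP are made up for illustration. Discovery yields only the item, and the check reads everything from the current agent data.

```python
# Mappings taken from the check function in the original post.
STATUS_MAP = {"OK": 0, "SLOW": 1, "DEGRADED": 1, "UNRESPONSIVE": 2, "ERROR": 2}
STATE_MAP = {"RUNNING": 0, "STOPPED": 2, "DOWN": 2}


def discovery_services(parsed):
    # Only the item name is stored at discovery time, no live data.
    for data in parsed:
        yield data["Service"], None


def check_services(item, params, parsed):
    # Assumption: parsed is a list of dicts, one per service.
    for data in parsed:
        if data["Service"] != item:
            continue
        svc_status = STATUS_MAP.get(data["Status"])
        svc_state = STATE_MAP.get(data["State"])
        if data["CMKStatusCode"] == 0 and svc_status == 0 and svc_state == 0:
            yield 0, "The APP is OK"
            return
        if data["Description"]:
            yield data["CMKStatusCode"], data["Description"]
        if svc_status is not None:
            yield svc_status, f"Status {data['Status']}"
        if svc_state is not None:
            yield svc_state, f"State {data['State']}"
        return
```

Because nothing from the agent output is stored in params any more, a status change shows up on the next regular check without rediscovering the host.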

Is this from the discovery function or the check function? The section parameter in my discovery function contained a dict:

{ 
  "{'CMKStatusCode': 0, 'Description': '', 'RequestDuration': 0.09732, 'Service': 'app1', 'State': 'RUNNING', 'Status': 'OK'}": [], 
  "{'CMKStatusCode': 1, 'Description': '', 'RequestDuration': 0.021372, 'Service': 'app2', 'State': 'RUNNING', 'Status': 'DEGRADED'}": [], 
  "{'CMKStatusCode': 0, 'Description': '', 'RequestDuration': 0.195038, 'Service': 'app3', 'State': 'RUNNING', 'Status': 'OK'}": [], 
  "{'CMKStatusCode': 2, 'Description': 'check-mk-agent - Returned HTTP Error 503 Server Error: Service Temporarily Unavailable for url: http://localhost/internal/status ', 'RequestDuration': 0.0, 'Service': 'app4', 'State': '', 'Status': 'ERROR'}": []
}

If I loop through the section and yield as below, then the item in my check function contains only app1, app2, app3 or app4, and params is empty.

yield (svc['Service'], None)

Hi Dirk,

parsed is a string, so I have to convert it to a dict before I can use data["Service"]. Is it normal that the parsed input is a string by default instead of a dict?

The line is in your discovery function.

These are not params but the data you get.

I think you are mixing something up here. Parameters are only for things like a state you want to remember from discovery time, for example that an interface had the status “up” at discovery time.
Then your check can see whether the state has changed.

What is missing in your code is the parsing part.

This conversion should be done by the parse function and not inside the discovery function.

It also matters what your section looks like before any parsing happens.
The section header is the relevant part here.
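
As a rough illustration of moving the conversion into the parse function: the debug output shows the header <<<my_services:sep(0)>>>, so the sketch below assumes each agent line reaches the parse function unsplit as a one-element list, and that the agent prints the JSON array from the status endpoint on a single line. Both points are assumptions, since the raw agent output is not shown in the thread.

```python
import json


def parse_services(info):
    """Convert the raw section lines into a list of dicts, one per service."""
    parsed = []
    for line in info:
        if not line:
            continue  # skip empty lines
        # Assumption: line[0] holds one complete JSON array of service records.
        parsed.extend(json.loads(line[0]))
    return parsed
```

With the conversion done here, the discovery and check functions receive real dicts and no longer need ast.literal_eval.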


Got it, thanks @Dirk and @andreas-doehler. I managed to fix this by updating the discovery and check functions based on what you both suggested.
