RobotMK plugin causing checkMK agent to timeout

TheodoreG · October 11, 2023, 11:26am

I’ve configured the RobotMK plugin to run some robot framework suites on a desktop machine, to perform end to end checks of some websites we host.

When one or two of the suites are enabled, everything works as expected. When three or more suites are enabled, the checkmk agent times out completely, and we are unable to receive any updates for any service on this machine (related to the plugin or not).

The robot framework tests are still being run by the checkmk agent it would seem, as the logs for these are present and look as expected. I’m unable to see anything incriminating in the agent logs either. My assumption is the agent and plugin are running as expected, but are taking too long, causing the checkmk agent service to timeout and not receive any data.

I have attempted to modify cache and timeout settings in both the plugin configuration (robotmk.yml) and checkmk_mk.user.yml, but so far this does not appear to have any impact.

CMK version: 2.1.0p10
OS version: Windows 10 22H2

Error message: (Service Check Timed Out)

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

Checkmk version 2.1.0p10
Try license usage history update.
Trying to acquire lock on /omd/sites/redacted/var/check_mk/license_usage/next_run
Got lock on /omd/sites/redacted/var/check_mk/license_usage/next_run
Trying to acquire lock on /omd/sites/redacted/var/check_mk/license_usage/history.json
Got lock on /omd/sites/redacted/var/check_mk/license_usage/history.json
Next run time has not been reached yet. Abort.
Releasing lock on /omd/sites/redacted/var/check_mk/license_usage/history.json
Released lock on /omd/sites/redacted/var/check_mk/license_usage/history.json
Releasing lock on /omd/sites/redacted/var/check_mk/license_usage/next_run
Released lock on /omd/sites/redacted/var/check_mk/license_usage/next_run
+ FETCHING DATA
  Source: SourceType.HOST/FetcherType.PIGGYBACK
[cpu_tracking] Start [7f0fc06aefd0]
[PiggybackFetcher] Fetch with cache settings: NoCache(robot-desktop, base_path=/omd/sites/redacted/tmp/check_mk/data_source_cache/piggyback, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=True, use_outdated=False, simulation=False)
Not using cache (Cache usage disabled)
[PiggybackFetcher] Execute data source
No piggyback files for 'robot-desktop'. Skip processing.
No piggyback files for '192.168.0.xx'. Skip processing.
Not using cache (Cache usage disabled)
[cpu_tracking] Stop [7f0fc06aefd0 - Snapshot(process=posix.times_result(user=0.010000000000000009, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
  Source: SourceType.HOST/FetcherType.PIGGYBACK
No persisted sections
  -> Add sections: []
Received no piggyback data
[cpu_tracking] Start [7f0fc06aecd0]
value store: synchronizing
Trying to acquire lock on /omd/sites/redacted/tmp/check_mk/counters/robot-desktop
Got lock on /omd/sites/redacted/tmp/check_mk/counters/robot-desktop
value store: loading from disk
Releasing lock on /omd/sites/redacted/tmp/check_mk/counters/robot-desktop
Released lock on /omd/sites/redacted/tmp/check_mk/counters/robot-desktop
No piggyback files for 'robot-desktop'. Skip processing.
No piggyback files for '192.168.0.xx'. Skip processing.
[cpu_tracking] Stop [7f0fc06aecd0 - Snapshot(process=posix.times_result(user=0.010000000000000009, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.02000000048428774))]
execution time 0.0 sec | execution_time=0.020 user_time=0.020 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.000

simonm · October 11, 2023, 12:33pm

Can you show me the first graph of the “Robotmk” service on the respective host?
Regards,
Simon

TheodoreG · October 11, 2023, 1:44pm

I’ve enabled a single suite and restarted the agent. Here’s what the robotmk service graph looks like under those conditions.

If I enable more suites then the agent failing means none of the services update, so I can’t provide graphs/data from checkmk for this.

The cache time of the plugin is currently set to 180 seconds, with the single enabled suite pictured in that graph taking ~16 seconds to run.

simonm · October 11, 2023, 2:06pm

Ok, when you summarize the runtimes of the other suites, what total runtime do you get?
It should be lower than the execution interval. With “lower”, I mean: with enough headroom.

Just for curiosity: would it solve the problem if you set the interval to 5 minutes and schedule all suites again?

(I know that this is a big pain point in Robotmk v1. Parallel execution is something which we are currently working on )

TheodoreG · October 12, 2023, 9:17am

The suites are pretty similar, so are all in the 16-20 second region (usually 16-17 seconds). If all four are enabled, that would be a 64-80 second execution time. Three suites is the breaking point, which would be 48-60 second excution time.

I agree with setting a liberal interval for checks would be a possible solution. Unfortunately I’m having trouble finding how to set this? I’ve gone through check_mk.user.yml but I’m only able to see configuration options for plugin timeout settings, which don’t appear to be having any effect. (Apologies if this is me being blind, but I’m struggling to find documentation on this).

simonm · October 12, 2023, 9:36am

Unfortunately I’m having trouble finding how to set this?

I assume you have used the bakery to deploy the Robotmk confguration to the host?
You can find the (async!) execution interval of the robotmk_runner plugin (which in fact does the execution) inside of this rule.

Let me know if this helps!
Best regards,
Simon

TheodoreG · October 12, 2023, 12:38pm

Ah I see, I thought it was a checkmk agent setting rather than a plugin one. Thanks very much for clarifying.

I’ve tried setting this but unfortunately no change. The robotmk log files for each suite look to still be generating every minute or so - should these not be executed after 5 minutes (to match the execution interval) if I’ve configured this correctly?

simonm · October 12, 2023, 12:57pm

There are two types of Robotmk plugins and hence two different log files:

robotmk_runner.log - log file of the “runner” plugin, which is executed asynchronously in the interval you are setting in the bakery. The runner is the script which executes suite after suite. That’s why the execution interval must always be higher than the total estimated runtime of all suites. Each suite result gets written into a JSON file in /log/robotmk/suitename.json
robotmk.log - log file of the plugin in /plugins which is executed synchronous, every minute. It reads the JSON result files and produces the agent output from that.

So, if you change the exec interval of the runner plugin to e.g. 10 minutes, the JSON result files should be overwritten every 10 minutes.

TheodoreG · October 26, 2023, 10:12am

Thanks for clarifying, that makes sense.

Despite setting the execution interval in the robotmk.yml file, it seems like it’s not being respected by the plugin.

I should have clarified previously but I’m using raw, so I don’t have access to the bakery and have configured the below config manually. Is it possible I’ve misconfigured it? Unfortunately I can’t seem to track down examples of how the robotmk.yml should look (though I must have found one at one point to have configured this initially):

global:
  context: local
  logdir: C:\ProgramData\checkmk\agent\robotmk
  tmpdir: C:\ProgramData\checkmk\agent\robotmk
  resultdir: C:\ProgramData\checkmk\agent\results
  robotdir: C:\ProgramData\checkmk\agent\robot
  log_level: INFO
  log_retention: '1'
  agent_output_encoding: utf_8
  cache_time: 180
  execution_mode: agent_serial
  execution_interval: 300
  robotframework:
    console: ''
    log: ''
    report: ''
suites: {}

simonm · October 26, 2023, 11:01am

That configuration looks rather old
Just as a suggestion: you could install the CMK free edition, install Robotmk there and use this instance to just generate the YML config. In this way you have the guarantee to have a working config - or, at least a template.

TheodoreG · October 27, 2023, 12:48pm

I’ve followed your advice to generate the following configuration, but still had issues:

# Created by Check_MK Agent Bakery.
# This file is managed via WATO, do not edit manually or you
# lose your changes next time when you update the agent.

# This file is part of Robotmk, a module for the integration of Robot
# framework test results into Checkmk.
#
# https://robotmk.org
# https://github.com/elabit/robotmk
# https://robotframework.org/
# ROBOTMK VERSION: v1.4.2

global:
  execution_mode: agent_serial
  agent_output_encoding: zlib_codec
  transmit_html: false
  log_level: INFO
  log_rotation: 3
  robotdir: C:\ProgramData\checkmk\agent\robot
  cache_time: 1800
  execution_interval: 900
suites: {}

Unfortunately the issue still persisted. I have opted for running the robotmk-runner.py every five minutes via task scheduler, which looks to have resolved the issue.

simonm · October 27, 2023, 1:23pm

Hi, using the task scheduler is the right (and only) way to go, when you have a UI application which must be started with the user which is logged on currently.

I think I see the problem.

What has mislead you is the execution_interval in robotmk.yml: it does not configure the Agent. It is used to be read by the Robotmk agent plugin to include the execution interval in the agent data, so that the check on the CMK server knows this interval. (The “Robotmk” graph for example shows this metric).

As you are using the raw edition and hence no bakery, you do not only get robotmk.yml generated, but also no CMK agent configuration. This is from there the agent knows the interval.

=> just grab the MSI installer of the CEE and install it to see how the complete agent config should look like.

PRO TIP: Sandboxie is an awesome tool for such tests. You can install the MSI in a sandbox, and then inspect the sandbox to see the changes on the system (FS+Registry)
After that - delete the sandbox.
In many cases this is much faster than handling with snapshots of a VM.