While merging all the *nix agent scripts together (see PR#28), I went through a lot of undocumented, uncommented code and figured out what its intent was. In a few cases, I had to dig through old git commits, werks and mailing list archives to figure out “what were they thinking?!” I’ve added comments to explain non-obvious code, refactored questionable code, and mentioned sources where I can (e.g. “# see werks #[number]”).
As I’ve been splitting that code out (see Restructure the nix agents) I’ve been re-reading this code and thinking about how to simplify and/or improve the handling of MRPE, local and plugin checks, while possibly merging in some ideas from previous work that I’ve done around local checks.
So here’s a small demonstration:
▒░$ cat ../etc/checkmk-local.conf
# shellcheck shell=ksh
# vim: noai:ts=4:sw=4:expandtab
# Copyright (C) 2019 tribe29 GmbH - License: GNU General Public License v2
# This file is part of Checkmk (https://checkmk.com). It is subject to the terms and
# conditions defined in the file COPYING, which is part of this source code package.
####################################################################################################
# Config file for local and plugin checks
# Expected format is 9 colon delimited fields
# 1:2:3:4:5:6:7:8:9, where:
# 1: Hostname or ALL, defines if the check is allowed to run
# 2: Service description. If blank, defaults to the script filename
# 3: Script filename
# 4: Script args
# 5: Cache time, for scripts that don't need to (or shouldn't) run often
# 6: Run As
# 7: ITIL code
# 8: Resolver Group. Defines the team responsible for reacting to alerts
# 9: Trigger command. A command to run if a script meets a condition
ALL:Chrome_Process_Count:check_procs:chrome 20:::::
The observant will guess that this is the shadow config format with some sudoers inspiration.
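To make the field layout concrete, here is a minimal sketch of how one config line could be split in POSIX sh. The function name `parse_check_line` and the `cfg_*` variable names are made up for illustration and are not what the agent actually uses.

```shell
# Hypothetical helper: split one config line into the nine documented fields.
parse_check_line() {
    # IFS=: applies only to this read, so empty fields between colons survive.
    # If field 9 itself contains colons, the last variable keeps them.
    IFS=: read -r cfg_host cfg_service cfg_script cfg_args cfg_cache \
        cfg_runas cfg_itil cfg_group cfg_trigger <<EOF
$1
EOF
    # Field 2 defaults to the script filename when it is left blank.
    [ -n "$cfg_service" ] || cfg_service=$cfg_script
}

parse_check_line 'ALL:Chrome_Process_Count:check_procs:chrome 20:::::'
printf '%s runs %s with args "%s"\n' "$cfg_service" "$cfg_script" "$cfg_args"
```

Because the split is a single `read`, a blank service description costs nothing extra to handle: the default from field 3 is one line.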
For the sake of this demonstration, I’ve got a simple local check that counts the number of processes and alerts if the expected number is incorrect. This has an increasingly arguable use on the server side, but as I’m demonstrating on my desktop, I’ll use Chrome as a target process. We can see here that it’s outputting in the standard local check format:
▓▒░$ ../local/check_procs chrome 20
2 check_procs - Process chrome not running 20 times
Because the config file isn’t involved here, the script name is used for the service name. One of the upsides of the config file approach is that it allows scripts to be re-used with different args for different checks e.g.
ALL:root password age:check_pwage:root:43200::SEC:Systems Team:
ALL:grid password age:check_pwage:grid:43200::SEC:DBA Team:
So now, instead of blindly executing whatever is in a local or plugins directory, the config file becomes our source of truth: if a check is defined in the config file, it’s attempted. You can have 20 scripts in your local directory, but if only one is defined in the config file, only that one will be attempted.
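The "config file as source of truth" idea can be sketched roughly like this. The function `list_enabled_checks` is hypothetical, and the matching rule (field 1 is either `ALL` or exactly the hostname passed in) is my assumption of the simplest reading of field 1.

```shell
# Hypothetical helper: print "script args" for every check this host may attempt.
# $1: path to the config file, $2: hostname to compare against field 1.
list_enabled_checks() {
    while IFS=: read -r host service script args rest; do
        case $host in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        if [ "$host" = "ALL" ] || [ "$host" = "$2" ]; then
            # Only append the args when there are any.
            printf '%s\n' "$script${args:+ $args}"
        fi
    done < "$1"
}
```

Nothing here ever lists the local or plugins directory, so a script that sits there without a config entry is simply never seen. Something like `list_enabled_checks ../etc/checkmk-local.conf "$(hostname)"` would give the list of checks to attempt.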
This approach, and the config file syntax, has advantages both where config management is used and where it is not.
So what we can also do is capture the output of existing check scripts, blend that output with the information from the config file and put it into a JSON output that looks something like this:
▓▒░$ ./checkmk_agent | tail -n 1 | jq
{
  "Local Checks": [
    {
      "service_name": "Chrome_Process_Count",
      "rc": 2,
      "status": "CRITICAL",
      "stdout": "Process chrome not running 20 times",
      "metrics": "-",
      "check_type": "local",
      "script_name": "check_procs",
      "script_args": "chrome 20",
      "script_runas": "rawiri",
      "cache_time": 0,
      "service_owner": null
    }
  ],
  "timestamp": {
    "utc_epoch": 1585510469
  }
}
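As a rough sketch of the blending step, the snippet below takes a service name from the config plus one line of standard local check output and emits a cut-down JSON object. The helper names are hypothetical, only a handful of the keys shown above are produced, and the labels for return codes other than 2 are assumed from the usual Nagios-style convention. The escaping covers backslashes and double quotes only, so this is not a complete JSON encoder.

```shell
# Map a return code to a status label (labels other than CRITICAL are assumed).
rc_to_status() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Escape backslashes and double quotes so a string can sit inside JSON quotes.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# $1: service name from the config, $2: one line of local check output
local_to_json() {
    # Local check format: <rc> <name> <metrics> <free text>
    read -r rc _name metrics text <<EOF
$2
EOF
    printf '{"service_name":"%s","rc":%s,"status":"%s","stdout":"%s","metrics":"%s"}\n' \
        "$(json_escape "$1")" "$rc" "$(rc_to_status "$rc")" \
        "$(json_escape "$text")" "$(json_escape "$metrics")"
}

local_to_json Chrome_Process_Count '2 check_procs - Process chrome not running 20 times'
```

The service name comes from the config rather than from the script's own output, which is what lets one script back several differently named services.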
In other words: with this approach, MRPE and local checks can be, for all intents and purposes, handled exactly the same, check parameters and all. I’m also anticipating that people may want their checks to output JSON directly in the future; that output would be passed straight through.
So does something like this have any appeal? Any questions/feedback/suggestions?