Defining a config file and an NDJSON standard for local checks

While merging all the *nix agent scripts together (see PR#28), I went through a lot of undocumented, uncommented code and figured out what its intent was. In a few cases, I had to dig through old git commits, werks and mailing list archives to figure out “what were they thinking?!” I’ve added comments to explain non-obvious code, refactored questionable code, and mentioned sources where I can (e.g. “# see werks #[number]”).

As I’ve been splitting that code out (see Restructure the nix agents), I’ve been re-reading it and thinking about how to simplify and/or improve the handling of MRPE, local and plugin checks, while possibly merging in some ideas from previous work I’ve done around local checks.

So here’s a small demonstration:

▓▒░$ cat ../etc/checkmk-local.conf
# shellcheck shell=ksh
# vim: noai:ts=4:sw=4:expandtab

# Copyright (C) 2019 tribe29 GmbH - License: GNU General Public License v2
# This file is part of Checkmk (https://checkmk.com). It is subject to the terms and
# conditions defined in the file COPYING, which is part of this source code package.
####################################################################################################

# Config file for local and plugin checks
# Expected format is 9 colon delimited fields
# 1:2:3:4:5:6:7:8:9, where:
# 1: Hostname or ALL, defines if the check is allowed to run
# 2: Service description.  If blank, defaults to the script filename
# 3: Script filename
# 4: Script args
# 5: Cache time, for scripts that don't need to (or shouldn't) run often
# 6: Run As
# 7: ITIL code
# 8: Resolver Group. Defines the team responsible for reacting to alerts
# 9: Trigger command. A command to run if a script meets a condition

ALL:Chrome_Process_Count:check_procs:chrome 20:::::

The observant will recognize this as the /etc/shadow file format with some sudoers inspiration.

For the sake of this demonstration, I’ve got a simple local check that counts the number of processes matching a name and alerts if the count doesn’t match the expected number. This has an increasingly arguable use on the server side, but as I’m demonstrating on my desktop, I’ll use Chrome as the target process. We can see here that it outputs in the standard local check format:

▓▒░$ ../local/check_procs chrome 20
2 check_procs - Process chrome not running 20 times
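
The check_procs script itself isn’t reproduced in this post; purely as a sketch, a minimal version of such a check might look like this (assuming pgrep is available):

#!/bin/sh
# Hypothetical sketch of a check_procs-style local check
# Usage: check_procs <process_name> <expected_count>
proc_name="${1:?No process name supplied}"
expected="${2:?No expected count supplied}"

# Count matching processes; pgrep -c prints the match count
actual="$(pgrep -c -f "${proc_name}")"

if [ "${actual}" -eq "${expected}" ]; then
    printf -- '%s\n' "0 check_procs - Process ${proc_name} running ${expected} times"
else
    printf -- '%s\n' "2 check_procs - Process ${proc_name} not running ${expected} times"
fi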

Because the config file isn’t involved here, the script name is used as the service name. One of the upsides of the config file approach is that it allows scripts to be re-used with different args for different checks, e.g.:

ALL:root password age:check_pwage:root:43200::SEC:Systems Team:
ALL:grid password age:check_pwage:grid:43200::SEC:DBA Team:

So now, instead of blindly executing whatever is in a local or plugins directory, the config file becomes our source of truth: if it’s defined in the config file, it’s attempted. You can have 20 scripts in your local directory; if only one of them is defined in the config file, only that one will be attempted.
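
To make that dispatch logic concrete, here’s a rough sketch (illustrative only; it ignores the cache_time, run_as, ITIL, resolver and trigger fields, and assumes the paths used above):

#!/bin/sh
# Sketch: run only the checks defined in the config file
config="../etc/checkmk-local.conf"

grep -v '^[[:space:]]*#' "${config}" | while IFS=':' read -r host desc script args rest; do
    [ -z "${host}" ] && continue              # skip blank lines
    case "${host}" in
        ALL|"$(hostname)") : ;;               # entry applies to this host
        *) continue ;;                        # not for us
    esac
    [ -z "${desc}" ] && desc="${script}"      # default description to script filename
    # Only scripts defined in the config file are ever attempted
    # shellcheck disable=SC2086  # args is intentionally word-split
    [ -x "../local/${script}" ] && "../local/${script}" ${args}
done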

This approach, and the config file syntax, has advantages both for scenarios where config management is used and for those where it isn’t.

So what we can also do is capture the output of existing check scripts, blend that output with the information from the config file, and emit it as an NDJSON line that looks something like this (pretty-printed here with jq):

▓▒░$ ./checkmk_agent | tail -n 1 | jq
{
  "Local Checks": [
    {
      "service_name": "Chrome_Process_Count",
      "rc": 2,
      "status": "CRITICAL",
      "stdout": "Process chrome not running 20 times",
      "metrics": "-",
      "check_type": "local",
      "script_name": "check_procs",
      "script_args": "chrome 20",
      "script_runas": "rawiri",
      "cache_time": 0,
      "service_owner": null
    }
  ],
  "timestamp": {
    "utc_epoch": 1585510469
  }
}
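
To give a feel for the enrichment step, here’s a sketch that continues the hypothetical dispatch loop above; the quoting is naive (a real implementation needs proper JSON string escaping), and the field set is trimmed to fit:

# Map a Nagios return code to its status word
nagios_status() {
    case "${1}" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

output="$("../local/${script}" ${args})"
# Classic local check format: rc, name, metrics, then free text
set -- ${output}
rc="${1}" name="${2}" metrics="${3}"
shift 3
text="${*}"

printf -- '{ "service_name": "%s", "rc": %s, "status": "%s", "stdout": "%s", "metrics": "%s", "check_type": "local", "script_name": "%s", "script_args": "%s" }\n' \
    "${desc:-${name}}" "${rc}" "$(nagios_status "${rc}")" "${text}" "${metrics}" "${script}" "${args}"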

In other words: with this approach, MRPE and local checks can be, for all intents and purposes, handled exactly the same, check parameters and all. In the future, people might like to output directly in JSON format; I’m anticipating that, and such output would be passed straight through.

So does something like this have any appeal? Any questions/feedback/suggestions?

I actually like the local checks as they are now because of the simplicity of creating such a check. For example, I can run an agent with this local check on almost any system:

printf 'P Entropy_avail bits_available=%1$u;2000;1000 %1$u bits of entropy available\n' "$(cat /proc/sys/kernel/random/entropy_avail)"

If the local check or the agent has to do anything meaningful with the data, instead of just passing it to stdout, there has to be logic in the agent, which comes with additional dependencies like jq, which I might not have available on a (probably heavily stripped) appliance I want to monitor.

This change only makes sense if all the agent output is reformatted in JSON, so the agent has to contain the logic to parse the output of the commands. If, for example, ip link adds a new line for each device, the server plugin can easily detect that and e.g. switch to another parsing method. If the agent has to reformat the output to JSON, it has to do that understanding itself.
I think this leads to more brittle agents, and as it is not fun to do all that understanding in bash, perhaps to a new dependency on Python, which would then be needed on all systems and is probably not available.
I also do not see a real benefit in this, as the server plugin still needs to exist. I don’t think the server plugin benefits from this, as it is in fact just moving half of the server plugin to the agent and adding a new API to reunite them again.

What I do like is the directory structure for the plugins and local checks from the other thread. Combined with a parameter standard, I think most of the features mentioned can be achieved without increasing complexity.

I parameterize my local checks in two ways, and there are surely others. To pass an identifier, I pass it via $0:

Filename (or the name of a symlink pointing to this): mailbox_testuser_limit_20_megabyte.sh

#!/bin/bash

# Extract the check's parameters from its own (symlinked) filename
MAILBOX_NAME="$(basename "$0" | cut -d'_' -f2)"
LIMIT="$(basename "$0" | cut -d'_' -f4)"

echo "0 $(basename "$0") - $MAILBOX_NAME has limit $LIMIT"

The other way is to use an env file:

. "$(dirname "$0")/.$(basename "$0")"

In combination with your proposed folder structure:

localchecks-available/check_mailboxsize
localchecks-available/something_using_envfile
localchecks-enabled/mailbox_testuser_limit_20_megabyte.sh -> ../localchecks-available/check_mailboxsize
localchecks-enabled/something_envy -> ../localchecks-available/something_using_envfile
localchecks-enabled/.something_envy

The agent could load some convenience methods into the local checks/plugins to support a common standard, and a set of optional tools (along the lines of a2enmod) could support the process by creating the correct link names and env files from templates derived from a generic header (like the INIT INFO block in init.d files).

There’s a lot to reply to there, but I welcome the discussion.

I actually like the local checks as they are now because of the simplicity of creating such a check.

If it wasn’t clear: what I’m proposing must be backwards compatible. So if you want to use some random check script you found on the internet, go right ahead. If you want to create a new check using the old output format standard, also go right ahead, etc.

The check that I demonstrated in my previous post outputs in standard Nagios format; it’s simply captured by the agent, enriched with extra information, and put into NDJSON format.

If the local check or the agent has to do anything meaningful with the data, instead of just passing it to stdout, there has to be logic in the agent, which comes with additional dependencies like jq,

Additional logic - sure, but in most cases not that much. Additional dependencies like jq? Well… jq isn’t required at all for this. If I run the above example check with debugging, we see this line in the output:

+:174:: set -- 2 check_procs - Process chrome not running 20 times

So let’s use the outcome of that for a simple example:

▓▒░$ printf -- '{ "service_name": "%s", "rc": %s, "stdout": "%s", "metrics": "%s" }\n' "${2}" "${1}" "${*:4}" "${3}" | jq -r '.'
{
  "service_name": "check_procs",
  "rc": 2,
  "stdout": "Process chrome not running 20 times",
  "metrics": "-"
}

Easy. I only use jq here to pretty-print it for the sake of readability.

That said, I have a library for making local checks a little simpler and more robust. You can see an experimental variant of it here. I similarly have a library for JSON formatting. It’s not mandatory to have, and you can read about it here.

So in an agent structure that’s split up as I proposed elsewhere, you should (IMHO) be able to do the following things for local checks, plugins and MRPE checks:

  • Use traditional Nagios formatting, the agent will capture and convert
  • Maybe copy a script template that has all the logic ready to go; you just run whatever command you need and capture the results into a few pre-named variables
  • If you want to handle the NDJSON formatting yourself, you can do so; you just need the mandatory fields plus whatever else you want to add (and BTW, field order doesn’t matter; see the sketch after this list)
  • Or you could import a library with some JSON functions and use that to simplify things for yourself
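
For the self-formatting option, here’s a sketch that restates your entropy one-liner as direct NDJSON output; the field names follow my example output above, and the escaping is again naive:

#!/bin/sh
# Sketch: a local check that emits NDJSON itself, based on the entropy
# example from earlier in the thread
entropy="$(cat /proc/sys/kernel/random/entropy_avail)"

# Same thresholds as the one-liner: warn below 2000 bits, crit below 1000
if [ "${entropy}" -ge 2000 ]; then
    rc=0; status="OK"
elif [ "${entropy}" -ge 1000 ]; then
    rc=1; status="WARNING"
else
    rc=2; status="CRITICAL"
fi

printf -- '{ "service_name": "Entropy_avail", "rc": %s, "status": "%s", "stdout": "%s bits of entropy available", "metrics": "bits_available=%s;2000;1000", "check_type": "local" }\n' \
    "${rc}" "${status}" "${entropy}" "${entropy}"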

This change only makes sense if all the agent output is reformatted in JSON.

It’s a potential longer-term goal, but I’m thinking a bit bigger picture than that. At the moment the agent output is in at least half a dozen different mixed standards with varying degrees of fragility.

So the agent has to contain the logic to parse the output of the commands. If, for example, ip link adds a new line for each device, the server plugin can easily detect that and e.g. switch to another parsing method. If the agent has to reformat the output to JSON, it has to do that understanding itself.

Ok, so in this hypothetical scenario where ip link adds a new line for each device, absolutely nothing should happen: the extra line should simply be ignored, whether we’re using the current formats, XML, NDJSON or whatever. To make use of that extra line, the server plugin would still need to be upgraded, and that’s barely different from upgrading an agent across a fleet. Either way something’s getting upgraded, and either way it’s not a big deal.

I think this leads to more brittle agents, and as it is not fun to do all that understanding in bash, perhaps to a new dependency on Python, which would then be needed on all systems and is probably not available.

ip link adds a new line and Python is your knee-jerk reaction?

One of the side effects of breaking the agent up should be increased robustness, as each separate module is honed. That depends on faster code review turnaround, of course, and on very simple changes not sitting in the GitHub PR queue for months on end… in some cases over a year…

I also do not see a real benefit in this, as the server plugin still needs to exist. I don’t think the server plugin benefits from this, as it is in fact just moving half of the server plugin to the agent and adding a new API to reunite them again.

Possibly. This should massively simplify things on the server side and distribute a little of the processing load back to the clients. On the other hand, it allows us to enrich, enhance and build on the classic Nagios standard, as we’re no longer bound to it. It’s an opportunity to fix the mess of ‘Linux reports x-data with <<<header-x>>> and AIX reports the same x-data with <<<x-header>>>’ inconsistencies that exist across all of the agents, and it gives the server more certainty about what it has to work with: either an object has a keypair or it doesn’t, which makes for simpler backwards compatibility and error handling.

It also makes things easier for new code contributors: JSON is readily understood, NDJSON is a minor variation on it, and documentation around checkmk’s bespoke standards is extremely poor to non-existent. There are probably only about 20 people in the world who understand what sep means in <<<header:sep(56)>>> (it sets the field separator to the ASCII character with that code; 56 is the digit 8). And the whole sep thing is itself a dirty hack. If you’re looking for brittle, start there.

I have some other thoughts about massively reducing the client-side load as well, but these are big architectural steps, and we won’t get anywhere until the tribe29 guys get moving on some of the PRs sitting in the queue. This would be a large stepping stone towards that, though.

In all honesty, I doubt that anything I’ve proposed will be implemented, not with my name on it at least.

Filename (or the name of a symlink pointing to this): mailbox_testuser_limit_20_megabyte.sh

I’ve seen this approach at serious scale, and it can get really awful. So awful, in fact, that it was part of what motivated me to come up with the config file approach above all those years ago.

The agent could load some convenience methods into the local checks/plugins to support a common standard, and a set of optional tools (along the lines of a2enmod) could support the process by creating the correct link names and env files from templates derived from a generic header (like the INIT INFO block in init.d files).

Yes, like the libraries I mentioned :slight_smile: I have also done something similar to the whole a2enmod thing with my local checks framework, but that’s a whole other post.

Also, as another data point on brittleness in the current formats: Ceph_df breaks if a pool name contains whitespace.