Defining a config file and an NDJSON standard for local checks

There’s a lot to reply to there, but I welcome the discussion.

I actually like the local checks as they are now because of the simplicity of creating such a check.

If it wasn’t clear, what I’m proposing must be backwards compatible. So if you want to use some random check script you found on the internet, go right ahead. If you want to create a new check using the old output format, also go right ahead, and so on.

The check that I demonstrated in my previous post outputs standard Nagios format; it’s simply captured by the agent, enriched with extra information, and put into NDJSON format.
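To make the capture-and-wrap step concrete, here’s a rough sketch in bash. The field names and the fallback behaviour are my own placeholders, not a finalised spec, and a real implementation would also escape quotes and backslashes before embedding text in JSON:

```shell
#!/usr/bin/env bash
# Sketch only: run a classic Nagios-format check, then wrap its exit
# code and output into one NDJSON object. Field names are illustrative.
run_and_wrap() {
  local name="$1"; shift
  local out rc
  out="$("$@" 2>&1)"     # capture the check's combined output
  rc=$?                  # ...and its Nagios state code (0/1/2/3)
  local text="${out%%|*}"          # summary text before the pipe
  local perf="-"                   # match "-" when no perfdata given
  [ "${out}" != "${text}" ] && perf="${out#*|}"   # perfdata after it
  printf '{ "service_name": "%s", "rc": %d, "stdout": "%s", "metrics": "%s" }\n' \
    "${name}" "${rc}" "${text}" "${perf}"
}

# Stand-in for a real Nagios-format local check:
run_and_wrap check_procs sh -c 'echo "Process chrome not running 20 times"; exit 2'
```

Run like this, it emits the same kind of object shown further down, with `metrics` falling back to `-` when the check prints no perfdata.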

If the local check or the agent has to do anything meaningful with the data instead of just passing it to stdout, there has to be logic in the agent, which comes with additional dependencies like jq.

Additional logic - sure, but in most cases not that much. Additional dependencies like jq? Well, jq isn’t required at all for this. If I run the above example check with debugging, we see this line in the output:

+:174:: set -- 2 check_procs - Process chrome not running 20 times

So let’s use the outcome of that for a simple example:

$ printf -- '{ "service_name": "%s", "rc": %s, "stdout": "%s", "metrics": "%s" }\n' "${2}" "${1}" "${*:4}" "${3}" | jq -r '.'
{
  "service_name": "check_procs",
  "rc": 2,
  "stdout": "Process chrome not running 20 times",
  "metrics": "-"
}

Easy. I only use jq here to pretty-print it for the sake of readability.

That said, I have a library for making local checks a little simpler and more robust. You can see an experimental variant of it here. I similarly have a library for JSON formatting. It’s not mandatory to have, and you can read about it here.

So in an agent structure that’s split up as I proposed elsewhere, you should (IMHO) be able to do the following things for local checks, plugins and MRPE checks:

  • Use traditional Nagios formatting, the agent will capture and convert
  • Maybe copy a script template that has all the logic ready to go; you just run whatever command you need and capture the results into a few pre-named variables
  • If you want to handle the NDJSON formatting yourself, you can; you just need the mandatory fields plus whatever else you want to add (and BTW, field order doesn’t matter)
  • Or you could import a library with some JSON functions and use that to simplify things for yourself
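As a sketch of the self-formatting option: once a check has decided on a state, emitting NDJSON itself is a single printf. The field names mirror the example above; treat the exact mandatory-field set as illustrative, not settled:

```shell
#!/usr/bin/env bash
# Sketch of a local check that handles the NDJSON formatting itself.
emit_check() {
  # $1 = service name, $2 = state (0/1/2/3), $3 = summary, $4 = metrics
  printf '{ "service_name": "%s", "rc": %d, "stdout": "%s", "metrics": "%s" }\n' \
    "$1" "$2" "$3" "$4"
}

# Count matching processes; default to 0 if pgrep is unavailable.
count="$(pgrep -c chrome 2>/dev/null || true)"
count="${count:-0}"
if [ "${count}" -ge 1 ]; then
  emit_check check_procs 0 "Process chrome running ${count} times" "count=${count}"
else
  emit_check check_procs 2 "Process chrome not running" "count=${count}"
fi
```

Since field order doesn’t matter, extra keys can be appended to that printf without breaking anything on the server side.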

This change only makes sense if all the agent output is reformatted as JSON.

Possibly, as a longer-term goal, but I’m thinking bigger picture than that. At the moment the agent output is in at least half a dozen different mixed standards with varying degrees of fragility.

So the agent has to contain the logic to parse the output of the commands. If, for example, ip link adds a new line for each device, the server plugin can easily detect that and e.g. switch to another parsing method. If the agent has to reformat the output to JSON, it has to do the understanding itself.

OK, so in this hypothetical scenario where ip link adds a new line for each device, absolutely nothing should happen: the extra line should simply be ignored, whether we’re using the current formats, XML, NDJSON or anything else. To make use of that extra line, the server plugin would still need to be upgraded, which is barely different from upgrading an agent across a fleet. Either way something gets upgraded, and either way it’s not a big deal.
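To make that concrete, here’s a rough sketch (the section and field names are entirely hypothetical) of an agent-side converter that extracts only the fields it understands, so an extra line from a future iproute2 simply falls through without breaking anything:

```shell
#!/usr/bin/env bash
# Hypothetical converter: turn "ip -o link"-style records into NDJSON,
# keeping only the fields the agent understands. A line it does not
# recognise (or trailing fields it does not need) is simply ignored.
to_ndjson() {
  while read -r idx name _rest; do
    case "${idx}" in
      [0-9]*:) ;;              # looks like an "ip -o link" record
      *)       continue ;;     # unknown line format: skip it
    esac
    printf '{ "section": "ip_link", "index": "%s", "device": "%s" }\n' \
      "${idx%:}" "${name%:}"
  done
}

# In the agent this would be: ip -o link | to_ndjson
to_ndjson <<'EOF'
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state UP
some hypothetical extra line a newer iproute2 might add
EOF
```

The unknown third line produces no output at all; using the new data would still require upgrading something, exactly as with the current server-side parsing.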

I think this leads to more brittle agents, and as it is not fun to do all the understanding in bash, perhaps to a new dependency on Python, which would then be needed on all systems and is probably not available.

ip link adds a new line, and Python is your knee-jerk reaction?

One of the side effects of breaking the agent up should be increased robustness, as each separate module is honed. That depends on faster code review turnaround, of course, and on very simple changes not sitting in the GitHub PR queue for months on end… in some cases over a year…

I also do not see a real benefit in this, as the server plugin still needs to exist. I don’t think the server plugin benefits from this, as it is in fact just moving half of the server plugin to the agent and adding a new API to reunite them again.

Possibly. This should massively simplify things on the server side and push a little of the processing load back to the clients. On the other hand, it lets us enrich, enhance and build on the classic Nagios standard, since we’re no longer bound to it. It’s an opportunity to fix the mess of ‘Linux reports x-data with <<<header-x>>> while AIX reports the same x-data with <<<x-header>>>’ inconsistencies scattered across all of the agents, and it gives the server more surety about what it’s working with - either an object has a key/value pair or it doesn’t - which means simpler backwards compatibility and error handling.

It also makes things easier for new code contributors: JSON is readily understood, NDJSON is a minor step from it, and documentation of checkmk’s bespoke standards is extremely poor to non-existent. There are probably only about 20 people in the world who understand what sep means in <<<header:sep(56)>>>, and the whole sep mechanism is itself a dirty hack. If you’re looking for brittle, start there.
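For anyone following along: sep(N) declares that fields in that section are separated by the character with ASCII code N, so reading a header means decoding the ASCII table in your head. A quick illustration:

```shell
#!/usr/bin/env bash
# sep(9) in a header like <<<df:sep(9)>>> means "fields are separated
# by ASCII 9" (a tab). sep(56), then, means the separator is the
# literal character "8" -- hardly self-documenting:
sep_char="$(printf "\\$(printf '%03o' 56)")"
echo "sep(56) separator character: ${sep_char}"   # prints: sep(56) separator character: 8
```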

I have some other thoughts about massively reducing the client-side load as well, but these are big architectural steps, and we won’t get anywhere until the tribe29 guys get moving on some of the PRs sitting in the queue. This would be a large stepping stone towards that, though.

In all honesty, I doubt that anything I’ve proposed will be implemented, not with my name on it at least.

Filename (or the name of a symlink pointing to this): mailbox_testuser_limit_20_megabyte.sh

I’ve seen this approach at serious scale. It can get really awful. So awful, in fact, that it was in part a motivator for me to come up with the indicated config file all those years ago.

The agent could load some convenience methods into the local checks/plugins to support a common standard, and a set of optional tools (like a2enmod) could support the process by creating the correct link names and env files from templates, driven by a generic header (like the init info in init.d files).

Yes, like the libraries I mentioned :slight_smile: I have also done something similar to the whole a2enmod thing with my local checks framework, but that’s a whole other post.

Also, Ceph_df breaks if a pool name contains whitespace