Restructure the nix agents

Hi,

Apologies in advance for the wall of text.

For those who haven’t seen my name yet, I’ve put in some contributions towards improving the various *nix agent scripts, including writing a near-complete POSIX-compatible merge of them all, which has included some significant overhauls and code improvements (see PR #28).

As I’ve been tracking commits to the existing scripts and merging them into the monolithic merged script, it has become increasingly clear to me that such an approach for the *nix agent is not scalable or manageable. I’ve actually had those misgivings from the very start.

I have had other issues with the older agents when deployed across different *nix variants, Linux distros or even packages within the same distro (e.g. the tribe29 rpm and EPEL rpm use different directories). And it’s all a bit stupid (IMHO), because paths like /usr/lib/check_mk_agent/local/, /usr/share/check-mk-agent/local, /usr/share/check-mk-agent/plugins and similar examples are non-obvious and, honestly, a little bit obnoxious.

So my proposal is to restructure the *nix agents to use a modular approach based in /opt/checkmk/agent.

While /opt seems to be Linux-centric, the latest version of the FHS states:

Rationale
The use of /opt for add-on software is a well-established practice in the UNIX community. The System V Application Binary Interface [AT&T 1990], based on the System V Interface Definition (Third Edition), provides for an /opt structure very similar to the one defined here. The Intel Binary Compatibility Standard v. 2 (iBCS2) also provides a similar structure for /opt.

And, obviously, anybody can package it to reside elsewhere if they choose.

I propose splitting the massive merged monolithic script up into modules and simple libraries. Such a structure might look something like:

/opt/checkmk/agent/bin/checkmk_agent    # agent script
/opt/checkmk/agent/lib/common.sh        # lib path, referencing a shell library of common functions
/opt/checkmk/agent/include/common.sh    # possible alternative to lib
/opt/checkmk/agent/local-available/     # path for available local checks
/opt/checkmk/agent/local-enabled/       # path for available local checks that will be run by the agent
/opt/checkmk/agent/plugins-available/   # as above, but for plugins
/opt/checkmk/agent/plugins-enabled/

And so on with other pieces of structure, along with symlinks (which can be managed via package scripts) where required, e.g.:

/opt/checkmk/agent/var/ --> /var/opt/checkmk/agent/
/opt/checkmk/agent/etc/ --> /etc/opt/checkmk/agent/
/opt/checkmk/agent/tmp/ --> /tmp/checkmk/agent/

The agent’s job is then greatly simplified: it’s invoked via xinetd, systemd or some other method, attempts to find a sane interpreter, sets some environment variables and then loops through whatever is defined in local-enabled and plugins-enabled. Much of what is currently in the agent script can then be spun out to either local-available or plugins-available (or, alternatively, some other path like /opt/checkmk/agent/core-checks/).
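
To make that concrete, here is a rough sketch of what the slimmed-down loop could look like. It is not the real agent: the AGENT_HOME variable and the function names are made up for illustration, the interpreter detection is left out, and the directory names simply follow the layout proposed above.

```shell
#!/bin/sh
# Hypothetical sketch of the slimmed-down agent loop, not the real agent.
# AGENT_HOME, run_enabled and agent_main are illustrative names only.

AGENT_HOME="${AGENT_HOME:-/opt/checkmk/agent}"

# Run every executable file (or symlink to one) found in a directory,
# in glob order.
run_enabled() {
    dir=$1
    [ -d "$dir" ] || return 0
    for check in "$dir"/*; do
        # Skip the unexpanded glob (empty dir), subdirectories and
        # anything that is not executable
        [ -f "$check" ] && [ -x "$check" ] || continue
        "$check" || printf 'check failed: %s\n' "$check" >&2
    done
}

agent_main() {
    # Export the base path so each check can locate libraries and state
    export AGENT_HOME
    run_enabled "$AGENT_HOME/plugins-enabled"
    run_enabled "$AGENT_HOME/local-enabled"
}
```

Calling agent_main would then be the whole job of bin/checkmk_agent, and enabling a check is just a matter of symlinking it into the relevant -enabled directory.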

Once this modular approach is implemented, it then becomes far easier to apply fixes and improvements in isolation from the rest of the agent code. For example, PR #116 should have only applied to a file like /opt/checkmk/agent/core-checks/timesync.sh.

The modular approach conveniently fixes the main outstanding issue in PR #28, i.e. POSIX sh has no portable way to export functions to child processes. In the modular layout, we can simply have those functions in the bin/ directory as standalone scripts.
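
A minimal sketch of that idea, using a throwaway directory tree. The names section_header and demo_plugin are invented for the example and are not part of the existing agent:

```shell
#!/bin/sh
# Sketch only: "section_header" and "demo_plugin" are made-up names.

# Create a throwaway bin/ and plugins/ tree under the given directory.
make_demo_tree() {
    root=$1
    mkdir -p "$root/bin" "$root/plugins"

    # What would have been a shell function becomes a tiny script in bin/
    cat > "$root/bin/section_header" <<'EOF'
#!/bin/sh
printf '<<<%s>>>\n' "$1"
EOF

    # A plugin runs as a separate process, so it cannot see the agent's
    # functions, but it can call anything that is on PATH
    cat > "$root/plugins/demo_plugin" <<'EOF'
#!/bin/sh
section_header demo
echo "hello from a child process"
EOF

    chmod +x "$root/bin/section_header" "$root/plugins/demo_plugin"
}

# The agent only has to put its bin/ directory on PATH before running plugins
run_demo_plugin() {
    root=$1
    PATH="$root/bin:$PATH" "$root/plugins/demo_plugin"
}
```

Running make_demo_tree on a scratch directory and then run_demo_plugin on the same directory shows the plugin calling the helper without any function export involved.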

The other thing that this modular approach enables is a potential migration towards (ND)JSON style output. The thought of trying to do that in the merged nix agent script just fills me with dread. With the modular approach, however…

▓▒░$ bash checkmk_agent | jq -r '.'
{
  "checkmk": {
    "Version": "TESTING",
    "AgentOS": "linux",
    "Hostname": "minty",
    "AgentDirectory": "/etc/check_mk",
    "DataDirectory": "/var/lib/check_mk_agent",
    "SpoolDirectory": "/var/lib/check_mk_agent/spool",
    "PluginsDirectory": "/usr/lib/check_mk_agent/plugins",
    "LocalDirectory": "/usr/lib/check_mk_agent/local"
  },
  "timestamp": {
    "utc_epoch": 1583230981
  }
}

I’m happy to do much of the heavy lifting on the agent side of the equation, if anyone is interested in making the requisite changes on the server side.

Any questions and/or feedback appreciated :slight_smile:

After a little more fun with NDJSON formatting:

▓▒░$ bash checkmk_agent    
{"checkmk": {"Version": "testing-json", "AgentOS": "linux", "Hostname": "minty", "AgentDirectory": "/home/rawiri/git/checkMK/agents/nix/etc", "DataDirectory": "/home/rawiri/git/checkMK/agents/nix/var", "SpoolDirectory": "/home/rawiri/git/checkMK/agents/nix/var/spool", "PluginsDirectory": "/home/rawiri/git/checkMK/agents/nix/plugins-enabled", "LocalDirectory": "/home/rawiri/git/checkMK/agents/nix/local-enabled"}, "timestamp": {"utc_epoch": 1583318477}}
<<<fileinfo:sep(124)>>>
1583318477
[[[header]]]
name|status|size|time
[[[content]]]
/tmp/validate_tld|ok|551|1583308175
/tmp/pants|missing
{"fileinfo": [{"name": "/tmp/validate_tld", "status": "ok", "size": 551, "time": 1583308175},{"name": "/tmp/pants", "status": "missing", "size": null, "time": null} ], "timestamp": {"utc_epoch": 1583318477}}
{"uptime": {"uptime": 1583021.05, "idle": 5110854.69, "who_b": "system boot  Feb 15 15:57"}}

So for the sake of comparison, I’ve kept the existing layout for the fileinfo check and put it alongside an example of how it might be represented in JSON format:

<<<fileinfo:sep(124)>>>
1583318477
[[[header]]]
name|status|size|time
[[[content]]]
/tmp/validate_tld|ok|551|1583308175
/tmp/pants|missing

vs

{"fileinfo": [{"name": "/tmp/validate_tld", "status": "ok", "size": 551, "time": 1583308175},{"name": "/tmp/pants", "status": "missing", "size": null, "time": null} ], "timestamp": {"utc_epoch": 1583318477}}

Or, when pretty printed:

{
  "fileinfo": [
    {
      "name": "/tmp/validate_tld",
      "status": "ok",
      "size": 551,
      "time": 1583308175
    },
    {
      "name": "/tmp/pants",
      "status": "missing",
      "size": null,
      "time": null
    }
  ],
  "timestamp": {
    "utc_epoch": 1583353110
  }
}

The structure is such that adding extra fields like mode, owner, group and checksum is dead simple, and this improves the capability of fileinfo to the point that it fundamentally becomes a FIM (file integrity monitor).

Actually, this was so easy that I went ahead and did it:

{
  "fileinfo": [
    {
      "name": "/tmp/validate_tld",
      "status": "ok",
      "size": 551,
      "uid": 1000,
      "gid": 1000,
      "mode": 640,
      "atime": 1583358170,
      "mtime": 1583308175,
      "checksum": "a0d4c6b2ff06279f242eb38e4e7a01ca85d5a8444cb9a43f16ca666d992b41eb"
    },
    {
      "name": "/tmp/pants",
      "status": "missing",
      "size": null,
      "uid": null,
      "gid": null,
      "mode": null,
      "atime": null,
      "mtime": null,
      "checksum": null
    }
  ],
  "timestamp": {
    "utc_epoch": 1583358463
  }
}
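
For anyone wondering how little shell that takes, here is a cut-down sketch covering only name, status and size. It is an illustration rather than the actual code: the function names are invented, it assumes paths contain no quotes or backslashes (there is no JSON escaping), and it assumes a date that supports +%s.

```shell
#!/bin/sh
# Illustrative sketch, not the real fileinfo check.
# Assumes paths need no JSON escaping and that date supports +%s.

# Print one JSON object for a path: name, status and size.
fileinfo_record() {
    path=$1
    if [ -f "$path" ]; then
        # wc -c is POSIX; tr strips the padding some implementations add
        size=$(wc -c < "$path" | tr -d ' ')
        printf '{"name": "%s", "status": "ok", "size": %s}' "$path" "$size"
    else
        printf '{"name": "%s", "status": "missing", "size": null}' "$path"
    fi
}

# Print a single NDJSON line covering all paths given as arguments.
fileinfo_ndjson() {
    printf '{"fileinfo": ['
    sep=''
    for path in "$@"; do
        printf '%s' "$sep"
        fileinfo_record "$path"
        sep=', '
    done
    printf '], "timestamp": {"utc_epoch": %s}}\n' "$(date +%s)"
}
```

Something like fileinfo_ndjson /tmp/validate_tld /tmp/pants would emit one line in the same shape as the examples above. Fields such as uid, mode or mtime need stat, whose options differ between platforms, so they are left out here.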

Example code can now be accessed here:

It is very rough around the edges with a lot to be fixed.

I already put mine in /opt/check_mk with etc, bin, plugins, and local under there. Makes more sense to me than the other locations you mention.

It seems your idea about splitting things out makes sense as long as it’s managed centrally. I guess maybe if you use the Agent Bakery (maybe an enterprise-only feature?) then it’s not such a big deal. But I wouldn’t want to have to touch each server (even via ansible) to enable/disable the checks I want.

Also, on a different note, I don’t think any repos should include the check_mk agent. We had a problem (before we upgraded to 1.6 at least) where yum update would replace our agent, with its custom paths, plugins, local checks, etc., with the agent from the EPEL repo since it was newer. Number 1, I lost all my plugins, local checks, etc… and number 2, the agent cannot be newer than the server! So that broke stuff constantly if we forgot to exclude that package from yum.

Hi Lance,
yeah, I have used /opt/check_mk_agent/ in a couple of different ways at a previous job and it just made so much sense to have it somewhere in /opt rather than in an unpredictable place somewhere in /usr.

It seems your idea about splitting things out makes sense as long as it’s managed centrally.

Hmmm that’s not really what I’m getting at. It would still be packaged and deployed as usual, and anyone can then overlay the base package install with custom local checks, plugins etc however they please - bakery, ansible, rsync… whatever works for them.

Splitting the agent code out is more about code manageability and setting a foundation for potential future improvements. The merged agent script is getting up to 3k lines and is a bit unwieldy, and it still has room to grow, i.e. where it has capability for Linux that still needs to be filled in for Solaris, AIX etc. I know from experience that if this is genuinely followed through, it can massively blow out the amount of code :frowning:

It’s also about being able to make commits to the git repo in relative isolation. At the moment I have something like half a dozen PRs sitting there. I also have a massive backlog of commits to throw in but, frustratingly, I’m kinda blocked by those PRs: I can’t really commit any further without causing merge conflicts. With things split up, I (or anyone else) can simply have a branch per target item and queue up whatever number of commits I like.

I’m also trying to set some groundwork for untangling the mess of how MRPE, local checks, plugins, inventory scripts and whatever else are handled.

As demonstrated above, it also opens up the opportunity to throw in a bunch of code to generate NDJSON structures that the server side can process with plain old JSON libraries, rather than check_mk’s somewhat fragile-looking incumbent standard.

But, as far as central management goes, I am leaning towards introducing a couple of config files. I discuss one here which could be actively centrally managed or not… and I’m still forming my ideas about the other one, but at this stage I’m envisioning it being one that you can optionally manage.

That linked config file standard is an improved descendant of how I recall managing local checks in /opt/check_mk_agent at my previous employer. I settled on its syntax probably 5 or 6 years ago specifically to deal with customers who used Ansible, Puppet etc and customers who had no config management at all (i.e. you can craft a single monolithic file and deploy it… however… just as you can with sudoers)

Perhaps there are other avenues like having monitored hosts pull their configs down, maybe via the newfangled api? Or the bakery… I don’t know :slight_smile:

I don’t think any repos should include the check_mk agent.

Absolutely agree :slight_smile: To the rest of your packaging issues, at my aforementioned previous employer we’d just use yum-versionlock. Single line change in the ansible inventory, then go on with your life :smiley:

Thanks for the feedback, though, it’s nice getting a break from the radio silence.


FWIW I’ve changed my mind on this design decision:

/opt/checkmk/agent/local-available/     # path for available local checks
/opt/checkmk/agent/local-enabled/       # path for available local checks that will be run by the agent
/opt/checkmk/agent/plugins-available/   # as above, but for plugins
/opt/checkmk/agent/plugins-enabled/

Instead, I’d go for something like:

/opt/checkmk/agent/localchecks/    # MRPE checks would also go in here
/opt/checkmk/agent/plugins/

You can put whatever you want in those directories; what determines whether or not they’re run is the config file that I discuss over here