Restructure the nix agents

Hi,

Apologies in advance for the wall of text.

For those who haven’t seen my name yet, I’ve contributed a number of improvements to the various *nix agent scripts, including a near-complete POSIX-compatible merge of them all, which included some significant overhauls and code improvements (see PR #28).

As I’ve been tracking commits to the existing scripts and merging them into the monolithic merged script, it has become increasingly clear to me that this approach to the *nix agent is neither scalable nor manageable. I’ve actually had those misgivings from the very start.

I have had other issues with the older agents when deployed across different *nix variants, Linux distros or even packages within the same distro (e.g. the tribe29 rpm and EPEL rpm use different directories). And it’s all a bit stupid (IMHO), because paths like /usr/lib/check_mk_agent/local/, /usr/share/check-mk-agent/local, /usr/share/check-mk-agent/plugins and similar examples are non-obvious and, honestly, a little bit obnoxious.

So my proposal is to restructure the *nix agents around a modular approach rooted in /opt/checkmk/agent.

While /opt seems to be Linux-centric, the latest version of the FHS states:

Rationale
The use of /opt for add-on software is a well-established practice in the UNIX community. The System V Application Binary Interface [AT&T 1990], based on the System V Interface Definition (Third Edition), provides for an /opt structure very similar to the one defined here. The Intel Binary Compatibility Standard v. 2 (iBCS2) also provides a similar structure for /opt.

And, obviously, anybody can package it to reside elsewhere if they choose.

I propose splitting the massive merged monolithic script up into modules and simple libraries. Such a structure might look something like:

/opt/checkmk/agent/bin/checkmk_agent    # agent script
/opt/checkmk/agent/lib/common.sh        # lib path, referencing a shell library of common functions
/opt/checkmk/agent/include/common.sh    # possible alternative to lib
/opt/checkmk/agent/local-available/     # path for available local checks
/opt/checkmk/agent/local-enabled/       # path for available local checks that will be run by the agent
/opt/checkmk/agent/plugins-available/   # as above, but for plugins
/opt/checkmk/agent/plugins-enabled/

And so on for other pieces of structure, along with symlinks (which can be managed via package scripts) where required, e.g.:

/opt/checkmk/agent/var/ --> /var/opt/checkmk/agent/
/opt/checkmk/agent/etc/ --> /etc/opt/checkmk/agent/
/opt/checkmk/agent/tmp/ --> /tmp/checkmk/agent/
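To illustrate, here’s a minimal sketch of what a package post-install script might do to set those links up (the paths match the proposal above; the script itself is hypothetical):

#!/bin/sh
# Hypothetical post-install fragment: create the writable trees outside
# /opt and link them into the agent root
AGENT_ROOT="/opt/checkmk/agent"
for dir in var etc; do
    mkdir -p "/${dir}/opt/checkmk/agent"
    rm -f "${AGENT_ROOT}/${dir}"    # clear any stale link first
    ln -s "/${dir}/opt/checkmk/agent" "${AGENT_ROOT}/${dir}"
done
mkdir -p /tmp/checkmk/agent
rm -f "${AGENT_ROOT}/tmp"
ln -s /tmp/checkmk/agent "${AGENT_ROOT}/tmp"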

The agent’s job is then greatly simplified: it’s invoked via xinetd, systemd or some other method; it attempts to find a sane interpreter, sets some environment variables and then loops through whatever is defined in local-enabled and plugins-enabled. Much of what is currently in the agent script can then be spun out to either local-available or plugins-available (or, alternatively, some other path like /opt/checkmk/agent/core-checks/).
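To give a feel for it, a rough sketch of that simplified main loop (the variable names and exported paths are illustrative, not a final design):

#!/bin/sh
# Rough sketch only: export a few paths, then run whatever is enabled
AGENT_ROOT="/opt/checkmk/agent"
MK_LIBDIR="${AGENT_ROOT}/lib";  export MK_LIBDIR
MK_VARDIR="${AGENT_ROOT}/var";  export MK_VARDIR

for check in "${AGENT_ROOT}/local-enabled/"* "${AGENT_ROOT}/plugins-enabled/"*; do
    # an unmatched glob stays literal, so the -x test also filters that out
    [ -x "${check}" ] && "${check}"
done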

Once this modular approach is implemented, it then becomes far easier to apply fixes and improvements in isolation from the rest of the agent code. For example, PR #116 should have only applied to a file like /opt/checkmk/agent/core-checks/timesync.sh.

The modular approach conveniently fixes the main outstanding issue in PR #28, i.e. that POSIX sh cannot easily export functions. In the modular layout, those functions can simply live in the bin/ directory as standalone scripts.
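A quick illustration, with get_epoch as a hypothetical helper: where bash would rely on export -f get_epoch (which POSIX sh has no equivalent for), a module in the modular layout just executes the helper as an ordinary child process:

# no function exporting required; the helper is its own executable in bin/
now="$(/opt/checkmk/agent/bin/get_epoch)"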

The other thing that this modular approach enables is a potential migration towards (ND)JSON-style output. The thought of trying to do that in the merged *nix agent script just fills me with dread. With the modular approach, however…

▓▒░$ bash checkmk_agent | jq -r '.'
{
  "checkmk": {
    "Version": "TESTING",
    "AgentOS": "linux",
    "Hostname": "minty",
    "AgentDirectory": "/etc/check_mk",
    "DataDirectory": "/var/lib/check_mk_agent",
    "SpoolDirectory": "/var/lib/check_mk_agent/spool",
    "PluginsDirectory": "/usr/lib/check_mk_agent/plugins",
    "LocalDirectory": "/usr/lib/check_mk_agent/local"
  },
  "timestamp": {
    "utc_epoch": 1583230981
  }
}

I’m happy to do much of the heavy lifting on the agent side of the equation, if anyone is interested in making the requisite changes on the server side.

Any questions and/or feedback appreciated :slight_smile:

After a little more fun with NDJSON formatting:

▓▒░$ bash checkmk_agent    
{"checkmk": {"Version": "testing-json", "AgentOS": "linux", "Hostname": "minty", "AgentDirectory": "/home/rawiri/git/checkMK/agents/nix/etc", "DataDirectory": "/home/rawiri/git/checkMK/agents/nix/var", "SpoolDirectory": "/home/rawiri/git/checkMK/agents/nix/var/spool", "PluginsDirectory": "/home/rawiri/git/checkMK/agents/nix/plugins-enabled", "LocalDirectory": "/home/rawiri/git/checkMK/agents/nix/local-enabled"}, "timestamp": {"utc_epoch": 1583318477}}
<<<fileinfo:sep(124)>>>
1583318477
[[[header]]]
name|status|size|time
[[[content]]]
/tmp/validate_tld|ok|551|1583308175
/tmp/pants|missing
{"fileinfo": [{"name": "/tmp/validate_tld", "status": "ok", "size": 551, "time": 1583308175},{"name": "/tmp/pants", "status": "missing", "size": null, "time": null} ], "timestamp": {"utc_epoch": 1583318477}}
{"uptime": {"uptime": 1583021.05, "idle": 5110854.69, "who_b": "system boot  Feb 15 15:57"}}

So, for the sake of comparison, here’s the existing layout for the fileinfo check, followed by an example of how it might be represented in JSON format:

<<<fileinfo:sep(124)>>>
1583318477
[[[header]]]
name|status|size|time
[[[content]]]
/tmp/validate_tld|ok|551|1583308175
/tmp/pants|missing

vs

{"fileinfo": [{"name": "/tmp/validate_tld", "status": "ok", "size": 551, "time": 1583308175},{"name": "/tmp/pants", "status": "missing", "size": null, "time": null} ], "timestamp": {"utc_epoch": 1583318477}}

Or, when pretty printed:

{
  "fileinfo": [
    {
      "name": "/tmp/validate_tld",
      "status": "ok",
      "size": 551,
      "time": 1583308175
    },
    {
      "name": "/tmp/pants",
      "status": "missing",
      "size": null,
      "time": null
    }
  ],
  "timestamp": {
    "utc_epoch": 1583353110
  }
}

The structure of this is such that adding extra fields like mode, owner, group and checksum is dead simple, and this improves the capability of fileinfo to the point that it fundamentally becomes a file integrity monitor (FIM).

Actually, this was so easy that I went ahead and did it:

{
  "fileinfo": [
    {
      "name": "/tmp/validate_tld",
      "status": "ok",
      "size": 551,
      "uid": 1000,
      "gid": 1000,
      "mode": 640,
      "atime": 1583358170,
      "mtime": 1583308175,
      "checksum": "a0d4c6b2ff06279f242eb38e4e7a01ca85d5a8444cb9a43f16ca666d992b41eb"
    },
    {
      "name": "/tmp/pants",
      "status": "missing",
      "size": null,
      "uid": null,
      "gid": null,
      "mode": null,
      "atime": null,
      "mtime": null,
      "checksum": null
    }
  ],
  "timestamp": {
    "utc_epoch": 1583358463
  }
}
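For the curious, generating a record like that is roughly this much shell. This is a sketch that assumes GNU stat and sha256sum; other unices would need their own stat incantations:

#!/bin/sh
# Sketch: emit one fileinfo record as JSON (GNU coreutils assumed)
file_record() {
    if [ -e "$1" ]; then
        # stat fields: size uid gid octal-mode atime mtime
        set -- "$1" $(stat -c '%s %u %g %a %X %Y' "$1")
        checksum="$(sha256sum "$1" | cut -d' ' -f1)"
        printf '{"name": "%s", "status": "ok", "size": %s, "uid": %s, "gid": %s, "mode": %s, "atime": %s, "mtime": %s, "checksum": "%s"}\n' \
            "$1" "$2" "$3" "$4" "$5" "$6" "$7" "${checksum}"
    else
        printf '{"name": "%s", "status": "missing", "size": null, "uid": null, "gid": null, "mode": null, "atime": null, "mtime": null, "checksum": null}\n' "$1"
    fi
}
file_record /tmp/validate_tld
file_record /tmp/pants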

Example code can now be accessed here:

It is very rough around the edges with a lot to be fixed.

I already put mine in /opt/check_mk with etc, bin, plugins, and local under there. Makes more sense to me than the other locations you mention.

It seems your idea about splitting things out makes sense as long as it’s managed centrally. I guess if you use the Agent Bakery (maybe an enterprise-only feature?) then it’s not such a big deal, but I wouldn’t want to have to touch each server (even via ansible) to enable/disable the checks I want.

Also, on a different note, I don’t think any repos should include the check_mk agent. We had a problem (before we upgraded to 1.6, at least) where yum update would replace our agent, with all its custom paths, plugins, local checks, etc., with the agent from the EPEL repo, since it was newer. Number one, I lost all my plugins, local checks, etc., and number two, the agent cannot be newer than the server! So that broke stuff constantly if we forgot to exclude that package from yum.

Hi Lance,
yeah, I have used /opt/check_mk_agent/ in a couple of different ways at a previous job and it just made so much sense to have it somewhere in /opt rather than in an unpredictable place somewhere in /usr.

It seems your idea about splitting things out makes sense as long as it’s managed centrally.

Hmmm, that’s not really what I’m getting at. It would still be packaged and deployed as usual, and anyone can then overlay the base package install with custom local checks, plugins etc. however they please - bakery, ansible, rsync… whatever works for them.

Splitting the agent code out is more about code manageability and setting a foundation for potential future improvements. The merged agent script is approaching 3k lines, is a bit unwieldy, and still has room to grow, i.e. capability that exists for Linux still needs to be filled in for Solaris, AIX etc. I know from experience that if that is genuinely followed through, it can massively blow out the amount of code :frowning:

It’s also about being able to make commits to the git repo in relative isolation. At the moment I have something like half a dozen PRs sitting there, and I also have a massive backlog of commits to throw in, but, frustratingly, I’m kinda blocked by those PRs - I can’t really commit any further without invoking merge conflicts. With things split up, I - or anyone else - can simply have a branch per target item and queue up any number of commits.

I’m also trying to set some groundwork for untangling the mess of how MRPE, local checks, plugins, inventory scripts and whatever else are handled.

As demonstrated above, it also opens up the opportunity to throw in a bunch of code to generate ndjson structures that the server side can process with plain old JSON libraries, rather than check_mk’s somewhat fragile-looking incumbent standard.

But, as far as central management goes, I am leaning towards introducing a couple of config files. I discuss one here, which could be actively centrally managed or not… and I’m still forming my ideas about the other one, but at this stage I’m envisioning it being one that you can optionally manage.

That linked config file standard is an improved descendant of how I recall managing local checks in /opt/check_mk_agent at my previous employer. I settled on its syntax probably 5 or 6 years ago specifically to deal with customers who used Ansible, Puppet etc and customers who had no config management at all (i.e. you can craft a single monolithic file and deploy it… however… just as you can with sudoers)

Perhaps there are other avenues like having monitored hosts pull their configs down, maybe via the newfangled api? Or the bakery… I don’t know :slight_smile:

I don’t think any repos should include the check_mk agent.

Absolutely agree :slight_smile: To the rest of your packaging issues, at my aforementioned previous employer we’d just use yum-versionlock. Single line change in the ansible inventory, then go on with your life :smiley:

Thanks for the feedback, though - it’s nice getting a break from the radio silence.


FWIW I’ve changed my mind on this design decision:

/opt/checkmk/agent/local-available/     # path for available local checks
/opt/checkmk/agent/local-enabled/       # path for available local checks that will be run by the agent
/opt/checkmk/agent/plugins-available/   # as above, but for plugins
/opt/checkmk/agent/plugins-enabled/

Instead, I’d go for something like:

/opt/checkmk/agent/localchecks/    # MRPE checks would also go in here
/opt/checkmk/agent/plugins/

You can put whatever you want in those directories; what determines whether or not they’re run is the config file that I discuss over here.
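Purely for illustration (the real syntax is in the linked post), imagine etc/checks.conf listing one enabled check name per line; the agent’s selection logic then reduces to something like:

# Hypothetical config-driven selection; checks.conf holds one name per line
conf="/opt/checkmk/agent/etc/checks.conf"
for check in /opt/checkmk/agent/localchecks/*; do
    [ -x "${check}" ] || continue
    name="$(basename "${check}")"
    # run the check only if its name appears in the config
    grep -qx "${name}" "${conf}" && "${check}"
done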


We saw and actually used some of them - thanks for sharing.
We totally agree, and upvote finally implementing and simplifying the agents as you suggested.

And directly adding similar behaviour to the Prometheus agent (@andreas-doehler, maybe you wanna add some suggestions to push this even more), or at least proper parallelization to speed it up, especially for complex UNIX hosts with lots of applications and hardware.

By the way, thanks for your work, and I hope that, for the better of CMK, some of your PRs will finally run through.

By the way, thanks for your work, and I hope that, for the better of CMK, some of your PRs will finally run through.

Hey, thanks for that. Sadly, I’ve recently closed all my PRs because I just got tired of waiting, and I’ve been slowly but surely mentally checking out. Like, I barely care to contribute anymore, if I’m honest. :frowning: And it hasn’t just been with my own PRs either; there were PRs there (not mine) for typo-level fixes that just sat untouched for several months until I or somebody else made some noise.

This isn’t a good look for tribe29. There are repeated claims of “we’re really interested but we totally don’t have the time”… well, that just gives the public perception of an organisation that lacks basic planning. How hard is it to have a team culture where you say “OK, we might be slammed with work, but let’s have a PR Friday where, once a week/fortnight/month, we’re dedicated to reviewing the PR queue, and we update every single PR, even if it goes no further”? Or a rotating roster where everyone has rostered time to attend to the PR queue? There are probably a dozen sane ways this could be structured…


Anyway, because I’m now mentally re-engaged in this, I figure it’s brain-dump time, for anyone who’s interested.

So, following on from the above suggestion, I built a proof of concept for a method to improve agent efficiency. I hinted very lightly at that here:

I’m also trying to set some groundwork for untangling the mess of how MRPE, local checks, plugins, inventory scripts and whatever else are handled.

In my view, MRPE, local checks, plugins etc. should all be handled in the same way: their output cached with a timestamp and a TTL/expiry. Kind of like a grown-up version of the spooldir and run_cached mechanisms that exist today. For example, let’s take something like this from my first post:

▓▒░$ bash checkmk_agent | jq -r '.'
{
  "checkmk": {
    "Version": "TESTING",
    "AgentOS": "linux",
    "Hostname": "minty",
    "AgentDirectory": "/etc/check_mk",
    "DataDirectory": "/var/lib/check_mk_agent",
    "SpoolDirectory": "/var/lib/check_mk_agent/spool",
    "PluginsDirectory": "/usr/lib/check_mk_agent/plugins",
    "LocalDirectory": "/usr/lib/check_mk_agent/local"
  },
  "timestamp": {
    "utc_epoch": 1583230981
  }
}

Is it really necessary to transmit that information every polling period? Nope. And frankly it’s just the agent version that we care about. In some cases it might even have an attached cost (e.g. cloud data costs at some scale point). Transmitting it once-ish at agent start-up and then every 24 hours is a bit saner. So let’s add something to it:

▓▒░$ bash checkmk_agent | jq -r '.'
{
  "checkmk": {
    "Version": "TESTING",
    "AgentOS": "linux",
    "Hostname": "minty",
    "AgentDirectory": "/etc/check_mk",
    "DataDirectory": "/var/lib/check_mk_agent",
    "SpoolDirectory": "/var/lib/check_mk_agent/spool",
    "PluginsDirectory": "/usr/lib/check_mk_agent/plugins",
    "LocalDirectory": "/usr/lib/check_mk_agent/local"
  },
  "timestamp": {
    "utc_epoch": 1583230981
  },
  "expiry": {
    "utc_epoch": 1583317381
  }
}

OK, so with that simple addition, and by having the agent cache it somewhere, e.g. /opt/checkmk/agent/cache/agent.cache, the agent can quickly determine that, in the period between 1583230981 and 1583317381, this information does not need to be regenerated, re-cached or retransmitted.
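In shell terms that test is trivial. A sketch, where generate_agent_section stands in for whatever actually produces the section:

#!/bin/sh
# Sketch of the expiry check against the cached object
cache="/opt/checkmk/agent/cache/agent.cache"
now="$(date +%s)"
expiry="$(jq -r '.expiry.utc_epoch // 0' "${cache}" 2>/dev/null || echo 0)"

if [ "${now}" -lt "${expiry}" ]; then
    cat "${cache}"                            # still valid: replay it
else
    generate_agent_section | tee "${cache}"   # regenerate, re-cache, emit
fi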

The server side also has that information and knows not to bother expecting any change until after 1583317381.

Then the question becomes: if we’re splitting out the agent in this way to make everything asynchronous, how do we ensure the agents and server(s) stay synced up? A simple interim solution would be to have the agent check each cached object’s timestamp: if the object is less than, say, 180 seconds old, transmit it the next time the system is polled.

The next obvious step is to make it even more Prometheus-like by storing cached objects somewhere like /srv/checkmk/api/v1/. The above agent information would then be retrieved with a simple HTTP GET of http://remotehost/checkmk/api/v1/agent. This may then allow some legacy stuff like xinetd and custom ports to be retired. And it could potentially also simplify the code for encrypting information.

So now the decision about whether or not to transfer a piece of information can be entirely up to the server. In the above example, it sees an object named agent that’s expired, so it issues an HTTP GET to http://remotehost/checkmk/api/v1/agent to refresh it. Obviously, if it times out or an old object is transferred, the respective check goes into the appropriate state.

Downside: we’ve added an HTTP server as a dependency, but weirdly that should be a bit more palatable than xinetd, especially for those of us who have to deal with SOC techs. For a quick and dirty solution, something basic could be achieved with netcat. For something more involved, with access rules and certificates and the like, that can be left to the sysadmin’s whim. Speaking as a sysadmin, I have no problem deploying nginx and some configs around my fleet with ansible. Snore. OpenWRT is also fine in this regard, with netcat, apache, nginx, busybox httpd and its own uHTTPd available.
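To show just how quick and dirty the netcat option is, here’s a sketch that serves a single cached object while ignoring the request entirely (nc flags vary between netcat variants; this assumes a traditional nc with -l -p, and 6556 is just the familiar agent port):

# Quick-and-dirty sketch only; not for production
while true; do
    { printf 'HTTP/1.0 200 OK\r\nContent-Type: application/json\r\n\r\n'
      cat /srv/checkmk/api/v1/agent
    } | nc -l -p 6556
done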

This:

  "timestamp": {
    "utc_epoch": 1583230981
  },
  "expiry": {
    "utc_epoch": 1583317381
  }

Could be structured differently too, e.g.:

  "timestamps_utc": {
    "module_run_start": 1583230981,
    "module_run_end": 1583230983,
    "object_expiry": 1583317381
  },
  "module_metrics": {
     "return_code": 0,
     "real": "0m2.000s",
     "user": "0m0.060s",
     "sys": "0m0.060s"
  }
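Generating most of that from a module wrapper is straightforward. A sketch, where run_module and the 24-hour expiry are illustrative, and portably collecting real/user/sys is deliberately left out:

#!/bin/sh
# Sketch: wrap a module run with timing metadata and an expiry
run_module() {
    start="$(date +%s)"
    "$1"                                  # run the module itself
    rc=$?
    end="$(date +%s)"
    printf '{"timestamps_utc": {"module_run_start": %s, "module_run_end": %s, "object_expiry": %s}, "module_metrics": {"return_code": %s}}\n' \
        "${start}" "${end}" "$((end + 86400))" "${rc}"
}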

The possibilities are endless, and were very much ready to go two years ago.


Thanks for your reply!

That’s the worst that can happen, and we’ve mentioned before that it pushes motivated people away - really a pity and sorry to hear!

I mean, I can understand that it takes time and is hard to orchestrate, but yours have been sitting there for ages, and many things in these PRs are an improvement over what’s there right now - especially aligning all the UNIX agents on basic structure, functionality and speed.

I (we) know :frowning: - been watching your PRs for a long time.

Agreed - why make it complicated and create different ways? (Not discussing which one would be the best and most efficient.)

Unfortunately, the current agent design means that not all basic features (local, mrpe, async, interval, waitmax, spooldir, encryption, etc.) and improvements are always available in all agents.

Customers who use different operating systems and want to use these functions must laboriously request them again for each operating system as a feature request, which under certain circumstances can mean 1-2 years of waiting, because other things on the roadmap have higher priority than a single customer’s request.

This also makes it difficult to introduce new ideas, such as, for example, the parallel execution of plugins, MRPE or local checks to improve the agent runtime. Because this must be done separately for each agent, it is very time-consuming and expensive.

It is very unfortunate that good ideas and suggestions from community members like you, who probably work with CMK more than most developers at Tribe29 and have to live with these shortcomings, just peter out like this.

Please don’t give up the good work. It might still work out someday.


First of all I want to apologize. It’s simply not good to have a great community and GitHub presence and to then let the pull requests “die”. I understand how that takes away any motivation to continue contributing.

The challenge for us as a small organization is how to prioritize our resources between product strategy choices, feature requests, enterprise support, feedback, community contributions and similar. But I don’t want to give lame excuses. Our performance in working on PRs is not good. We will work on it and hope we can do better going forward.

In terms of code contributions, we have so far been a rather company-led project with smaller PRs. What is different in your projects is that you are planning bigger architectural changes of the agents. This is great and harmonizing the agents obviously has both customer and maintenance benefits. But it also meant that we need to get involved more to review larger PRs and think through the changes. This caused the process to get stuck on our side, since the team is currently focused on other priorities.

What I would propose – if you are still interested to reengage – is to actually collaborate more closely initially to make the collaboration more effective:

Start this effort off with a virtual planning session (video or telco) with you and members of our development team.

  • Align on the current state, changes in our master branch etc.
  • Jointly sketch out how to evolve the agents architecturally and pitfalls to watch out for based on our development and support experience across different customer types
  • Align on how to make the PR process productive for both sides (how to split them etc)
  • Help us understand what is important for you as a major contributor

Once this step (probably a bit unusual for an open source project) is done, it is much easier to shift to asynchronous mode again.

Let us know if you are interested. Thanks!

All the best,

Jan


Hi @jan.justus,
Thank you for taking the time to reach out and for pledging to improve upon the issues that have been raised.

In terms of code contributions, we have so far been a rather company-led project with smaller PRs. What is different in your projects is that you are planning bigger architectural changes of the agents. This is great and harmonizing the agents obviously has both customer and maintenance benefits. But it also meant that we need to get involved more to review larger PRs and think through the changes. This caused the process to get stuck on our side, since the team is currently focused on other priorities.

It seems to me that there’s been too much focus on my larger PRs, using them as an excuse for this experience. Yes, I have contributed some big ideas and a few large PRs, but I’ve actually contributed more smaller PRs, and I’ve had mixed results with them.

If we look at, say, #227 (mine), we can see that a somewhat straightforward PR sat there for approximately 152 days. It would have sat there longer had I not prodded.

#255 (not mine) was a slam-dunk PR, IMHO. 297 days. It would have sat there longer had I not prodded.

#166 (mine) was a simple incremental improvement that just needed a little polish from Sven. 251 days. It would have sat there longer had I not prodded.

At present, the oldest open PR, #52 (not mine), suggests adding a single line to a file. This is at 757 days and counting.

When I closed #28, it was 782 days old. #52, a single-line change, may soon be older than #28, a 3.1k-line change. How stuck on other priorities are you guys that you can’t make a call on whether a single line gets added to a file or not? I would like to repeat my earlier statement: this isn’t a good look for tribe29.

Looking at the queue now, it looks to me like 44 out of 50 open PRs are a year old or more. When each of these was opened, it may or may not have proposed reasonable ideas and/or code. But because they’ve been ignored and left to stagnate for so long, most of them have likely been surpassed, diverged from significantly, or otherwise made irreconcilable.

And this has been the case with the PRs that I closed. For example, #167 sat unattended until different code covering the same goal was committed separately, 461 days after the PR was first opened, with the net result being duplicated effort.

And that’s free effort that tribe29 is passively choosing to throw away.

It should be abundantly clear by now that PRs are falling through the cracks, irrespective of size or complexity.

Solution discussion:

Far be it from me to tell you how to do your own jobs. I don’t have any idea what your internal culture is like or what your current processes are, so with that lack of context in mind, I would suggest something like:

  • In the immediate term, maybe have a PR “spring clean” where the tribe29 team crunches through as many outstanding PRs as possible.
  • Going forward, set a maximum age that a PR can possibly reach. Let’s say 160 days to start.
  • Put monitoring on the PR queue. I’m sure you can find a monitoring system somewhere :slight_smile:
  • Set a Warning alert on aged PRs, with the threshold at something like 120 days. The idea is to bring ageing PRs back to the forefront for whoever is looking after the PR queue, and the increased attention should ideally move the PR towards either merging or closing.
  • Set a Critical alert on aged PRs, with the threshold at something like 140 days. This should invoke immediate and prioritised intervention: you’ve got 20 calendar days to figure it out and get the PR done, one way or another.
  • These hypothetical thresholds are simplistic and indexed from the point that a PR is opened; eventually you may want to index them on the last update within the PR instead… or just take that approach from the start, or take both metrics into account. Your call.
  • As this process gets properly bedded in, the thresholds can come down.
  • Or, if dogfooding doesn’t sound appealing, maybe something like actions/stale may be useful.

No matter what solution you end up with, the best time to engage with a PR is while it’s still fresh in its author’s mind.


CONTRIBUTING.md states:

If you would like to make a major change to Checkmk, please create a new topic under the Product Ideas category in the Checkmk Forum so we can talk about what you want to do. Somebody else may already be working on it, or there are certain topics you should know before implementing the change.

We love to work with community contributors and want to make sure contributions and time investments are as effective as possible. That’s why it is important to us to discuss major changes you might be planning in order to jointly agree on the best solution approach to the problem at hand.

So by posting this thread here (605 days ago), I was following the official guidance. Yet, much like on GitHub, I’m not seeing much engagement from tribe29 here. This forum is not exactly flooded with posts reading “Great idea! We’ll put it on the roadmap!”, “That’s a good idea, but we’re going in a different direction because of xyz…” or “Interesting idea, have you considered abc…”

Solution discussion:
I mean, this one is on you guys. Just like Github, you have a community here offering up ideas, code and assistance freely. The smart thing to do is to invest a bit of time figuring out how to leverage the community so that it’s picking up some of your workload.

For example, you guys obviously have an internal issue tracking system. Within that will be a bunch of issues. A subset of those will likely be commercially sensitive, but the rest will just be generic. Why not spend some time developing a way to mirror those issues into the GitHub issue tracker and see what the community contributes? Maybe send out free tribe29 merch to authors of exceptionally good or useful commits, and/or have an unofficial “community commit of the month”. I’m sure @fayepal would have some other great engagement ideas to wedge in - open up the opportunity and let her use her talents. Let the community pick up some of the load, and in doing so, free up more time for yourselves.

Now obviously that’s a rose-tinted, optimistic ideal. But it’s a goal at least; better than nothing, better than the status quo, and at least something that can be worked towards.


And then there’s this. So on the one hand I’m being effectively told that I should commit incrementally (something I demonstrably already have done), and on the other hand I’m being effectively told not to bother committing at all… and precisely at the point that I was about to open an incremental PR…

Solution discussion:
This one is easy. Change the definition of a bug to include “anything that annoys Rawiri” :smiley:

But seriously, this should be solved by straightening out your PR handling processes and offloading some workload to the community, as described above.


And on top of all of that… One of the biggest issues when developing and submitting a PR or suggesting a major architectural change is that this is often done with virtually no visibility or context of checkmk’s development direction. You guys keep that locked away pretty tight, and it’s to your detriment.

Solution discussion:

Again, this is on you guys. It would be great to have access to something like a technical roadmap that lays out the forthcoming project goals. Product Ideas can be accepted from this forum and elsewhere into the roadmap - meaning that it’s a constantly evolving “living document”.

From my experience, the only hints I have received about coding direction have come from passing comments on GitHub. If I had access to a reference document that defines what direction the codebase is pointed in, I could perhaps contribute towards the roadmap’s goals.

Without that kind of knowledge-share between tribe29 and its community, we’re all just thrashing about in the dark. For example, what is the intent behind cmk-agent-ctl?


The *nix agent code has also diverged off into a direction that I probably wouldn’t have taken it, and, without wanting to disrespect any of the recent contributors at all or their work, it looks to me like it’s being coded into a bit of a corner. This means that any architectural change is going to be increasingly difficult and painful, and it’s what I was hoping to avoid with the primary suggestion that I made with this thread.

Solution discussion:

Well, to be clear: I am obviously not the be-all *nix agent Super-Jesus, and I certainly don’t think or expect the agent code to be completely my way. The best I can do is contribute my ideas. For my part, I have to decide whether I want to reengage; you guys have to decide whether you’re going to:

a. accept contributions from me and
b. open up and let the community know what direction you’re headed in and
c. up your PR processing game. If I reengage, I don’t want simple PRs sitting there for hundreds of days. That blocks me from making further contributions and is a major disincentive. The same may or may not be true for other contributors.

If I do reengage, and you guys do manage to process your PRs in a timely manner, then we can thrash through a bunch of incremental PRs very quickly.


Finally, the agent scripts serve an important role within checkmk’s functionality. It has seemed to me that there has been limited interest from tribe29 in getting the agent’s fundamentals stabilised and then building the rest of the product from there. The agent code appears to have become a second-class citizen to more important things, like dark mode themes.

Solution discussion:
Well, this is totally on you guys to decide what you want to prioritise. FWIW I think that you’re 3-4 years behind on where the *nix agents should be.


What I would propose – if you are still interested to reengage – is to actually collaborate more closely initially to make the collaboration more effective:

Start this effort off with a virtual planning session (video or telco) with you and members of our development team.

  • Align on the current state, changes in our master branch etc.
  • Jointly sketch out how to evolve the agents architecturally and pitfalls to watch out for based on our development and support experience across different customer types
  • Align on how to make the PR process productive for both sides (how to split them etc)
  • Help us understand what is important for you as a major contributor

Once this step (probably a bit unusual for an open source project) is done, it is much easier to shift to asynchronous mode again.

I am still undecided about whether or not I want to reengage.

In the meantime, I will have to politely decline the offer for a planning session. Firstly: You guys are in Germany (I hope this isn’t news to you :slight_smile: ), and I’m in New Zealand. Our timezones just don’t map nicely in a way that I can factor in around my day-job and family time. Also, I suspect that this lengthy post might be taken a bit on the nose and you guys might not want to talk to me for a while, if ever.

Secondly, this isn’t (and shouldn’t be) about me: I don’t expect special treatment. In my view, and as I stated above, the best outcome here is for tribe29 to leverage its community in a way that takes some of the workload off tribe29. Instead of trying to bring me closer, try being closer to your community.

Cheers

Rawiri


Hi Rawiri,

thanks for the time you took to answer and comment on Jan’s post.
Not my place to comment on its entirety, but I just wanted to comment on this point:

Take a look here (updated pretty regularly):

We also make much of our roadmap transparent during our conference and publish the videos and presentations shortly afterwards: The Checkmk Conference #10

That may be a good place to start to see what goals we’re working towards.

Best
Elias


Hi Elias,
thanks for that. That’s a bit higher level than I was meaning - I was being intentionally specific when I said “technical roadmap”. For example, if I look at the descriptions on a couple of entries that are relevant to my interests:

Various performance improvements to the AIX agent.

and

Significant enhancements to the MS-SQL check.

These descriptions are not useful or helpful.

But, to your credit, it is more than nothing, and it can always be built upon.

So with the same caveat as before: I don’t know what you’re currently using internally to do this, so the following is described with that lack of inside knowledge and context in mind.[1]

To start, create a description template that is used for creating the descriptions. This is so that when someone clicks on [Read More], they’re actually going to be reading something meaningful. Who, What, Where, When, Why and How is usually a good starting point for such a template if you have no other ideas for one.

Next, let’s assume, for the sake of this hypothetical exercise, that you guys begin using GitHub issue tracking as part of engaging more collaboratively and openly with the community. Then, with each entry on this roadmap, you can simply add a link underneath [Read More] to the related GitHub issue. Name the link something cute, like “[Want to help?]”.

This way, anyone looking at this roadmap has the ability to see a high level overview, and if they’re more interested at a technical level, they can simply click a link and get right into where the coding is happening.

Now, within either or both of your internal issue tracker and our hypothetical GitHub issue tracker, start tagging issues with tags like “Roadmap-Consideration”, “Roadmap-Planned” and “Roadmap-In-Progress”. Based on these tags, you can automate the building of the roadmap. Extract titles and “Read More” descriptions straight from the issue ticket, and now you’re working smarter, not harder™.
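For example, the list of roadmap candidates could be pulled straight off the GitHub issues API. A sketch; the repo and label names are illustrative:

# List open issues carrying a hypothetical roadmap label, title plus URL
curl -s 'https://api.github.com/repos/tribe29/checkmk/issues?labels=Roadmap-Planned&state=open' |
    jq -r '.[] | "\(.title) - \(.html_url)"'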

Again, all of that is offered with the aforementioned caveat :slight_smile:

Cheers

Rawiri

[1] See, it’s interesting how this exercise parallels writing code-based PRs for checkmk. I’m just flying blind, suggesting the best I can based on what few assumptions I have at hand and hoping I hit a useful mark somewhere.


@rawiriblundell
Thank you for taking the time and addressing all your points in such detail! You’re nailing it!
I will not reply to the GIT/PR discussion as there is really not more to add.
I heard some others feel the same way - so let’s see what’s gonna change.

But you addressed another very good point - the ROADMAP, or actually the “technical roadmap”.
We also saw the Roadmap page; some of the, let’s call them ‘tasks’, have more meat on the bone than others.

Two very good examples with no meat on the bone. Especially as the topic of this thread is “nix agents”, we would be super curious to see what the performance improvements will be, and whether it isn’t again pushing the agent even further into a dead end, as you mentioned above.
@andreas-doehler also had very good ideas about how to improve the agent’s speed and how it works, which brings me to the next point, the “GitHub issue tracker”, where this would/could work out, for example.

Nice suggestion.
“Roadmap-Planned” and “Roadmap-In-Progress” would at least give people the option to bring input, as we’ve often noticed that the “topic” sounded nice, but the final product was not as expected, or created some issues in environments that weren’t previously thought about.

In addition to the points you mentioned: what I personally find sad is that, of the many brilliant suggestions written and upvoted here, not even the top ones seem to be cherry-picked by T29.

Same as for GitHub: if nothing is going to happen, then people will stop submitting, reading and upvoting ideas, as they feel nothing is happening anyway, and in that way they stop taking part in the community.
For me it feels like @andreas-doehler & @r.sander are engaging the most in this part of the forum (Andreas for sure in any part of the forum :wink:), but as this is about ideas and possible roadmaps, there should be someone who is not only answering, but also questioning many of these posts, to find out the root problems people are facing while using CMK, and, by solving many of them, making it in the end a better product for all.

Cheers


Looked simple and like a good start, but it seems there was no communication - so it was closed again?

@jan.justus?
So even after this thread, and with 20 days passed, there has been no communication with a developer who is constantly and desperately trying to improve CMK’s agents, to simplify them and prepare them for future improvements (as other tools overtake the once good and solid agent) - still no priority on CMK’s side?

Hey,
The entire product ideas topic is something we decided today to restructure; we will try a new approach, as the current one is not working as we originally planned.

To the pull requests: we clearly state on our GitHub that we currently focus only on pure bugfixes. This is something we currently feel comfortable handling.

If you take a look at recently closed PRs, you will see that many have been accepted and merged:
PR 407: Fix Bug when ora_pmon Process of Oracle DB is not running
PR 406: [agent_ipmi_sensors] adds -I for ipmitool
PR 405: Update check_mk_agent.linux
PR 404: FreeBSD agent: initialize spooldir (bugfix)
PR 403: added missing agent version check string
PR 402: Fix AttributeError in Virtual Host Tree with old Hosts
PR 401: Fixed safenet_hsm and safenet_ntls checks not working on newer Thales HSMs.
PR 400: Update kaspersky_av_client.vbs
I could continue like this to showcase that we care a lot about pull requests. But with each PR we close, a new one is created :slight_smile: Which is great! But there is only so much we can do.
We ask for your understanding of this approach.

Don’t forget: until mid-2018, checkmk.git was only available on git.mathias-kettner.de, with no opportunity for interaction. Back then, I pushed our dev team to move towards GitHub, because that’s what an open source project should do. While the dev team had their concerns regarding exactly the discussion we are having here, I ignored them. So don’t be mad at them, be mad at me. The alternative is to go back to the old ways, but we think the current path is better, with us actively communicating what we can handle and what we can’t.

Cheers, Martin


@rawiriblundell
I’m curious - was there any direct, productive exchange between you and Tribe29 about the topic discussed here? Was there a meeting?
And if so, what’s the roadmap now?

Last post here was from November

Cheers
