Thanks for the hints so far.
So far I have have updated the trap, and had it call a cleanup function, and within the function, I had it write a status line to a log file, using the ‘date’ command, to get a date and time stamp, process ID, and the number of seconds the script had been running, and return value the function had received.
I then reused the ‘data’ command, and injected it after each block of code, using the function of the previous block as a label for that date command.
The lead me to notice a random large jump in ‘plugins’. A few more ‘date’ lines inside the ‘plugins’ code revealed that the mk_inventory plugin was the culprit. I moved it from /usr/lib/check_mk_agent/plugins/ to …/plugins/300/, and that resolved 99% of my problems.
Now on to the last 1%.
With mk_inventory in plugins/ it was being running within the check. Most of the time, this wasn’t an issue, because it would only run to completion, once every four hours, and on most systems, it could run to completion within the 60 seconds allocated to running checks. On the few combinations of systems and loads, it could not, we got a failed service check, and thus a systemd failed service warning. By moving it to plugins/300/ we moved it out of the time window, and thus, fewer failures.
Since the mk_inventory script only ran to completion once every 4 hours (default), running it cached, with a 5 minute window, is an acceptable hit.