Timeouts in the Solaris Checkmk agent

In our test of Checkmk as a monitoring replacement, we have also installed the agent on our Solaris hosts. On these hosts we very regularly see timeouts for plugins in the Check_MK agent service:

Timed out plugin(s): solaris_prtdiag_status WARN

This happens with prtdiag, the zpool plugin and the agent updater. These are the plugins that run in ‘cached’ mode, so we assume it has something to do with that. If we manually run the commands the agent runs, they all work and produce output quickly.

Does anybody have an idea what might be happening here? It fills the service log quite quickly.

prtdiag is known to be very slow on some Solaris systems. Unfortunately, I don’t know the exact reason for this. Maybe there are some Solaris experts here who can shed more light on this issue.

If this specific check does not provide critical data for you, you can disable this section.

Since disabling sections for Solaris is still not officially supported by Checkmk 2.3 (or I may have missed it in Setup → Agents → Windows, Linux, Solaris, AIX → Agent rules → Disabled sections), you need to disable it locally on the server by adding the following to:

${MK_CONFDIR}/exclude_sections.cfg
MK_SKIP_SOLARIS_PRTG=true

Here is a list of the Solaris sections you can disable by adding the corresponding variables to the exclude_sections.cfg file:

MK_SKIP_CHECKMK_AGENT_PLUGINS
MK_SKIP_JOB
MK_SKIP_DF
MK_SKIP_ZFS
MK_SKIP_ZFS_ARC_CACHE
MK_SKIP_PS
MK_SKIP_STATGRAB
MK_SKIP_CPU
MK_SKIP_UPTIME
MK_SKIP_NTP
MK_SKIP_TCP
MK_SKIP_MULTIPATHING
MK_SKIP_FILEINFO
MK_SKIP_LIBELLE
MK_SKIP_SOLARIS_FMADM
MK_SKIP_SOLARIS_SERVICES
MK_SKIP_SOLARIS_PRTG
MK_SKIP_ZPOOL
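
If you want to script this, here is a minimal sketch. It assumes that the file takes plain VARIABLE=true lines as in the example above; the helper name add_skip_var is made up for illustration and is not part of the agent.

```shell
# Hypothetical helper (not part of the Checkmk agent):
# append VARIABLE=true to a config file unless that variable is already set there.
# Usage: add_skip_var FILE VARIABLE
add_skip_var() {
    asv_file=$1
    asv_var=$2
    # Nothing to do if the variable already has an entry.
    if grep "^${asv_var}=" "$asv_file" >/dev/null 2>&1; then
        return 0
    fi
    echo "${asv_var}=true" >> "$asv_file"
}

# Example (not executed here; MK_CONFDIR must point at the agent's config directory):
# add_skip_var "${MK_CONFDIR}/exclude_sections.cfg" MK_SKIP_SOLARIS_PRTG
```

Calling it twice with the same variable leaves only one entry in the file.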

Thanks Lars. We do want to monitor both zpool and prtdiag. The problem does not seem to be the commands themselves: if we run them by hand, they are quick. If you look at when the timeouts occur, the timing corresponds to the interval at which the checks are supposed to run, which leads us to suspect that the caching mechanism itself has a flaw.

At times, the prtdiag command-line execution appears to freeze. When the command runs for too long (approximately twice the configured interval, as I understand it), the agent’s caching mechanism terminates the call. This might explain the behavior you’re observing when the command is executed by the agent.

It is quite likely that when prtdiag is executed via init, a daemon process, or in a non-login shell context, it runs with different system settings or environment variables. This could potentially affect its execution time and behavior.

When executed by the agent with time, I observed that prtdiag occasionally takes significantly longer to complete. Unfortunately, the system administrator was unable to determine the root cause of this sporadic delay.

You might consider running the prtdiag command in a loop using time and a 10-second sleep interval to monitor how the execution times behave over a longer period. Since we only observed this issue on a few specific servers and the data isn’t critical for us, we decided not to investigate it further.
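
A rough sketch of such a loop, in case it helps. It assumes a date that supports +%s (which may not be true on older Solaris releases), so it only has one-second resolution. The function name time_runs is made up for this example.

```shell
# Hypothetical helper: run a command repeatedly and log how long each run takes.
# Assumes `date +%s` is supported; resolution is whole seconds.
# Usage: time_runs COUNT PAUSE_SECONDS COMMAND [ARGS...]
time_runs() {
    tr_count=$1
    tr_pause=$2
    shift 2
    tr_i=1
    while [ "$tr_i" -le "$tr_count" ]; do
        tr_start=$(date +%s)
        tr_rc=0
        # Discard the command's output; only duration and exit code are logged.
        "$@" >/dev/null 2>&1 || tr_rc=$?
        tr_end=$(date +%s)
        echo "run=$tr_i seconds=$((tr_end - tr_start)) exit=$tr_rc"
        tr_i=$((tr_i + 1))
        # Pause between runs, but not after the final one.
        if [ "$tr_i" -le "$tr_count" ]; then
            sleep "$tr_pause"
        fi
    done
    return 0
}

# Example (not executed here): 360 runs with a 10-second pause, roughly one hour
# time_runs 360 10 /usr/sbin/prtdiag
```

Redirecting the output to a file and sorting by the seconds field afterwards should make any sporadic slow runs easy to spot.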

Notes from an old monitoring system I wrote:

# prtdiag sometimes never returns, wait approx 60 seconds and kill it
prtdiagoutv=`/usr/platform/$modelv/sbin/prtdiag -v & pid=$!; (pc=1;while [ $pc -le 60 ]; do kill -0 $pid 2>/dev/null;if [ $? -ne 0 ]; then break;fi;sleep 1;pc=\`expr $pc + 1\`;done;kill $pid 2>/dev/null)&`
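
The same idea written out as a reusable function might look like the sketch below. This is not what the Checkmk agent does internally, just a readable variant of the one-liner above; the name run_with_timeout and the exit code 124 are my own choices.

```shell
# Hypothetical helper: run a command, kill it if it is still running after LIMIT seconds.
# Returns the command's own exit code, or 124 if it had to be killed.
# Usage: run_with_timeout LIMIT COMMAND [ARGS...]
run_with_timeout() {
    rwt_limit=$1
    shift
    "$@" &
    rwt_pid=$!
    rwt_waited=0
    # Poll once per second while the child is still alive.
    while kill -0 "$rwt_pid" 2>/dev/null; do
        if [ "$rwt_waited" -ge "$rwt_limit" ]; then
            kill "$rwt_pid" 2>/dev/null || true
            wait "$rwt_pid" 2>/dev/null || true
            return 124
        fi
        sleep 1
        rwt_waited=$((rwt_waited + 1))
    done
    # Child has finished on its own; pass its exit code through.
    wait "$rwt_pid"
}

# Example (not executed here):
# prtdiagoutv=$(run_with_timeout 60 /usr/sbin/prtdiag -v)
```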

As I said, these are very likely not timeouts of the commands themselves. The command is quick, and we also see the problem for the zpool part, which has no such issues. If I look at the timing, the timeout alerts occur at exactly the interval at which the command is supposed to run.

For example zpool is called as:

        _run_cached_internal "zpool_status" 120 120 360 240 "echo '<<<zpool_status>>>'; /sbin/zpool status -x"

The second number is the refresh interval. We see a reported timeout every 2 minutes, which clears again the following minute.

prtdiag is called as:

            _run_cached_internal "solaris_prtdiag_status" 300 300 900 600 \
                'echo "<<<solaris_prtdiag_status>>>"; /usr/sbin/prtdiag 1>/dev/null 2>&1; echo $?'

And that one is indeed reporting the timeout every 5 minutes, even though the section output still seems to be present in the agent output.

I know prtdiag can have problems, but that is not the case here. Also, zpool shows the same behavior.
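
To compare the gaps between timeouts with the refresh values (120 and 300 above), something like the following sketch can be used. It assumes you have already collected the times of the timeout events as epoch seconds, one per line; how you extract those from the service log is up to you, and the function name print_deltas is made up.

```shell
# Hypothetical helper: read epoch timestamps (one per line, ascending) on stdin
# and print the gap in seconds between each timestamp and the previous one.
print_deltas() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Example (not executed here; timeouts.txt is a hypothetical file of epoch seconds):
# print_deltas < timeouts.txt | sort -n | uniq -c
```

If nearly all gaps equal the refresh value of the plugin, that points at the caching schedule rather than at a slow command.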

May I know which Checkmk version and agent version you are using?

We are running 2.4.0p5.cce. Agent version is also the latest, we are using the automatic agent updater.

Is the /var/lib/check_mk_agent/cache folder full of .cache.fail and .cache.new.PID files?
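
A quick way to count them, as a sketch. The file name patterns are taken from the question above; the function name count_cache_leftovers is made up for illustration.

```shell
# Hypothetical helper: count leftover *.cache.fail and *.cache.new.* files in a directory.
# Usage: count_cache_leftovers DIRECTORY
count_cache_leftovers() {
    ccl_dir=$1
    find "$ccl_dir" -type f \( -name '*.cache.fail' -o -name '*.cache.new.*' \) \
        | wc -l | tr -d ' '
}

# Example (not executed here):
# count_cache_leftovers /var/lib/check_mk_agent/cache
```

A result of 0 means only regular .cache files (or nothing) are present.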

Nope, I am just seeing the .cache files there, nothing else.

We have an internal ticket on this topic as well. I will keep you posted on the outcome.

The fix will be part of this werk: Werk #18472: Restore async/cached agent plugins. It will be included in p9, which is due next week.

Thanks for picking this up.