In our test of Checkmk as a monitoring replacement, we have also installed the agent on our Solaris hosts. On these hosts we very regularly see plugin timeouts in the Check_MK agent service:
Timed out plugin(s): solaris_prtdiag_status WARN
This happens with prtdiag, the zpool plugin and the agent updater. These are the plugins that run in ‘cached’ mode, so we assume it has something to do with that. If we manually run the commands that the agent runs, they all work and return output quickly.
Does anybody have an idea what might be happening here? It fills the service log quite quickly.
prtdiag is known to be very slow on some Solaris systems. Unfortunately, I don’t know the exact reason for this. Maybe there are some Solaris experts here who can shed more light on this issue.
If this specific check does not provide critical data for you, you can disable this section.
Since disabling sections for Solaris is still not officially supported by Checkmk 2.3 (or I may have missed it in Setup → Agents → Windows, Linux, Solaris, AIX → Agent rules → Disabled sections), you need to disable it locally on the server by adding the following to:
Thanks Lars. For the monitoring we do want both the zpool and the prtdiag data. The problem does not seem to be the commands themselves: if we run them by hand, they are quick. If you look at when the failures occur, the timing corresponds with the interval at which the checks are supposed to run, which leads us to suspect that the caching mechanism itself has a flaw.
At times, the prtdiag command-line execution appears to freeze. When the command runs for too long (approximately twice the configured interval, as I understand it), the agent’s caching mechanism terminates the call. This might explain the behavior you’re observing when the command is executed by the agent.
It is quite likely that when prtdiag is executed via init, a daemon process, or in a non-login shell context, it runs with different system settings or environment variables. This could potentially affect its execution time and behavior.
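One way to test this idea is to time the same command once with your interactive environment and once with a stripped-down one. The following is only a rough sketch: `run_timed` and `MINIMAL_ENV` are names I made up, the variables in `MINIMAL_ENV` are a guess at a daemon-like environment and not what the agent actually sets, and `/usr/sbin/prtdiag` is assumed to be the path on your hosts.

```python
import subprocess
import time

# Assumption: a rough approximation of a non-login, daemon-like environment.
# The real environment the agent runs with may differ.
MINIMAL_ENV = {"PATH": "/usr/sbin:/usr/bin", "LC_ALL": "C"}


def run_timed(cmd, env=None, timeout=120.0):
    """Run cmd once and return (duration_seconds, returncode).

    env=None inherits the current environment; pass a dict to replace it.
    stdin is detached, since a daemon usually has no terminal either.
    """
    start = time.monotonic()
    proc = subprocess.run(
        cmd,
        env=env,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return time.monotonic() - start, proc.returncode


# Example usage on a Solaris host (path is an assumption):
# print("inherited env:", run_timed(["/usr/sbin/prtdiag", "-v"]))
# print("minimal env:  ", run_timed(["/usr/sbin/prtdiag", "-v"], env=MINIMAL_ENV))
```

If the two runs differ noticeably in duration or return code, the environment is a plausible suspect. If they do not, that at least rules out the simplest variant of this explanation.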
When executed by the agent with time, I observed that prtdiag occasionally takes significantly longer to complete. Unfortunately, the system administrator was unable to determine the root cause of this sporadic delay.
You might consider running the prtdiag command in a loop using time and a 10-second sleep interval to monitor how the execution times behave over a longer period. Since we only observed this issue on a few specific servers and the data isn’t critical for us, we decided not to investigate it further.
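A minimal sketch of such a loop, in case it helps. The function name `time_command` and its parameters are my own, and the prtdiag path in the usage comment is an assumption. It only records how long each run takes and whether it hit the timeout; it does not reproduce the agent's caching.

```python
import subprocess
import time


def time_command(cmd, runs=5, pause=10.0, timeout=120.0):
    """Run cmd `runs` times, sleeping `pause` seconds between runs.

    Returns a list of (duration_seconds, returncode) tuples.
    returncode is None when the run exceeded `timeout` and was killed.
    """
    results = []
    for i in range(runs):
        start = time.monotonic()
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            rc = proc.returncode
        except subprocess.TimeoutExpired:
            rc = None
        duration = time.monotonic() - start
        results.append((duration, rc))
        print(f"run {i + 1}: {duration:.2f}s rc={rc}")
        if i < runs - 1:
            time.sleep(pause)
    return results


# Example usage on a Solaris host (path is an assumption), about one hour:
# time_command(["/usr/sbin/prtdiag", "-v"], runs=360, pause=10.0)
```

Any run with a clearly longer duration, or `rc=None`, would point to the sporadic slowness described above. If every run stays fast over several agent intervals, the command itself is less likely to be the cause.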
As I said, these are very likely not timeouts of the commands themselves. The prtdiag command is quick, and we also see the problem with the zpool plugin, whose command has no issues. If I look at the timing, the failures occur at exactly the interval at which the command is supposed to run.