Hi @frank-wettstein and @ChrisZIT,
After looking at the mk_postgres.py source code, I can now give a more complete picture.
Root cause: stale system catalog statistics
The bloat query reads reltuples and relpages from pg_class. When table statistics are stale (autovacuum not keeping up), the query planner chooses a poor execution plan, leading to 7–40s runtimes.
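To make the stale-statistics effect concrete, here is a small sketch of the kind of pg_class-based estimation a typical bloat query performs. This is my own simplified illustration, not the actual query in mk_postgres.py; the function name, row width, and sample numbers are all invented:

```python
# Illustrative sketch only: typical pg_class-based bloat estimation.
# mk_postgres's real query is more involved; all names here are mine.

BLOCK_SIZE = 8192  # default PostgreSQL page size in bytes

def estimated_bloat_pages(reltuples: float, relpages: int, avg_row_bytes: float) -> int:
    """Pages the table 'should' need for reltuples rows vs. what it actually uses.

    If reltuples is stale (autovacuum behind), the expected page count is
    wrong, so both the bloat estimate and the planner's choices go off.
    """
    rows_per_page = max(1, int(BLOCK_SIZE / avg_row_bytes))
    expected_pages = -(-int(reltuples) // rows_per_page)  # ceiling division
    return max(0, relpages - expected_pages)

# Same physical table (15_000 pages, ~100-byte rows), fresh vs. stale row count:
print(estimated_bloat_pages(1_000_000, 15_000, 100))  # fresh statistics
print(estimated_bloat_pages(200_000, 15_000, 100))    # stale, undercounted reltuples
```

With an undercounted reltuples, the same table looks several times more bloated than it is, which is why ANALYZE changes the query's behavior so dramatically.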
Additional finding: no statement_timeout on the bloat query
Looking at the source: get_stats() explicitly sets statement_timeout=30000 (30s) before running. get_bloat() does not. This means a slow bloat query can hang indefinitely and block the entire agent output — which explains Chris’s agent timeouts.
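The same guard get_stats() uses could be applied to the bloat query. A minimal sketch of that pattern follows; the function and the placeholder query text are my own illustration, not mk_postgres's actual code:

```python
# Sketch: prefix a query with a session statement_timeout, mirroring what
# get_stats() does. All names below are illustrative, not mk_postgres's API.

def with_statement_timeout(query: str, timeout_ms: int = 30000) -> str:
    """Prefix a query with a statement_timeout so a slow run is cancelled
    by the server instead of hanging the agent plugin indefinitely."""
    return f"SET statement_timeout TO {timeout_ms};\n{query}"

# Placeholder standing in for the real (much longer) bloat query:
bloat_query = "SELECT schemaname, relname /* ... bloat estimation ... */ FROM pg_class;"
print(with_statement_timeout(bloat_query))
```

With server-side cancellation in place, a pathological bloat query fails fast and the rest of the agent output still arrives, instead of the whole agent timing out.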
Also confirmed: no section exclusion is possible
The execute_all_queries() method calls all sections unconditionally in sequence. There is no EXCLUDED_SECTIONS config option. Frank’s feature request is fully legitimate — worth submitting to https://ideas.checkmk.com.
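For the feature request, such an option could be quite small. The sketch below shows one possible shape; to be clear, no EXCLUDED_SECTIONS option exists in mk_postgres today, and every name here is invented for illustration:

```python
# Hypothetical sketch of a section-exclusion filter. mk_postgres has no such
# option today; SECTIONS, the section names, and the output strings are all
# invented to illustrate the idea.

SECTIONS = {
    "stats":       lambda: "<<<postgres_stats>>>",
    "bloat":       lambda: "<<<postgres_bloat>>>",
    "connections": lambda: "<<<postgres_connections>>>",
}

def execute_all_queries(excluded_sections=()):
    """Run every section except those explicitly excluded."""
    return [fn() for name, fn in SECTIONS.items() if name not in excluded_sections]

# Skipping only the expensive bloat section:
print(execute_all_queries(excluded_sections={"bloat"}))
```

That would let sites keep the cheap sections at a one-minute interval while dropping only the expensive one.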
Fixes:
1. Immediate: run ANALYZE on the affected database. This refreshes the statistics and brings the bloat query back to normal speed (Chris's workaround: correct, but temporary).
2. Recommended: async execution with cache interval
Bloat data does not need to be checked every minute. Move the plugin to an async subfolder:
mkdir -p /usr/lib/check_mk_agent/plugins/3600
mv /usr/lib/check_mk_agent/plugins/mk_postgres /usr/lib/check_mk_agent/plugins/3600/
This runs the plugin (including the bloat query) only once per hour instead of every minute, eliminating the DB load issue entirely.
3. Long-term: tune autovacuum so statistics stay fresh and the bloat query stays fast.
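As a starting point for the autovacuum tuning in step 3, the settings below make autovacuum analyze and vacuum large tables sooner. The values are examples to adapt, not universal recommendations; measure before and after on your own workload:

```ini
# postgresql.conf -- example values, tune per workload
autovacuum = on
autovacuum_analyze_scale_factor = 0.02   # default 0.1; analyze after ~2% row churn
autovacuum_vacuum_scale_factor  = 0.05   # default 0.2; vacuum big tables sooner
autovacuum_naptime = 30s                 # default 1min; check tables more often
```

For a handful of known-hot tables, per-table overrides via `ALTER TABLE ... SET (autovacuum_analyze_scale_factor = ...)` are usually preferable to aggressive global settings.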
Try it out and let us know how it goes.
Happy monitoring!