I’m upgrading my Nagios CheckMk setup from v1.2.8 to v1.6.0p14, and I’m trying to move my Solaris 11.1 systems off the v1.2.0p3 agent they currently have installed.
The problem seems to be that the agent runs, but takes 2-3 minutes to return all its data, which causes the client to time out after 110 seconds.
The 1.2.0p3 agent returns its data in under 5 seconds. With the 1.6.0p14 agent, the problem looks to be in the ps (process) section, where it takes forever to report on approximately 1200 processes.
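To confirm which section is actually eating the time, here is a minimal Python sketch (not part of CheckMk) that runs an agent and attributes wall-clock time to each section. It assumes section headers are lines of the form <<<name>>>, counts time from one header to the next (so the figures are approximate), and uses the 110-second timeout above as its threshold. The agent path is whatever you pass as the first argument.

```python
import subprocess
import sys
import time


def time_sections(cmd, threshold=110.0):
    """Run cmd and attribute wall-clock time to each <<<name>>> section.

    Returns (timings, total, over_threshold), where timings is a list of
    (section_name, seconds). A section's time runs from its header line to
    the next header (or end of output), so the figures are approximate.
    """
    start = time.monotonic()
    timings = []
    current, section_start = None, start
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True,
                          errors="replace") as proc:
        for line in proc.stdout:
            text = line.strip()
            # Assumed header format: <<<name>>> on a line of its own.
            if len(text) > 6 and text.startswith("<<<") and text.endswith(">>>"):
                now = time.monotonic()
                if current is not None:
                    timings.append((current, now - section_start))
                current, section_start = text[3:-3], now
    end = time.monotonic()  # the Popen context manager has waited for exit
    if current is not None:
        timings.append((current, end - section_start))
    total = end - start
    return timings, total, total > threshold


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the agent to run; its path is up to you.
    timings, total, slow = time_sections([sys.argv[1]])
    for name, secs in sorted(timings, key=lambda t: t[1], reverse=True):
        print(f"{secs:8.2f}s  {name}")
    print(f"total {total:.2f}s" + (" (over threshold)" if slow else ""))
```

Run against the agent, it lists sections slowest first, which should show whether the ps section really accounts for the 2-3 minutes.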
I’ve even tried copying over the check_mk_agent.solaris from 1.4.0p38, but it also takes forever (about 3 minutes) to return results, which is just crazy slow.
So I ran the 1.4.0p38 agent with the -d switch for debugging, and it generated over 11,111,000 lines (about 250 MB) of debug output. Running the plain command without -d produced only 1200 lines, but it still took three minutes. Something stupid is going on in the agent script.
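With that much debug output, a tally of which commands show up most often should point straight at the hot loop. Here is a minimal Python sketch, assuming the -d output is a shell execution trace with lines prefixed by + (the set -x style; that is an assumption about how this agent implements -d). It counts only the first word of each trace line so the same command with different expanded arguments is grouped together.

```python
import collections
import sys


def top_trace_commands(lines, n=10):
    """Tally the first word of each shell trace line; return the n most common.

    Assumes trace lines start with one or more '+' characters (set -x style);
    other lines, such as set -v echoes, are ignored.
    """
    counts = collections.Counter()
    for line in lines:
        if line.startswith("+"):
            # Strip the nesting '+'s, then keep only the command word so that
            # the same command with different expanded arguments is grouped.
            words = line.lstrip("+").split(maxsplit=1)
            if words:
                counts[words[0]] += 1
    return counts.most_common(n)


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1]: a file holding the captured debug output (name is up to you).
    with open(sys.argv[1], errors="replace") as trace:
        for command, count in top_trace_commands(trace):
            print(f"{count:10d}  {command}")
```

Since set -x writes its trace to stderr, capture the debug run with both stdout and stderr redirected into the file before feeding it to the script.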
It looks to be this block, where it loops over the output of ps, and if a line has more than 100 fields (characters?), it goes into a really slow loop. So I said screw it and bumped the limit up to 200 to see how it ran…
And it still takes a stupidly long time to run. I’m thinking of disabling the ps block entirely, just because it sucks to have it take so long. It’s all in this block, but the comments don’t make much sense, and I think it’s trying way too hard to be smart.
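For comparison, here is a rough Python sketch of handling each ps line in a single pass: keep the leading columns intact and cap the command's arguments, instead of looping over every field. It only illustrates the idea. The three leading columns (n_cols) and the 100-argument cap (max_args) are hypothetical placeholders taken from this post, not the agent's actual ps format.

```python
def shorten_ps_line(line, n_cols=3, max_args=100):
    """Keep the first n_cols columns intact and at most max_args words of the
    command that follows, using a bounded number of splits per line.

    n_cols and max_args are hypothetical placeholders for illustration.
    """
    parts = line.split(None, n_cols)
    if len(parts) <= n_cols:
        # No command portion: just normalise whitespace.
        return " ".join(parts)
    cols, rest = parts[:n_cols], parts[n_cols]
    # Split the command at most max_args times and drop the unsplit remainder.
    args = rest.split(None, max_args)[:max_args]
    return " ".join(cols + args)
```

The design point is that each line is split a bounded number of times, so 1200 processes cost 1200 cheap operations rather than a loop iteration per field.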
Has anyone else got a solution for this?