I’m evaluating monitoring software and checkmk looks pretty good. Our setup is a mix of public cloud we operate and on-prem servers we deploy to, but do not operate. The servers are very bandwidth constrained and their network isn’t reliable.
How does checkmk fare in these kinds of environments? Ideally I’d have:
Extremely small metrics and alerting payloads sent over the network
Buffered metrics collection that can be resumed when the network comes back
The ability to store and analyze metrics, and manage alerts, agents, etc. on the server AND ship some of those metrics to our cloud for analysis and collation. Is that possible?
Thanks in advance for any insights you may be able to share.
Hello,
checkmk should work for you.
Have a look in the manual to learn how the agent and checks work.
There are tons of options to manage which checks run and how often they are executed.
If possible, use the enterprise edition to get access to the Agent Bakery.
Ralf
As mentioned by r.sander, use distributed monitoring and place a remote monitoring server close to the monitored nodes. This way you only transport a limited amount of information between your central monitoring server and the remote monitoring servers.
E.g. we have a remote monitoring server in each European Azure DC, and the central monitoring server is on-prem.
This should be possible with either the REST API or Livestatus queries, or a combination of both.
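To give a feel for the Livestatus side: a query is just a small block of text sent to the site's socket, so you can filter down to exactly the rows you want to ship to the cloud. The sketch below builds such a query (the column list and filter are illustrative, not a recommendation) and only prints it; on a Checkmk site you would pipe it into the Livestatus socket as shown in the comment.

```shell
# Sketch of a Livestatus query a central site could run against a remote
# site; the column list and filter are examples, adjust to your needs.
query='GET services
Columns: host_name description state
Filter: state > 0
OutputFormat: csv'

# On a Checkmk site you would send it to the site's Livestatus socket, e.g.:
#   printf '%s\n\n' "$query" | unixcat "$OMD_ROOT/tmp/run/live"
printf '%s\n' "$query"
```

Only the matching rows cross the WAN link, which is why this pattern fits bandwidth-constrained setups better than pulling full agent output centrally.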
Hi,
There are upcoming Checkmk features that allow the agent to push its data. Today Checkmk talks to the client, which is not ideal in a private cloud, as you have to expose Checkmk to the public internet (unless you have VPN/MPLS/etc.).
There might also be a proxy server that can deal with this.
Having a separate site (distributed monitoring) is not ideal either, as you have to expose ports once again (this time more of them), and Livestatus works in real time. So if you want to see metrics from your remote on-prem sites, that information is retrieved over your slow WAN connection and will take whatever bandwidth is available. If the site goes down you will not get any metrics at all.
Follow-up question. I have a PoC (Enterprise) running on a machine. With about 20 metrics, a single run will consume ~120KB to send to the dashboard host. With a single metric, that run consumes ~110KB. I didn’t set up any encryption. So there seems to be ~100KB of overhead.
When I run check_mk_agent manually, the output to stdout is pretty large, even with just one local metric. I don’t know if that’s related to the 100KB overhead on the network.
Is there a way to reduce that overhead so a run only sends a tiny payload? My deployment environment is extremely bandwidth constrained and every byte counts.
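One way to see where those bytes come from, before touching the network at all, is to sum the agent output per `<<<section>>>` marker. The awk one-liner below is a sketch: here it runs on a canned sample, but on a monitored host you would feed it `check_mk_agent` output instead of the heredoc (the section names are examples).

```shell
# Sum agent output bytes per <<<section>>> to find which sections dominate.
# The heredoc sample stands in for `check_mk_agent` output on a real host.
sizes=$(awk '
  /^<<<.*>>>$/ { sec = $0 }
  { bytes[sec] += length($0) + 1 }          # +1 for the trailing newline
  END { for (s in bytes) print bytes[s], s }
' <<'EOF'
<<<df>>>
/dev/sda1 ext4 41152812 20576406 18459574 53% /
/dev/sdb1 ext4 103081248 51540624 46334237 53% /data
<<<local>>>
0 my_metric value=42 OK
EOF
)
printf '%s\n' "$sizes" | sort -rn   # biggest section first
```

That makes it obvious which default sections are worth disabling on a bandwidth-constrained host.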
Edit: we did a packet capture and compared it to the output of check_mk_agent, and it is indeed the same, i.e. the agent is sending data I explicitly do not want, e.g. file system metrics. This is after I disabled those services in the dashboard, applied the changes, and restarted the Checkmk service.
Can those be removed?
Edit: the answer is yes! For future reference: any default check can be skipped by setting the corresponding MK_SKIP_* variable to true in a file named /etc/check_mk/exclude_sections.cfg.
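For future readers, a minimal sketch of that file; the section names below are examples only, so grep your check_mk_agent script for the MK_SKIP_* variables your agent version actually honors:

```shell
# /etc/check_mk/exclude_sections.cfg
# Example section names only -- check your agent script for the exact
# MK_SKIP_* variables it supports before relying on these.
MK_SKIP_DF=true
MK_SKIP_MOUNTS=true
MK_SKIP_DISKSTAT=true
```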
1.) If you have the enterprise edition you don't have to edit the agent; there is a rule for disabling sections.
2.) In 2.1 the agent output will be compressed. This should be a huge improvement, as the agent output is text only; 100KB could easily become 1KB.
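The 100KB-to-1KB claim is plausible because agent output is repetitive plain text. A quick sanity check with gzip on simulated agent-style output (not real agent data):

```shell
# Generate repetitive agent-style text and compare raw vs gzip'd size.
sample=$(for i in $(seq 1 2000); do echo "<<<local>>> metric_$i value=$i"; done)
raw=$(printf '%s' "$sample" | wc -c)
gz=$(printf '%s' "$sample" | gzip -c | wc -c)
echo "raw=${raw} bytes, gzipped=${gz} bytes"
```

Real agent output compresses somewhat less than this synthetic sample, but the ratio is still large because section headers and field names repeat on every run.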
The “disabled services” are indeed misleading: one would think they would not run, but they are only disabled in the UI. If you run a full scan you will still see their data; the services are just marked as disabled.