How does checkmk work in very bandwidth-constrained environments?

I’m evaluating monitoring software and checkmk looks pretty good. Our setup is a mix of public cloud we operate and on-prem servers we deploy to, but do not operate. The servers are very bandwidth constrained and their network isn’t reliable.

How does checkmk fare in these kinds of environments? Ideally I’d have:

  • Extremely small metrics and alerting payloads sent over the network
  • Buffered metrics collection that can be resumed when the network comes back
  • The ability to store and analyze metrics, and manage alerts, agents, etc. on the server AND ship some of those metrics to our cloud for analysis and collation. Is that possible?

Thanks in advance for any insights you may be able to share.

Hello,
checkmk should work for you.
Have a look at the manual to learn how the agent and checks work.
There are tons of options to control which checks run and how often they are executed.
If possible, use the Enterprise edition to get access to the Agent Bakery.
Ralf

Checkmk has distributed monitoring where a monitoring site (which collects metrics and state history) runs close to the monitored systems.

The API requests to the public cloud providers are also streamlined so they don't use too much data.

As mentioned by r.sander, use distributed monitoring and place the remote monitoring server close to the monitored nodes. This way you only transport limited information between your central monitoring server and the remote monitoring servers.
E.g. we have a remote monitoring server in each European Azure DC, and the central monitoring server is on-prem.

This should be possible with either the REST API or Livestatus queries, or a combination of both.
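To make the Livestatus side concrete, here is a minimal sketch of querying a site's Livestatus socket from Python. The query syntax (`GET`, `Columns:`, `OutputFormat:`, terminated by a blank line) is Livestatus's standard line protocol; the socket path below is site-specific and `mysite` is a placeholder, not a real site name:

```python
import socket

def build_livestatus_query(table, columns):
    """Build a Livestatus query in its line-based text protocol:
    a GET line, header lines, and a terminating blank line."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns), "OutputFormat: json", ""]
    return "\n".join(lines) + "\n"

def query_site(socket_path, query):
    """Send one query to a site's Livestatus Unix socket and return the raw response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(query.encode())
        s.shutdown(socket.SHUT_WR)  # signal end of request so Livestatus answers
        chunks = []
        while data := s.recv(4096):
            chunks.append(data)
    return b"".join(chunks).decode()

# Example (socket path is site-specific; "mysite" is a placeholder):
# print(query_site("/omd/sites/mysite/tmp/run/live",
#                  build_livestatus_query("hosts", ["name", "state"])))
```

The same host/service data is also reachable via the REST API over HTTPS, which may be easier to route between sites than a raw Livestatus socket.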


Much has been said already; the Livestatus Proxy is something you should take a look at (Checkmk Enterprise only, though): Distributed monitoring - Scaling and distributing Checkmk

Oh, and welcome to the forum!

Thank you all for your answers. We’re going to do a PoC of checkmk raw and see if it works for us.

Best of luck with the PoC!

Keep in mind that some of the things mentioned in the linked article (e.g. Livestatus Proxy, InfluxDB integration) are Enterprise only.

Hi,
There are upcoming features of Checkmk that allow the agent to push its data. Today Checkmk pulls from the client, which is not ideal in your setup, as you have to expose Checkmk to the public internet (unless you have VPN/MPLS/etc.).

There might also be a proxy server that can deal with this.

Having a separate site (distributed monitoring) is not ideal either, as you have to expose ports once again (this time more of them), and Livestatus works in real time. So if you want to see metrics from your remote on-prem sites, that information will be retrieved over your slow WAN connection and will take any bandwidth that's available. If the site goes down, you will not get any metrics at all.

Thanks all for the information.

Follow-up question. I have a PoC (Enterprise) running on a machine. With about 20 metrics, a single run consumes ~120 KB to send to the dashboard host. With a single metric, a run still consumes ~110 KB. I didn't set up any encryption, so there seems to be ~100 KB of overhead.

When I run check_mk_agent manually, the output to stdout is pretty large, even with just one local metric. I don’t know if that’s related to the 100KB overhead on the network.

Is there a way to reduce that overhead so a run only sends a tiny payload? My deployment environment is extremely bandwidth constrained and every byte counts.

Edit: we did a packet capture and compared it to the output of check_mk_agent, and it is indeed the same, i.e. the agent is sending data I explicitly do not want, e.g. file system metrics. This is after I disabled those services in the dashboard, applied changes, and restarted the Checkmk service.

Can those be removed?

Edit: the answer is yes! For future reference: any default check can be skipped by setting its MK_SKIP_* variable to true in a file named /etc/check_mk/exclude_sections.cfg.

An example is here:
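For future readers, a sketch of what such a file might look like. The mechanism (MK_SKIP_* variables sourced from /etc/check_mk/exclude_sections.cfg) is as described above, but the specific section names below are examples; verify them against the variable names your agent script actually checks:

```shell
# /etc/check_mk/exclude_sections.cfg
# Sourced by the Linux agent; MK_SKIP_<SECTION>=true skips that section.
# Section names below are illustrative -- check your agent script for the
# exact variables it honors.
MK_SKIP_DF=true        # skip file system usage output (<<<df>>>)
MK_SKIP_MOUNTS=true    # skip the mount table section
MK_SKIP_POSTFIX=true   # skip the postfix mail queue section
```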


You could write your own agent script that outputs just the sections you really need and may even have a smaller footprint.
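As a sketch of that idea: the agent protocol is plain text, where each section starts with a `<<<name>>>` header. A hand-rolled agent that emits only what you need could be as small as this (the `local` section format shown — status, item name, metrics, text — is the standard Checkmk local-check format; the metric itself is a made-up example):

```shell
#!/bin/sh
# Minimal stand-in for check_mk_agent: emit only the sections we need.
# Each section begins with a <<<name>>> header line.
emit_agent_output() {
    echo "<<<check_mk>>>"
    echo "Version: custom-0.1"
    echo "AgentOS: linux"

    # A "local" check: STATUS ITEM METRICS TEXT per line (0 = OK).
    echo "<<<local>>>"
    echo "0 my_app_queue queue_len=3 queue length OK"
}

emit_agent_output
```

Point xinetd/systemd (or the SSH transport) at this script instead of the stock agent and each run ships only these few lines.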

1.) If you have Enterprise you don't have to edit the agent; there is a rule for disabling sections.
2.) In 2.1 the agent output will be compressed. This should be a huge bandwidth win, as the agent output is only text based; 100 KB could easily become 1 KB.

The "disabled services" setting is indeed misleading: one would think those checks would not run, but they are only marked as disabled. If you run a full scan, you will still see their data.

