High Availability on AWS

Has anyone run Checkmk in a high-availability deployment on AWS? How do you manage high availability in the cloud?

The Cloud Edition only supports a single EC2 instance. This does not meet our high availability requirement and will incur downtime. Hence, we are now deploying the Raw Edition on EKS, with EFS for the persistent volume.

Does AWS not provide the means you need, without having to set something up yourself?
I am thinking of something like what on-premises hypervisors do. They have become very good at remediating hardware failures. I am sure a cloud service has to offer similar mechanisms, right?

Apart from that, you of course can run any Checkmk edition in EKS.

What exactly do you mean by HA? Checkmk is not an HA-aware application; even with the appliance you only get very rudimentary “HA”-like support.

Our goal is to ensure continuous operation without any downtime and the ability to scale as needed. However, the Cloud Edition only runs on a single EC2 instance. If CPU, memory, or another resource is exhausted, the instance will go down.

We are trying to deploy the container version on EKS and use EFS as the persistent volume.
How can we set up Checkmk on EC2 so that it scales horizontally?

If you use EFS as the PV, then put only /var there and the rest on EBS. NFS is not ideal for a high-performance application with a lot of small file reads and writes.
Otherwise, use EBS and the snapshot mechanisms, which should be good enough.

By the way, you will only need horizontal scaling (use Checkmk's distributed monitoring for that) if you are monitoring 10k+ servers or have a geographically very distributed environment.
If you have everything you are monitoring in AWS, then I wonder if you really need horizontal scaling.
You could also just put Checkmk on a big enough machine then and work with frequent snapshots for recovery. That’s probably cheaper than trying to build a lot of stuff around it.
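To make the "frequent snapshots" idea concrete, here is a minimal sketch of the scheduling logic only: deciding when a new snapshot is due and which old ones to drop. The function names, the 4-hour interval, and the retention count are assumptions for illustration, not Checkmk or AWS settings. Actually creating and deleting the EBS snapshots would go through the AWS API or a lifecycle policy, which is left out here.

```python
from datetime import datetime, timedelta, timezone


def snapshot_due(latest, now, interval):
    """Return True if there is no snapshot yet or the newest one is at least `interval` old."""
    return latest is None or now - latest >= interval


def snapshots_to_prune(timestamps, keep):
    """Return the snapshot timestamps that fall outside the newest `keep` entries."""
    ordered = sorted(timestamps, reverse=True)  # newest first
    return ordered[keep:]


if __name__ == "__main__":
    # Hypothetical values: snapshot every 4 hours, keep the 3 newest.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    existing = [now - timedelta(hours=h) for h in (1, 5, 9, 13)]
    print("snapshot due:", snapshot_due(max(existing), now, timedelta(hours=4)))
    print("to prune:", snapshots_to_prune(existing, keep=3))
```

The interval is effectively your recovery point: with a 4-hour interval you can lose up to 4 hours of monitoring history after a restore, so pick it based on how much loss is acceptable.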

Also, if your entire infrastructure is running in EKS, then you might not want to run Checkmk there as well. If you manage to break your EKS, your monitoring won’t be running either.

Thanks for replying. Vertical scaling is not ideal when choosing a tool for our use case. Relying on a single EC2 instance with a bigger spec will eventually run into issues (e.g. underlying hardware issues on the AWS side that require a stop and start), and you can’t do this ad infinitum.

We use it for our enterprise clients and have to meet the following criteria; otherwise it would be tough to recommend a specific tool.

  1. Minimum to no downtime
  2. Able to scale horizontally
  3. High Availability Architecture

Checkmk was not built for really large enterprises and certainly not for distributed filesystems. As an example, we found that SSD drives are too slow for us; we were forced to move to PCIe-based NVMe drives because of the constant reads/writes to the filesystem (.mk files etc.).

Without knowing the size of your environment, it’s impossible to recommend a strategy for any kind of HA. I’d reach out to a Checkmk partner to discuss.
