Per-process CPU monitoring

Hmm. If there’s a process that’s maxing resources (we’ll assume CPU) and you’re unable to ssh to an affected instance, then your ability to monitor that instance is probably impaired too: a system that no longer responds to ssh is unlikely to yield much, if any, monitoring information either.

Also, in my experience high CPU alone is rarely fatal. Memory exhaustion is the more likely culprit.

So IMHO your best bet to start with is to either learn the dark arts of tuning the kernel OOM killer, or run a userspace OOM-killer daemon like earlyoom or oomd.
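To give a flavour of the kernel-tuning route: each process exposes an `oom_score` (how likely the kernel is to kill it) and a writable `oom_score_adj` in `/proc`. A minimal sketch, reading the current shell’s values; the PID 1234 in the comment is hypothetical:

```shell
# Read this shell's OOM score and its adjustment (range -1000..1000).
score=$(cat /proc/$$/oom_score)
adj=$(cat /proc/$$/oom_score_adj)
echo "oom_score=$score oom_score_adj=$adj"

# To make a specific process a *preferred* OOM victim, raise its adjustment
# (1234 is a hypothetical PID; writing usually needs root or process ownership):
#   echo 500 > /proc/1234/oom_score_adj
```

Negative adjustments work the other way: `-1000` effectively exempts a process from the OOM killer, which is useful for protecting sshd so you can still get in.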

If problems persist, then something like monit, which can restart a process when it exceeds resource thresholds, is usually the answer.
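As a rough sketch of what that looks like, here is a monitrc fragment for a hypothetical service “myapp” (the pidfile path and thresholds are placeholders, not a recommendation):

```
# /etc/monit/monitrc fragment -- hypothetical service "myapp"
check process myapp with pidfile /var/run/myapp.pid
  start program = "/bin/systemctl start myapp"
  stop program  = "/bin/systemctl stop myapp"
  if cpu > 90% for 5 cycles then restart
  if totalmem > 1 GB for 3 cycles then restart
```

The “for N cycles” part matters: it stops monit from restarting a service on a brief, harmless spike.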

What you then want to look for is some kind of logging that indicates what was killed or restarted, and when. Gather that information however you can, collate it, and see if patterns emerge.
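For kernel OOM kills specifically, the kernel log is the place to look, and a one-liner gets you a first collation. A sketch, using sample log lines in a here-doc; in production you’d feed it `journalctl -k` or `dmesg` output instead:

```shell
# Tally OOM-killer victims by process name from a kernel log.
# Sample lines stand in for real `journalctl -k` / `dmesg` output.
cat <<'EOF' > /tmp/kern.log
Jan 10 03:12:01 host kernel: Out of memory: Killed process 4321 (java) total-vm:...
Jan 11 04:40:22 host kernel: Out of memory: Killed process 5678 (java) total-vm:...
Jan 12 02:05:09 host kernel: Out of memory: Killed process 9012 (node) total-vm:...
EOF

grep -i 'killed process' /tmp/kern.log |
  sed -E 's/.*\((.*)\).*/\1/' |   # pull the process name out of the parentheses
  sort | uniq -c | sort -rn       # count kills per process, most frequent first
```

Keeping the timestamps alongside the counts (e.g. all at 03:00-ish) is often what reveals the pattern, such as a nightly cron job blowing the memory budget.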

Per-process metrics and graphs etc. are cool and all, and checkmk could do better there, but I don’t think they’re the best way to diagnose your issue.