[Help Needed] Troubleshooting High Memory Usage and Disk I/O Issues on Ubuntu Server CheckMK Raw 2.3

CMK version:
OMD - Open Monitoring Distribution Version 2.3.0p9.cre

OS version:
Ubuntu 22.04.2 LTS

Error message:
Service STALE status

Resources
8 vCPU cores, 16 GB RAM, 250 GB SSD

Hosts and services monitoring
200 hosts and 2,300 services, with checkmk_agent installed

Hi everyone,

I’m experiencing significant issues with high memory usage and disk I/O on my Ubuntu server. Here are the details and steps I’ve taken to diagnose the problem:

Problem Description

  • System: Ubuntu Server
  • Issue: The system frequently runs out of memory, triggering the OOM-Killer. Additionally, there’s high disk I/O which appears to be related to the memory issues.
  • Impact: Critical processes are being killed, and system performance is severely affected.

Steps Taken to Diagnose the Problem

  1. Collected Disk I/O Data:
  • Using sysstat, I collected disk I/O data and stored it at /var/log/sysstat/.
  2. Analyzed Data with pidstat:
  • Ran pidstat to identify processes with high disk I/O (the commands are sketched after the output below).
  • Example output:
10:47:01 PM   998      9408    341.35     27.76      0.00       0  rrdcached
10:47:01 PM     0       503      0.00     20.57      0.00       0  jbd2/dm-0-8
10:47:01 PM   107       904    454.12     19.50      0.00       0  rsyslogd
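For context, a rough sketch of how this data can be collected on Ubuntu, assuming the stock sysstat packaging; if the output above comes from pidstat -d, the numeric columns are kB_rd/s, kB_wr/s, kB_ccwr/s and iodelay:

# Enable the periodic sysstat collector (writes sa files to /var/log/sysstat/)
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Per-process disk I/O: one sample every 60 seconds, 10 samples
pidstat -d 60 10

# Device-level I/O history for today from the collected sa file
sar -d -f /var/log/sysstat/sa$(date +%d)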

  3. Checked Kernel Logs:

  • Used journalctl -k to examine kernel logs for memory issues.
  • Found multiple instances of the OOM-Killer being triggered:
kthreadd invoked oom-killer: gfp_mask=0x102dc2(GFP_HIGHUSER|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
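Roughly how I pulled these entries out of the kernel log (a sketch; the grep filter is just one way to narrow the output down):

# Kernel ring buffer messages from the current boot
journalctl -k -b

# Only OOM-related lines over the last few days
journalctl -k --since "3 days ago" | grep -iE "oom-killer|out of memory"

# Alternative with human-readable timestamps
dmesg -T | grep -i oom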

  4. Identified High Memory Usage Processes:

  • Created a script to log the processes using the most memory (sketched below).
  • Set up a cron job to run the script every 5 minutes and append the results to a log file.
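For completeness, a minimal sketch of that script; the path, log file, and the 10-process cutoff are illustrative:

#!/bin/bash
# /usr/local/bin/log_top_mem.sh (illustrative path)
# Append a timestamped list of the 10 most memory-hungry processes to a log file.
LOGFILE=/var/log/top_mem_usage.log
{
  echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
  ps -eo pid,user,%mem,rss,comm --sort=-%mem | head -n 11
} >> "$LOGFILE"

And the matching cron entry (as a file in /etc/cron.d, running as root):

*/5 * * * * root /usr/local/bin/log_top_mem.sh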

Kernel Log Analysis

From the kernel log analysis, I found the following entries during a critical period:

[321344.087568] kthreadd invoked oom-killer: gfp_mask=0x102dc2(GFP_HIGHUSER|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
[321344.087569] CPU: 0 PID: 2 Comm: kthreadd Tainted: P           OE     5.4.0-80-generic #90-Ubuntu
[321344.087570] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
...
[321344.087575] Out of memory: Kill process 11922 (python3) score 1183 or sacrifice child
[321344.087577] Killed process 11922 (python3) total-vm:123456kB, anon-rss:12345kB, file-rss:67890kB, shmem-rss:0kB

Observations

  1. OOM-Killer Activity:
  • The OOM-Killer has been activated multiple times, indicating severe memory pressure.
  • Critical processes like python3 and nagios are being killed.
  2. Disk I/O Correlation:
  • High disk I/O is observed during periods of high memory usage (see the sar sketch after this list).
  • Processes like systemd-journal, rrdcached, and npcd show significant I/O activity.
  3. Log Analysis:
  • Frequent log entries from systemd-journal during memory pressure periods, suggesting that logging overhead contributes to the I/O load.
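One way to correlate memory pressure and disk I/O from the already-collected sysstat data (a sketch; the file name pattern assumes the /var/log/sysstat layout shown above):

SAFILE=/var/log/sysstat/sa$(date +%d)
sar -r -f "$SAFILE"   # memory utilisation (kbmemused, %memused, kbcommit)
sar -B -f "$SAFILE"   # paging activity (pgpgin/s, pgpgout/s, majflt/s)
sar -d -f "$SAFILE"   # per-device I/O over the same intervals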

Actions Taken

  • Increased swap space to mitigate immediate memory pressure.
  • Optimized systemd-journald configuration to reduce logging overhead:
[Journal]
SystemMaxUse=500M
SystemMaxFileSize=100M
SyncIntervalSec=5m
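For reference, a sketch of how these two changes can be applied; the 8G swap size and file paths are illustrative:

# Add a swap file (size chosen for illustration)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Place the journald limits in a drop-in and restart the daemon
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/limits.conf >/dev/null <<'EOF'
[Journal]
SystemMaxUse=500M
SystemMaxFileSize=100M
SyncIntervalSec=5m
EOF
sudo systemctl restart systemd-journald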

Request for Help

I’m seeking advice on further steps to diagnose and resolve these memory and I/O issues. Specifically, I would appreciate guidance on:

  1. Identifying Root Causes:
  • Best practices for identifying processes or configurations that lead to high memory usage.
  • Tools or methods for deeper analysis of memory and I/O usage.
  2. Optimization Tips:
  • How to optimize services like nagios, rrdcached, and npcd for better memory management.
  • Recommendations for logging configurations to minimize I/O impact.
  3. Preventive Measures:
  • Strategies for preventing OOM-Killer activations.
  • Effective monitoring solutions to alert before reaching critical memory thresholds.

Any insights or recommendations would be greatly appreciated. Thank you!