[Help Needed] Troubleshooting High Memory Usage and Disk I/O Issues on Ubuntu Server CheckMK Raw 2.3

CMK version:
OMD - Open Monitoring Distribution Version 2.3.0p9.cre

OS version:
Ubuntu 22.04.2 LTS

Error message:
Service STALE status

Resources
8 vCPU cores, 16 GB RAM, 250 GB SSD

Hosts and services monitoring
200 hosts and 2,300 services, with checkmk_agent installed

Hi everyone,

I’m experiencing significant issues with high memory usage and disk I/O on my Ubuntu server. Here are the details and steps I’ve taken to diagnose the problem:

Problem Description

  • System: Ubuntu Server
  • Issue: The system frequently runs out of memory, triggering the OOM-Killer. Additionally, there’s high disk I/O which appears to be related to the memory issues.
  • Impact: Critical processes are being killed, and system performance is severely affected.

Steps Taken to Diagnose the Problem

  1. Collected Disk I/O Data:
  • Using sysstat, I collected disk I/O data and stored it at /var/log/sysstat/.
  2. Analyzed Data with pidstat:
  • Ran pidstat to identify processes with high disk I/O (the commands are sketched after the output below).
  • Example output:
10:47:01 PM   998      9408    341.35     27.76      0.00       0  rrdcached
10:47:01 PM     0       503      0.00     20.57      0.00       0  jbd2/dm-0-8
10:47:01 PM   107       904    454.12     19.50      0.00       0  rsyslogd
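For context, a rough sketch of how this data can be collected on Ubuntu, assuming the stock sysstat packaging; if the output above comes from pidstat -d, the numeric columns are kB_rd/s, kB_wr/s, kB_ccwr/s and iodelay:

# Enable the periodic sysstat collector (writes sa files to /var/log/sysstat/)
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Per-process disk I/O: one sample every 60 seconds, 10 samples
pidstat -d 60 10

# Device-level I/O history for today from the collected sa file
sar -d -f /var/log/sysstat/sa$(date +%d)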

  3. Checked Kernel Logs:

  • Used journalctl -k to examine kernel logs for memory issues.
  • Found multiple instances of the OOM-Killer being triggered:
kthreadd invoked oom-killer: gfp_mask=0x102dc2(GFP_HIGHUSER|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
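Roughly how I pulled these entries out of the kernel log (a sketch; the grep filter is just one way to narrow the output down):

# Kernel ring buffer messages from the current boot
journalctl -k -b

# Only OOM-related lines over the last few days
journalctl -k --since "3 days ago" | grep -iE "oom-killer|out of memory"

# Alternative with human-readable timestamps
dmesg -T | grep -i oom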

  4. Identified High Memory Usage Processes:

  • Created a script to log the processes using the most memory (sketched below).
  • Set up a cron job to run the script every 5 minutes and append the results to a log file.
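For completeness, a minimal sketch of that script; the path, log file, and the 10-process cutoff are illustrative:

#!/bin/bash
# /usr/local/bin/log_top_mem.sh (illustrative path)
# Append a timestamped list of the 10 most memory-hungry processes to a log file.
LOGFILE=/var/log/top_mem_usage.log
{
  echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
  ps -eo pid,user,%mem,rss,comm --sort=-%mem | head -n 11
} >> "$LOGFILE"

And the matching cron entry (as a file in /etc/cron.d, running as root):

*/5 * * * * root /usr/local/bin/log_top_mem.sh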

Kernel Log Analysis

From the kernel log analysis, I found the following entries during a critical period:

[321344.087568] kthreadd invoked oom-killer: gfp_mask=0x102dc2(GFP_HIGHUSER|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
[321344.087569] CPU: 0 PID: 2 Comm: kthreadd Tainted: P           OE     5.4.0-80-generic #90-Ubuntu
[321344.087570] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
...
[321344.087575] Out of memory: Kill process 11922 (python3) score 1183 or sacrifice child
[321344.087577] Killed process 11922 (python3) total-vm:123456kB, anon-rss:12345kB, file-rss:67890kB, shmem-rss:0kB

Observations

  1. OOM-Killer Activity:
  • The OOM-Killer has been activated multiple times, indicating severe memory pressure.
  • Critical processes like python3 and nagios are being killed.
  2. Disk I/O Correlation:
  • High disk I/O is observed during periods of high memory usage (see the sar sketch after this list).
  • Processes like systemd-journal, rrdcached, and npcd show significant I/O activity.
  3. Log Analysis:
  • Frequent log entries from systemd-journal during memory pressure periods, suggesting that logging overhead contributes to the I/O load.
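One way to correlate memory pressure and disk I/O from the already-collected sysstat data (a sketch; the file name pattern assumes the /var/log/sysstat layout shown above):

SAFILE=/var/log/sysstat/sa$(date +%d)
sar -r -f "$SAFILE"   # memory utilisation (kbmemused, %memused, kbcommit)
sar -B -f "$SAFILE"   # paging activity (pgpgin/s, pgpgout/s, majflt/s)
sar -d -f "$SAFILE"   # per-device I/O over the same intervals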

Actions Taken

  • Increased swap space to mitigate immediate memory pressure.
  • Optimized systemd-journald configuration to reduce logging overhead:
[Journal]
SystemMaxUse=500M
SystemMaxFileSize=100M
SyncIntervalSec=5m
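For reference, a sketch of how these two changes can be applied; the 8G swap size and file paths are illustrative:

# Add a swap file (size chosen for illustration)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Place the journald limits in a drop-in and restart the daemon
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/limits.conf >/dev/null <<'EOF'
[Journal]
SystemMaxUse=500M
SystemMaxFileSize=100M
SyncIntervalSec=5m
EOF
sudo systemctl restart systemd-journald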

Request for Help

I’m seeking advice on further steps to diagnose and resolve these memory and I/O issues. Specifically, I would appreciate guidance on:

  1. Identifying Root Causes:
  • Best practices for identifying processes or configurations that lead to high memory usage.
  • Tools or methods for deeper analysis of memory and I/O usage.
  2. Optimization Tips:
  • How to optimize services like nagios, rrdcached, and npcd for better memory management.
  • Recommendations for logging configurations to minimize I/O impact.
  3. Preventive Measures:
  • Strategies for preventing OOM-Killer activations.
  • Effective monitoring solutions to alert before reaching critical memory thresholds.

Any insights or recommendations would be greatly appreciated. Thank you!