CMK version:
OMD - Open Monitoring Distribution Version 2.3.0p9.cre
OS version:
Ubuntu 22.04.2 LTS
Error message:
Service STALE status
Resources
8 cores vCPU, 16GB RAM, 250GB SSD
Hosts and services monitoring
200 hosts and 2300 services, with the Checkmk agent installed
Hi everyone,
I’m experiencing significant issues with high memory usage and disk I/O on my Ubuntu server. Here are the details and steps I’ve taken to diagnose the problem:
Problem Description
- System: Ubuntu Server
- Issue: The system frequently runs out of memory, triggering the OOM-Killer. Additionally, there is high disk I/O, which appears to be related to the memory issues.
- Impact: Critical processes are being killed, and system performance is severely affected.
Steps Taken to Diagnose the Problem
1. Collected Disk I/O Data:
   - Using sysstat, I collected disk I/O data and stored it at /var/log/sysstat/.
2. Analyzed Data with pidstat:
   - Ran pidstat to identify processes with high disk I/O.
   - Example output:
     10:47:01 PM 998 9408 341.35 27.76 0.00 0 rrdcached
     10:47:01 PM 0 503 0.00 20.57 0.00 0 jbd2/dm-0-8
     10:47:01 PM 107 904 454.12 19.50 0.00 0 rsyslogd
3. Checked Kernel Logs:
   - Used journalctl -k to examine the kernel logs for memory issues.
   - Found multiple instances of the OOM-Killer being triggered:
     kthreadd invoked oom-killer: gfp_mask=0x102dc2(GFP_HIGHUSER|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
4. Identified High Memory Usage Processes:
   - Created a script to log the processes using the most memory.
   - Set up a cron job to run the script every 5 minutes and append the results to a log file (a sketch of the commands and the script is included right after this list).
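For reference, this is roughly how the data in steps 1, 2 and 4 was collected; the sampling intervals, the script path and the log file names are simplified here, so take it as a sketch of the approach rather than my exact setup:

sar -d -f /var/log/sysstat/sa$(date +%d)   # step 1: per-device I/O from today's sysstat archive
pidstat -d 60 5                            # step 2: per-process kB_rd/s and kB_wr/s (source of the example output above)

The memory logger from step 4 is a small shell script plus a cron entry along these lines:

#!/bin/bash
# /usr/local/bin/log_top_mem.sh -- append a timestamped snapshot of the
# 15 processes using the most resident memory
{
  date '+%Y-%m-%d %H:%M:%S'
  ps -eo pid,user,rss,%mem,cmd --sort=-rss | head -n 16
  echo
} >> /var/log/top_mem.log

# /etc/cron.d/top-mem -- run the logger every 5 minutes
*/5 * * * * root /usr/local/bin/log_top_mem.sh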
Kernel Log Analysis
From the kernel log analysis, I found the following entries during a critical period:
[321344.087568] kthreadd invoked oom-killer: gfp_mask=0x102dc2(GFP_HIGHUSER|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
[321344.087569] CPU: 0 PID: 2 Comm: kthreadd Tainted: P OE 5.4.0-80-generic #90-Ubuntu
[321344.087570] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
...
[321344.087575] Out of memory: Kill process 11922 (python3) score 1183 or sacrifice child
[321344.087577] Killed process 11922 (python3) total-vm:123456kB, anon-rss:12345kB, file-rss:67890kB, shmem-rss:0kB
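In case it helps, the OOM entries above can be pulled out of the kernel journal with something along these lines (the look-back window is arbitrary):

# kernel messages from the last two days, with context around each OOM event
journalctl -k --since "2 days ago" | grep -i -B 2 -A 10 'invoked oom-killer'
# or just the kill decisions
journalctl -k | grep -iE 'out of memory|killed process'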
Observations
- OOM-Killer Activity:
  - The OOM-Killer has been activated multiple times, indicating severe memory pressure.
  - Critical processes like python3 and nagios are being killed.
- Disk I/O Correlation:
  - High disk I/O is observed during periods of high memory usage (see the quick correlation check after this list).
  - Processes like systemd-journal, rrdcached, and npcd show significant I/O activity.
- Log Analysis:
  - Frequent log entries from systemd-journal during memory pressure periods, suggesting that logging overhead contributes to the I/O load.
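For what it is worth, the correlation was easiest to see live with standard tools rather than from the logs; something like this during a spike (iotop comes from the iotop package on Ubuntu, and the sample counts are arbitrary):

vmstat 5                       # si/so (swap in/out) next to bi/bo (block I/O), one line every 5 s
sudo iotop -obt -d 5 -n 12     # only processes currently doing I/O, timestamped, for about a minute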
Actions Taken
- Increased swap space to mitigate immediate memory pressure.
- Optimized systemd-journald configuration to reduce logging overhead:
[Journal]
SystemMaxUse=500M
SystemMaxFileSize=100M
SyncIntervalSec=5m
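For completeness, this is approximately how the two changes were applied; the swap file size and the drop-in file name are my own choices and may need adjusting:

# one-off 4 GiB swap file, made permanent via /etc/fstab
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# journald limits applied as a drop-in instead of editing /etc/systemd/journald.conf directly
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/limits.conf > /dev/null <<'EOF'
[Journal]
SystemMaxUse=500M
SystemMaxFileSize=100M
SyncIntervalSec=5m
EOF
sudo systemctl restart systemd-journald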
Request for Help
I’m seeking advice on further steps to diagnose and resolve these memory and I/O issues. Specifically, I would appreciate guidance on:
- Identifying Root Causes:
- Best practices for identifying processes or configurations that lead to high memory usage.
- Tools or methods for deeper analysis of memory and I/O usage.
- Optimization Tips:
- How to optimize services like nagios, rrdcached, and npcd for better memory management.
- Recommendations for logging configurations to minimize I/O impact.
- Preventive Measures:
- Strategies for preventing OOM-Killer activations.
- Effective monitoring solutions to alert before reaching critical memory thresholds.
Any insights or recommendations would be greatly appreciated. Thank you!