Recurring Python tasks causing high CPU Load

CMK version:
2.0.0p20
OS version:
SLES 12.5

Recently the CPU load (15-minute average) on my CMK host is up significantly. Watching top on the console I see a large number of python3 tasks popping up every minute. It looks like this:

top - 19:26:08 up 23 days,  9:16,  1 user,  load average: 11.12, 11.52, 10.32
Tasks: 297 total,  45 running, 252 sleeping,   0 stopped,   0 zombie
%Cpu(s): 92.4 us,  7.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   4031712 total,  3462604 used,   569108 free,   198388 buffers
KiB Swap:  2103292 total,   104704 used,  1998588 free.  1859820 cached Mem


 4010 monitor   20   0   46184  22452   8404 R 2.658 0.557   0:00.13 python3    
 4086 monitor   20   0   40440  18372   7740 R 2.658 0.456   0:00.08 python3    
 4003 monitor   20   0   46160  22344   8404 R 2.326 0.554   0:00.13 python3    
 4006 monitor   20   0   46184  22452   8400 R 2.326 0.557   0:00.12 python3    
 4008 monitor   20   0   46160  22516   8576 R 2.326 0.558   0:00.12 python3    
 4012 monitor   20   0   46160  22580   8640 R 2.326 0.560   0:00.12 python3    
 4013 monitor   20   0   46160  22536   8600 R 2.326 0.559   0:00.12 python3    
 4014 monitor   20   0   46156  22640   8700 R 2.326 0.562   0:00.12 python3    
 4018 monitor   20   0   46160  22548   8612 R 2.326 0.559   0:00.12 python3    
 4031 monitor   20   0   46216  22648   8712 R 2.326 0.562   0:00.12 python3    
 4032 monitor   20   0   45880  22472   8568 R 2.326 0.557   0:00.12 python3    
 4033 monitor   20   0   46160  22384   8452 R 2.326 0.555   0:00.13 python3    
 4034 monitor   20   0   45884  22696   8792 R 2.326 0.563   0:00.12 python3    
 4035 monitor   20   0   46160  22388   8452 R 2.326 0.555   0:00.12 python3    
 4036 monitor   20   0   45864  22456   8600 R 2.326 0.557   0:00.12 python3    
 4037 monitor   20   0   45884  22520   8616 R 2.326 0.559   0:00.12 python3    
 4038 monitor   20   0   45884  22512   8608 R 2.326 0.558   0:00.12 python3

Looking at the CPU load graph I see that this started two days ago and jumped significantly yesterday evening. I am not aware of any changes to the system that might have caused this.

I was running CMK 1.6 on this server until a few weeks ago. The CPU load was typically under 0.5. It has been slightly higher since upgrading to CMK 2.0, but still well under 1.0.

I would be grateful for pointers on finding the root cause of this behavior.

Hi,
please press “c” while running top. It will reveal the command line of those python3 processes.
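If pressing the key interactively is inconvenient, the same view can be started directly. A minimal sketch, assuming the stock procps top that ships with SLES:

# Show full command lines from the start (same as pressing "c")
top -c

# Or grab a one-shot batch snapshot that is easy to paste into a post
top -c -b -n 1 | head -40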

Hi,

Thanks for your response. The python processes all look like this:

/omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-JTr

I have 88 files in that directory, two for each host. Do these define SNMP checks?

Hi, these files define not only SNMP hosts but all hosts. It doesn’t really matter, though, as two files per host are totally expected.
Can you post a screenshot of your top output after pressing “c” (show command lines) and “1” (one line per core with CPU details)?
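Side note: the two-files-per-host expectation is easy to sanity-check from the shell. A quick sketch, assuming the site is called “monitor” as in the path above and that the commands run as the site user:

# 88 files in the helper config directory ...
ls /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks | wc -l

# ... should be twice the number of monitored hosts (44)
cmk --list-hosts | wc -l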

Good morning!
I hope this reveals more:

top - 07:50:14 up 1 day, 12:22,  1 user,  load average: 5.13, 4.46, 4.35
Tasks: 256 total,  23 running, 232 sleeping,   0 stopped,   1 zombie
%Cpu0  : 68.4 us, 30.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
KiB Mem:   4031712 total,  3701692 used,   330020 free,   157772 buffers
KiB Swap:  2103292 total,    45824 used,  2057468 free.  2602144 cached Mem


20058 monitor   20   0   51404  27724   9716 R 4.934 0.688   0:00.20 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-MH-Pforte                                                                                                                                                             
20043 monitor   20   0   55952  30328   9728 R 4.605 0.752   0:00.24 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Benjamin                                                                                                                                                                  
20033 monitor   20   0  139580  37988  10952 S 3.947 0.942   0:00.38 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Arbel                                                                                                                                                                     
20041 monitor   20   0   57180  31560   9724 R 3.947 0.783   0:00.30 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/T-SBC                                                                                                                                                                     
20173 monitor   20   0   45584  22064   8552 R 3.947 0.547   0:00.12 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/USV-2                                                                                                                                                                     
20038 monitor   20   0  139584  37832  10792 S 3.289 0.938   0:00.37 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Golan                                                                                                                                                                     
20013 monitor   20   0   60132  34836   9808 R 2.632 0.864   0:00.35 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Yaffo                                                                                                                                                                     
20015 monitor   20   0   56804  31432   9568 R 2.632 0.780   0:00.34 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-JTr                                                                                                                                                                   
20016 monitor   20   0   56668  31456   9568 R 2.632 0.780   0:00.34 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-JRK                                                                                                                                                                   
20034 monitor   20   0   60428  35188   9804 R 2.632 0.873   0:00.35 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Haifa                                                                                                                                                                     
    1 root      20   0   42160   5664   4268 S 2.303 0.140   2:38.97 /usr/lib/systemd/systemd --switched-root --system --deserialize 23                                                                                                                                                                                                             
20011 monitor   20   0   57064  31708   9740 R 2.303 0.786   0:00.33 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-HGG                                                                                                                                                                   
20012 monitor   20   0   57624  32180   9740 R 2.303 0.798   0:00.33 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-JFre                                                                                                                                                                  
20031 monitor   20   0   57512  32068   9876 S 1.974 0.795   0:00.32 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Asaph                                                                                                                                                                     
20030 monitor   20   0   57112  31820   9876 S 1.645 0.789   0:00.30 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-JFri-2                                                                                                                                                                
20352 monitor   20   0   33824  12996   5664 R 1.645 0.322   0:00.05 python3 /omd/sites/monitor/share/check_mk/agents/special/agent_vsphere -u vmonitor -s=6I%=B5D0&7jf3v!+CXgZ -i hostsystem,datastore,counters --direct --hostname Golan -P --spaces cut --no-cert-check xxx.xxx.xxx.xxx                                                                 
20029 monitor   20   0   57200  31776   9824 S 1.316 0.788   0:00.30 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-JWh                                                                                                                                                                   
20032 monitor   20   0   57064  31688   9824 S 1.316 0.786   0:00.29 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/zSW-MH                                                                                                                                                                    
20357 monitor   20   0   33424  12844   5756 R 1.316 0.319   0:00.04 python3 /omd/sites/monitor/share/check_mk/agents/special/agent_vsphere -u vmonitor -s=6I%=B5D0&7jf3v!+CXgZ -i hostsystem,datastore,counters --direct --hostname Arbel -P --spaces cut --no-cert-check xxx.xxx.xxx.xxx                                                                 
    7 root      20   0       0      0      0 R 0.329 0.000   1:05.35 [ksoftirqd/0]                                                                                                                                                                                                                                                                  
    8 root      20   0       0      0      0 R 0.329 0.000   1:38.25 [rcu_sched]                                                                                                                                                                                                                                                                    
 1289 root      20   0    8776   1688   1628 S 0.329 0.042   0:00.28 /usr/sbin/xinetd -stayalive -dontfork                                                                                                                                                                                                                                          
 1601 monitor   20   0   64264   9984    880 S 0.329 0.248   5:26.79 /omd/sites/monitor/bin/redis-server *:0                                                                                                                                                                                                                                        
20039 monitor   20   0   58432  33024   9948 S 0.329 0.819   0:00.28 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Ephraim                                                                                                                                                                   
20409 monitor   20   0   22084   7960   4624 R 0.329 0.197   0:00.01 /omd/sites/monitor/bin/python3 /omd/sites/monitor/var/check_mk/core/helper_config/latest/host_checks/Eran                                                                                                                                                                      
32668 monitor   20   0  697304  12268   5268 S 0.329 0.304   1:29.15 /omd/sites/monitor/bin/nagios -ud /omd/sites/monitor/tmp/nagios/nagios.cfg                                                                                                                                                                                                     
    2 root      20   0       0      0      0 S 0.000 0.000   0:00.00 [kthreadd]                                                                                                                                                                                                                                                                     
    4 root       0 -20       0      0      0 S 0.000 0.000   0:00.00 [kworker/0:0H]                                                                                                                                                                                                                                                                 
    6 root       0 -20       0      0      0 S 0.000 0.000   0:00.00 [mm_percpu_wq]                                                                                                                                                                                                                                                                 
    9 root      20   0       0      0      0 S 0.000 0.000   0:00.00 [rcu_bh]                                                                                                                                                                                                                                                                       
   10 root      rt   0       0      0      0 S 0.000 0.000   0:00.00 [migration/0]                                                                                                                                   

You have just one CPU. That’s probably a bad idea for any monitoring system, as monitoring is an inherently massively parallel workload. So I’m not surprised about the load.
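You can confirm the core count from inside the VM and put the load figure in context; a quick sketch with standard tools:

# How many CPUs does the VM actually see?
nproc
lscpu | grep '^CPU(s):'

# A load average of ~11 on a single CPU means roughly 11 tasks are
# runnable at any moment, all queueing for one core.
uptime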

Hi,
That was a change I made some weeks ago while doing vSphere tuning. The advice I received was to reduce all VMs to 1 CPU to minimize CPU overbooking. But it was understood that you need to watch and see what works. For a week or two things looked fine, and then they started to go sour. That is why I did not associate it with the CPU count right away.
Just this afternoon I needed to reboot because of a kernel update and decided to see if adding a CPU would help. It has been 2+ hours now and things are looking good. I will continue to observe the CPU load over time, and if things stay normal I will mark your post as the solution.
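In case it is useful to anyone else: to watch the load over time rather than eyeballing top, something like this works (assuming the sysstat package is installed):

# Sample run-queue length and load averages every 60 s, 10 samples
sar -q 60 10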
Thanks for your assistance!
Have a nice weekend!

I have a history as a VMware consultant and I have to say: sorry, but this is very bad advice…
Modern operating systems all expect to be multicore. If you provide just one CPU, well, you have serialized your workload.

Your advice is much appreciated.
