Check_mk Agent accuracy

Hi Team,

We have check_mk 16.0.P11 raw edition installed and are monitoring 2018 hosts and 85297 services. Is there any way to find out how much time the check_mk agents take to collect the stats from the 2018 hosts, and how long check_mk takes to process them?

Regards,
Krishna

Hi @krishna505,

You should have a service named Check_MK on each host, for both agent-based and SNMP-based hosts, which tells you the execution time of the gathering and processing.
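
If you want those numbers for all hosts at once instead of opening each host, Livestatus can return them in one query. The following is only a sketch under a few assumptions: that the runtime of the Check_MK service is what the `execution_time` column of the Livestatus `services` table reports, and that the Livestatus socket of your site is reachable as a UNIX socket. The helper names `fetch` and `slowest` are made up for this example.

```python
import json
import socket

# Livestatus query: execution time (seconds) of every service named "Check_MK".
QUERY = (
    "GET services\n"
    "Columns: host_name execution_time\n"
    "Filter: description = Check_MK\n"
    "OutputFormat: json\n"
    "\n"
)

def fetch(socket_path, query=QUERY):
    """Send one query to the Livestatus UNIX socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the query is complete
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")

def slowest(raw_json, top=10):
    """Return the `top` (host, seconds) pairs with the longest execution time."""
    rows = [(host, float(secs)) for host, secs in json.loads(raw_json)]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]
```

Run as the site user, something like `slowest(fetch("/omd/sites/<yoursite>/tmp/run/live"))` should then list the hosts whose Check_MK service takes longest; the socket path shown is the usual OMD location and may differ on your installation.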

Additionally, you can take a look at the sidebar snap-in Server Performance to get an overview of how your whole system is handling the load.

Thanks @tosch @Dirk


I can see my check performance stats in the sidebar, but I'm experiencing an issue with the agents.

Check_mk service is stale: no data has been received within the last 1.5 check periods

Without enabling distributed monitoring, are there any other ways to mitigate this issue?

How is your balance between SNMP-based and agent-based hosts? Your server performance shows that your system is only processing 8 host checks per second, which is roughly 500 hosts a minute. So you are about 1500 hosts short each minute, and this causes your stale states.
Are all 2000 hosts monitored via the server you posted the performance from?
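
To make that arithmetic explicit, here is a rough sketch. The 8 checks per second and the host count come from this thread; the 60-second check interval is an assumption on my side, so adjust it if yours differs.

```python
HOSTS = 2018            # hosts monitored, from the original post
CHECKS_PER_SECOND = 8   # host checks/s shown in the Server Performance snap-in
CHECK_INTERVAL = 60     # seconds; ASSUMED check period, not confirmed in the thread
STALE_FACTOR = 1.5      # a service goes stale after 1.5 check periods

def capacity_report(hosts, checks_per_second, interval, stale_factor=1.5):
    """Compare the observed check rate with what the host count requires."""
    checked_per_interval = checks_per_second * interval
    return {
        "checked_per_interval": checked_per_interval,
        "hosts_missed": max(0, hosts - checked_per_interval),
        "required_checks_per_second": hosts / interval,
        "full_cycle_seconds": hosts / checks_per_second,
        "stale_after_seconds": stale_factor * interval,
    }

report = capacity_report(HOSTS, CHECKS_PER_SECOND, CHECK_INTERVAL, STALE_FACTOR)
for key, value in report.items():
    print(f"{key}: {value:.1f}")
```

With these inputs, one full pass over all hosts takes about 250 seconds, well beyond the 90-second staleness threshold, and the server would need roughly 34 host checks per second to keep up.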

A huge performance boost would be to use the checkmk Enterprise Edition, because then you can use the checkmk Micro Core, which is much faster and can handle helper processes.

We are using check_mk agents to monitor our machines.
Yes, we are monitoring all 2018 machines from this host. If we increase the number of CPU cores and the RAM size in our current system, will it help us?

Currently we are using an 8-core CPU and 16 GB of RAM.

What is the overall load of your system? Do you see any bottlenecks on the underlying system?

I guess just adding more resources has its limits, but I am no expert on the Nagios core.

Is it all in a LAN environment, or do you have some WAN hosts? Do you have a 60s check period or a higher value?

CPU core utilization

Tasks: 367 total, 34 running, 332 sleeping, 0 stopped, 1 zombie
Cpu0 : 75.4%us, 24.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 76.9%us, 22.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu2 : 81.4%us, 18.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 70.5%us, 29.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu4 : 80.3%us, 18.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st
Cpu5 : 80.7%us, 18.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu6 : 81.1%us, 18.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu7 : 70.3%us, 29.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 16423292k total, 13932816k used, 2490476k free, 397540k buffers
Swap: 8380412k total, 289192k used, 8091220k free, 7909680k cached

and load:

07:00:24 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
07:10:19 66 580 56.89 56.64 56.51
07:20:18 64 569 57.41 57.89 57.13
07:30:20 55 540 58.83 58.04 57.53
07:40:20 74 570 56.25 55.69 56.38
07:50:22 85 640 60.13 57.72 56.40
08:00:17 71 591 63.78 64.98 60.56
08:10:15 56 530 40.56 57.35 60.66
08:20:22 56 593 47.85 38.83 49.22
08:30:25 82 664 62.68 56.29 52.29
08:40:25 72 636 59.58 58.37 55.51
08:50:18 10 511 57.66 51.77 53.59
09:00:26 55 621 61.09 46.97 49.16
09:10:20 71 625 41.94 48.54 50.50
09:20:20 56 589 58.40 56.67 53.61
09:30:20 85 645 56.55 58.11 56.16
09:40:19 54 577 52.42 58.57 58.05
09:50:26 88 624 66.21 56.57 56.49
10:00:21 8 493 67.10 67.03 61.89
10:11:17 87 652 78.86 72.03 67.07
10:20:27 66 624 65.20 66.93 67.40
10:30:21 50 579 46.65 51.11 59.19
10:40:21 67 619 54.29 54.45 57.55
10:50:22 68 596 65.54 60.81 59.29
11:00:35 68 610 57.60 58.81 58.78
11:10:18 10 490 51.68 55.01 57.11
11:20:19 54 525 58.61 57.03 56.99
11:30:20 55 560 52.74 55.62 56.50
11:40:21 85 642 50.91 49.88 52.53
11:50:16 52 623 63.01 55.61 53.29
12:00:26 78 645 61.08 56.36 55.07
12:10:20 13 482 71.09 64.10 58.94
12:20:20 42 548 67.03 69.81 64.70
12:30:07 7 495 52.48 59.10 61.53
12:40:21 63 561 65.26 59.68 58.91
12:50:42 61 592 57.18 56.28 56.79
13:00:20 71 598 56.04 57.87 57.70
13:10:22 65 621 62.27 60.79 59.37
13:20:16 14 594 53.52 59.19 59.14
13:30:24 79 654 62.27 59.05 59.27
13:40:12 37 542 62.90 61.50 60.26
13:50:17 45 610 56.96 55.61 56.89
Average: 58 590 60.73 59.71 59.56

Nice overload :smiley:
Increase your machine to a minimum of 32 cores.
Today I had a machine with CRE and 700 hosts, a nice mix of agent and SNMP.
It needed 16 cores of an old Intel server CPU.

The CPU type is also important for determining the real need. But a load average between 55 and 65 means that you have roughly 8 times more processes waiting than you have CPU cores available.

Your load average should not be higher than your CPU count.
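
As a quick sanity check of that rule, the snippet below divides the values from the `Average:` line of the load output posted above by the 8 cores mentioned earlier in the thread. It is only a sketch; the function name is made up for illustration.

```python
CPU_CORES = 8  # from the thread: 8 core CPU

# The "Average:" line from the load output posted above.
AVERAGE_LINE = "Average: 58 590 60.73 59.71 59.56"

def load_per_core(line, cores):
    """Return [ldavg-1, ldavg-5, ldavg-15], each divided by the core count."""
    fields = line.split()
    # Columns: label, runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15
    loads = [float(value) for value in fields[3:6]]
    return [load / cores for load in loads]

ratios = load_per_core(AVERAGE_LINE, CPU_CORES)
for name, ratio in zip(("1 min", "5 min", "15 min"), ratios):
    status = "overloaded" if ratio > 1.0 else "ok"
    print(f"{name}: {ratio:.2f} load per core ({status})")
```

Anything clearly above 1.0 per core means work is queueing; here all three averages come out at more than 7 per core.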


And with that load you would definitely benefit from the CMC from the Enterprise Edition.
