[Check_mk (english)] Memory management and overcommit for check_mk

Oliver_O_Boyle · June 27, 2016, 6:04pm

Hi,

I’ve noticed that my cmk VMs overcommit a huge amount of memory but appear to operate just fine with very little memory. For example, on a system with 70 hosts and 1500 services that also has Event Console enabled and is receiving SNMP traps from about 55-60 devices, what is the recommended amount of memory?

The 70-host system runs fine with 1 GB of RAM, but it consistently overcommits to between 5 and 6 GB even though it never appears to use more than 1 GB, regardless of how much memory I install. Going to 4 GB of installed memory didn’t change anything other than allowing some more caching to take place, with more Active memory showing than before. There was no noticeable advantage to adding the extra memory.

If I tweak vm.overcommit_memory and vm.overcommit_ratio I can confirm that cmk breaks. When it can’t overcommit any more, either it stops serving web pages (proxy errors) or hosts start to grey out and appear disconnected.
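For readers who want to see why tightening those two sysctls bites so quickly, here is a minimal sketch of the arithmetic. It assumes the kernel's documented strict-mode formula (with `vm.overcommit_memory=2`, the limit is swap plus RAM times `vm.overcommit_ratio` percent, ignoring huge pages). The sample figures are illustrative only: the 1 GB of swap and the ratio of 50 are assumptions, not values from this system.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            values[key.strip()] = int(rest.split()[0])
    return values


def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio):
    """Strict-mode commit limit: swap + RAM * ratio / 100 (huge pages ignored)."""
    return swap_total_kb + mem_total_kb * overcommit_ratio // 100


# Illustrative numbers only: 1 GB RAM, an assumed 1 GB of swap,
# and about 5.5 GB committed, loosely modelled on the VM described above.
SAMPLE = """MemTotal:        1048576 kB
SwapTotal:       1048576 kB
Committed_AS:    5767168 kB
"""

info = parse_meminfo(SAMPLE)
limit = commit_limit_kb(info["MemTotal"], info["SwapTotal"], overcommit_ratio=50)
print(f"Committed_AS {info['Committed_AS']} kB, strict-mode limit {limit} kB")
```

On a real host the same function can be fed the contents of `/proc/meminfo`; the `Committed_AS` and `CommitLimit` lines there show how far apart the two values are.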

At the moment, I need to change the thresholds for this system or disable notifications because the Memory service keeps flapping between OK/WARN/CRIT as a result of the overcommits. I’d prefer not to disable, so at a minimum, I need to know what thresholds to consider normal.

Any insight into best practices would be appreciated.

Oliver


_________________________________

Oliver O’Boyle

Director, IT • Atlific Hotels

250 Saint-Antoine W., Suite 400 Montreal, Quebec H2Y 0A3

T: 514.509.5545 C: 514.608.8533 F: 514.509.5498

ooboyle@atlific.com www.atlific.com

_________________________________

andreas-doehler · June 27, 2016, 6:21pm

Hi Oliver,

If you look at the running processes, I think most of the memory is consumed/committed by the Apache processes.

The monitoring core doesn’t need much memory, and Event Console memory consumption depends on the number of open events.

On my machines 4 GB is the minimum if users are working with the web frontend. If the machine is also part of a distributed setup, Apache will be the process with the most memory usage.

Best regards.

Andreas

checkmk-en mailing list

checkmk-en@lists.mathias-kettner.de

http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en

Oliver_O_Boyle · June 27, 2016, 6:57pm

Thanks, Andreas.

For sure Apache is a major memory user here. And I am running in a distributed environment as well. We’re using the web front end but minimally (usually only 1 or 2 sessions concurrently).

Out of curiosity, how many hosts/services are you monitoring, and how many distributed sites? How much committed memory do you see?

As I mentioned, doubling and even tripling the installed RAM didn’t make much of a difference anywhere that I could see (GUI response times are similar, no apparent errors, etc.).

Oliver


Oliver_O_Boyle · June 27, 2016, 9:19pm

I’ve tweaked Apache and there’s no difference.

ps -axl (RSS column) and atop (mem VSIZE) clearly show apache2 and Nagios as the main culprits. I think the biggest culprit is Nagios, which seems to spawn several 1G+ processes on a regular basis. I suspect it’s this regular stream of 1G+ processes that’s making the cmk graphs show a solid 3-6G of committed memory.
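A small sketch of how that comparison can be tabulated per command, rather than read off by eye. It assumes the output of `ps -eo rss,vsz,comm` (both columns in KiB); the sample rows are made up for illustration and are not measurements from this system.

```python
from collections import defaultdict


def totals_by_command(ps_output):
    """Sum RSS and VSZ (KiB) per command name.

    Expects the text produced by `ps -eo rss,vsz,comm`, header line included.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        rss, vsz, comm = int(parts[0]), int(parts[1]), parts[2].strip()
        totals[comm][0] += rss
        totals[comm][1] += vsz
    return {comm: tuple(sums) for comm, sums in totals.items()}


# Hypothetical sample: two short-lived nagios children and one apache2 worker.
SAMPLE = """  RSS     VSZ COMMAND
20000 1300000 nagios
 1500 1300000 nagios
30000  400000 apache2
"""

for comm, (rss, vsz) in totals_by_command(SAMPLE).items():
    print(f"{comm:10s} rss={rss} KiB vsz={vsz} KiB")
```

Summed RSS counts shared pages once per process, so it overstates real usage; the point of the table is only the gap between resident and virtual size.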

Is 1G+ normal for a Nagios thread?


Oliver_O_Boyle · June 28, 2016, 4:40pm

It’s definitely the continuous string of ‘nagios -ud’ commands being run.

Is it normal for so many Nagios processes to be started and terminated within seconds? Is there no better way to reduce the footprint here by using a smaller number of processes, rather than a continuous string of new processes that each request 1.3 GB of memory when run?

Maybe I’m just being naïve :)


MarsellusWallace · June 28, 2016, 6:36pm

Hi Oliver,

Nagios forks for each active check. If you’re out of resources (like we were), use gearman (don’t ask me how to configure it) or switch to CEE. CEE uses workers by default and reduced the resources we needed by 60% here.
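To illustrate why a freshly forked child can show a large virtual size without using much new RAM, here is a rough, Linux-only sketch. A forked child inherits the parent's whole address space copy-on-write, so its `VmSize` reflects the parent's allocations even though no new pages have been written. The function names and the 200 MB ballast are hypothetical choices for the demonstration, not anything taken from Nagios itself.

```python
import os


def parse_status(text):
    """Return VmSize and VmRSS (kB) from /proc/<pid>/status text."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            out[key] = int(rest.split()[0])
    return out


def child_memory_after_fork(parent_alloc_mb=200):
    """Linux only: allocate in the parent, fork, report the child's view."""
    ballast = bytearray(parent_alloc_mb * 1024 * 1024)  # parent-side allocation
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: send its own status back to the parent, then exit at once.
        os.close(read_fd)
        with open("/proc/self/status") as f:
            os.write(write_fd, f.read().encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        text = reader.read()
    os.waitpid(pid, 0)
    del ballast
    return parse_status(text)
```

Calling `child_memory_after_fork()` on a Linux box should report a child `VmSize` that includes the parent's ballast, which is the same pattern as many short-lived children each appearing to request the parent's full size.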

Regards,
Marcel


Oliver_O_Boyle · June 28, 2016, 7:04pm

Hi Marcel,

We’re not out of resources (yet), but I’m planning to roll out a small server to a large number of very small remote sites that have very limited resources. At the moment, I’m trying to reduce the footprint to the smallest reasonable size without breaking functionality.

CEE is interesting and we may go that route. Gearman is also pretty interesting but probably overkill for us at this stage. That said, I have a bunch of in-house apps that may benefit significantly from it, so it’s on my list of things to investigate :)

I think the answer I’m looking for at this point in this thread is: because the forked Nagios processes are alive for such a short time, and because they request large amounts of committed memory but actually use only small amounts of installed RAM, I can mostly ignore this phenomenon. It just means that I’ll need to change the thresholds on the service checks so that the memory checks don’t flap all the time.

Thanks for the input!

Oliver
