[Check_mk (english)] R: Re: Troubleshooting Performance Issues - CMK Jump in CPU Usage

I'm also interested in this thread. I was able to locate the option to disable the hardware/software inventory (WATO, Host & Service Parameters, Hardware/Software-Inventory, Do hardware/software Inventory), but I couldn't work out how to change only its frequency. Could you please tell me exactly where to find the rule?

···

On 1/27/2017 10:45 AM, Mathieu Levi wrote:

There was no timeframe specified in the rule. As a result it was probably triggering it to run on every host check interval by default.

Matt

On Fri, Jan 27, 2017 at 10:32 AM, Jam Mulch wrote:

How often was the hw/sw inventory being run? I have a rule to set mine to run only once a day. I also created a host tag so I can limit which hosts get inventoried as well. I don't remember what the default is, but the rule enabling inventory recommends creating a rule to limit it to somewhere between 2hrs and 24hrs.
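To put numbers on why the interval matters, here is a rough back-of-the-envelope sketch. It uses the 500 hosts mentioned in the original post and assumes, purely for illustration, that an unlimited inventory check runs at a 1-minute check interval; the real interval on a given site may differ.

```python
def runs_per_day(hosts, interval_minutes):
    """Number of inventory executions per day across all hosts."""
    minutes_per_day = 24 * 60
    return hosts * minutes_per_day // interval_minutes


HOSTS = 500  # host count quoted in the original post

# The 1-minute entry is an assumption, not a value confirmed in the thread.
for label, minutes in [("1 min (assumed)", 1), ("2 h", 120), ("24 h", 1440)]:
    print("%-16s %7d runs/day" % (label, runs_per_day(HOSTS, minutes)))
```

Under that assumption, moving from every minute to once a day cuts the number of inventory runs by a factor of 1440.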



On 1/27/2017 10:16 AM, Mathieu Levi wrote:

Update: the CPU spike issue has been identified. It turns out that disabling this rule:

"Do hardware/software Inventory"

reduced the CPU usage of the Check_MK server by 50 percentage points, from a sustained 70% down to 20%. The funny thing is, we had the dependent plugins for that rule installed only on a couple of monitored hosts, experimentally, so it must be some sort of server-side bug (again, CMK 1.2.8p9).

We're waiting to see whether lower CPU utilization/contention fixes the other issue with random missing data in some of the graphs.

Matt
On Wed, Jan 18, 2017 at 4:32 PM, Mathieu Levi wrote:

-- The CPU usage is coming from Check_MK, absolutely (the top processes right now just happen to be

-- It's CRE, installed via omd (off of Mathias' main download page)

-- load average around 30 (on a 12-core / hyperthreaded box). It's relatively high.

-- tmpfs is there, but it's only using about 122M right now.

On Wed, Jan 18, 2017 at 4:26 PM, Jam Mulch wrote:

Have you tried rebooting the server?

Is the CPU usage coming from cmk? (omd stop, then see if the CPU util drops to near 0 using top.) I'd use top to locate the processes using the most CPU, memory, etc.

CRE, Enterprise, OMD, or stand-alone check_mk?

If you lost your tmpfs storage, it slows the heck out of cmk. BTW, you are not even near 100% CPU util, so something else must be the real problem. (What's the load avg, %wa, etc.? It could be blocked on I/O.)
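The load average and %wa questions can also be answered without watching top. The following is a minimal sketch, assuming a Linux box where `/proc/loadavg` and `/proc/stat` have their usual layout; the function names are made up for this example, and the figure of 24 logical CPUs for a 12-core hyperthreaded machine is an assumption.

```python
def load_per_cpu(loadavg_text, logical_cpus):
    """1-minute load average divided by the number of logical CPUs."""
    return float(loadavg_text.split()[0]) / logical_cpus


def cpu_fields(stat_text):
    """Counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(v) for v in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(stat_before, stat_after):
    """Share of CPU time spent in iowait between two /proc/stat samples."""
    # Field order: user nice system idle iowait irq softirq steal
    before = cpu_fields(stat_before)[:8]
    after = cpu_fields(stat_after)[:8]
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Load of 30 on an assumed 24 logical CPUs (12 cores, hyperthreaded):
print(load_per_cpu("30.00 28.50 27.90 5/812 12345", 24))
```

To use it live, read `/proc/stat` twice a few seconds apart and pass both texts to `iowait_percent`. A high value would point toward the disk rather than the checks themselves.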

    # df -h
    Filesystem  Size  Used  Avail  Use%  Mounted on
    /dev/sda2    29G   15G    13G   55%  /
    tmpfs       3.9G     0   3.9G    0%  /dev/shm
    tmpfs       3.9G  404K   3.9G    1%  /opt/omd/sites/site1/tmp
    tmpfs       3.9G  280K   3.9G    1%  /opt/omd/sites/site2/tmp
    tmpfs       3.9G  240K   3.9G    1%  /opt/omd/sites/site3/tmp
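The same tmpfs check can be scripted. This is only a sketch: it assumes the site tmp directories live under `/opt/omd/sites/<site>/tmp` as in the df output above, and `sites_without_tmpfs` is a hypothetical helper name, not part of OMD.

```python
def sites_without_tmpfs(mounts_text, site_names, omd_root="/opt/omd/sites"):
    """Return the sites whose tmp directory is not mounted as tmpfs.

    mounts_text is the content of /proc/mounts, where each line reads
    '<device> <mountpoint> <fstype> <options> <dump> <pass>'.
    """
    tmpfs_mounts = set()
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[2] == "tmpfs":
            tmpfs_mounts.add(parts[1])
    return [
        site for site in site_names
        if "%s/%s/tmp" % (omd_root, site) not in tmpfs_mounts
    ]


def check_live(site_names):
    """Run the check against the mounts of the current machine."""
    with open("/proc/mounts") as handle:
        return sites_without_tmpfs(handle.read(), site_names)
```

Calling `check_live(["site1", "site2", "site3"])` would return an empty list on the box shown above, since all three tmp directories appear as tmpfs.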

On 01/18/2017 03:20 PM, Mathieu Levi wrote:

Hello,

Does anyone have insight into the types of rules or configuration items that cause a massive, permanent CPU spike in the running of Check_MK 1.2.8pX?

My situation is that I have a monitoring server with 500 or so monitored hosts and about 20,000 services. Sometime around mid-December, the average CPU utilization of the monitoring server jumped from 20% to over 70%, sustained, and it has stayed that way ever since (attaching screenshot). I'm trying to figure that out, as the jump was massive and immediate.

Here's what did and did not change:

*** No new hosts have been added to monitoring since before the spike! Maybe a few custom (local) checks.

*** The hardware is still the same 12-core box as before the spike.

*** No other non-Check_MK processes or omd sites are running on the server, before or now.

*** I've tried restarting the entire monitoring server (via Mathias' "omd" installation, btw) numerous times.

I tried looking at the audit log, and all I saw of interest was one item on that day, which changed the "Maximum number of check attempts for service" rule to 16 for one service. I removed that rule with no real effect on load.

I also looked at the top running processes; it's mostly short-lived python processes doing a wide variety of things.
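Short-lived processes are hard to follow in top, so one option is to take snapshots with ps and add up the CPU share per command line. The sketch below is illustrative only: `cpu_by_command` and `snapshot` are made-up names, and the `pcpu` column of ps is CPU time averaged over each process's lifetime, not an instantaneous reading.

```python
import collections
import subprocess


def cpu_by_command(ps_text):
    """Sum %CPU and count processes per command line.

    Expects lines of the form '<pcpu> <args>', as printed by
    `ps -e -o pcpu= -o args=`.
    """
    totals = collections.defaultdict(lambda: [0.0, 0])
    for line in ps_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank line or process without a command line
        try:
            pcpu = float(parts[0])
        except ValueError:
            continue  # not a data row
        entry = totals[parts[1]]
        entry[0] += pcpu
        entry[1] += 1
    rows = [(cmd, cpu, count) for cmd, (cpu, count) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def snapshot():
    """Take one live sample; repeat in a loop to catch short-lived processes."""
    return subprocess.check_output(
        ["ps", "-e", "-o", "pcpu=", "-o", "args="], universal_newlines=True
    )
```

Running `cpu_by_command(snapshot())[:10]` a few times in a row should show whether one kind of check keeps reappearing at the top of the list.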

One of the reasons I care about this issue is that there are starting to be time gaps in the rrd graphs for services, and I'm suspecting high load on the monitoring server could be to blame.
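To see whether the gaps line up with the load spike, the missing intervals can be listed from the text that `rrdtool fetch` prints. This is a sketch under the assumption that each data row looks like `<timestamp>: <value> ...` with `nan` for unknown values; `find_gaps` is a hypothetical helper, and the RRD file to inspect is not named in the thread.

```python
import math


def find_gaps(fetch_output):
    """Return (first, last) timestamps of each run of rows without data."""
    gaps = []
    gap_start = gap_end = None
    for line in fetch_output.splitlines():
        stamp, sep, values = line.partition(":")
        if not sep:
            continue  # header with data source names, or blank line
        try:
            timestamp = int(stamp)
            numbers = [float(v) for v in values.split()]
        except ValueError:
            continue  # not a data row
        missing = not numbers or all(math.isnan(v) for v in numbers)
        if missing:
            if gap_start is None:
                gap_start = timestamp
            gap_end = timestamp
        elif gap_start is not None:
            gaps.append((gap_start, gap_end))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, gap_end))
    return gaps
```

Each returned pair marks the first and last timestamp of a stretch with no stored values, which can then be compared against the times of high load.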

My theory is that either

a) I'm hitting a bug in this version of Check_MK

b) I have a rule that's causing headaches in the processing, or an errant option someplace in the monitoring config

c) the RRD db's have gotten large (although I doubt that's the cause)

d) the Python version being used with my installation, 2.6.6, has some issues that for some reason are only being seen now.

Are there any devs viewing this that might suggest how to troubleshoot? Is there a mode or a log that can clue me in here?

Thanks!

_______________________________________________
checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
[http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en](http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en)
    python /omd/sites/prod/share/check_mk/modules/check_mk.py --defaults /omd/sites/prod/etc/check_mk/defaults --inv-fail-status=1 --hw-changes=0 --sw-changes=0 --cache --inventory-as-check

WATO - Configuration → Host & Service Parameters → Monitoring Configuration → Service Checks → Normal check interval for service checks

Create a rule in folder Main directory with Service name set to **Check_MK HW/SW Inventory**.

Normal check interval for service checks: days, hours, mins, secs

**Conditions:**

Explicit hosts…

Services…
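For reference, a rule like this ends up in the folder's rules.mk file. The fragment below is only a sketch of what I would expect WATO to write on 1.2.8 for a 24-hour interval; the exact layout is an assumption, so compare it with the file your own site generates. The first two assignments are stand-ins so the fragment runs on its own, since Check_MK normally provides those names.

```python
# Stand-ins so this fragment runs outside Check_MK; inside a real
# rules.mk these two names are provided by Check_MK itself.
ALL_HOSTS = ['@all']
extra_service_conf = {}

extra_service_conf.setdefault('check_interval', [])

extra_service_conf['check_interval'] = [
    # (interval in minutes, host tags, host list, service name patterns)
    # 1440 minutes = once a day; layout assumed, not taken from the thread.
    (1440.0, [], ALL_HOSTS, ['Check_MK HW/SW Inventory$']),
] + extra_service_conf['check_interval']

print(extra_service_conf['check_interval'])
```

Editing the rule through WATO is the safer route; the fragment is mainly useful for checking afterwards that the interval was really stored.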

···

On 1/27/2017 11:59 AM, mlist@libero.it wrote:

I'm also interested in this thread but while I have been able to locate the option to disable hardware/software inventory, (WATO, Host & Service Parameters, Hardware/Software-Inventory, Do hardware/software Inventory), I couldn't understand how to just change frequency. Could you please tell me exactly where to find the rule?

----Original message----

From: "Jam Mulch"

Date: 27/01/2017 17:36

To: "Mathieu Levi" mlevi+cmk@collective.com

Cc: checkmk-en@lists.mathias-kettner.de

Subject: Re: [Check_mk (english)] Troubleshooting Performance Issues
