On 1/27/2017 11:59 AM, mlist@libero.it wrote:

I'm also interested in this thread, but while I have been able to locate the option to disable hardware/software inventory (WATO → Host & Service Parameters → Hardware/Software-Inventory → Do hardware/software Inventory), I couldn't work out how to just change the frequency. Could you please tell me exactly where to find the rule?
···
On 1/27/2017 10:45 AM, Mathieu Levi wrote:

There was no timeframe specified in the rule. As a result, it was probably triggering on every host check interval by default.

Matt
On Fri, Jan 27, 2017 at 10:32 AM, Jam Mulch wrote:

How often was the hw/sw inventory being run? I have a rule to set mine to run only once a day. I also created a host tag so I can limit which hosts get inventoried. I don't remember what the default is, but the rule enabling inventory recommends creating a rule to limit it to somewhere between 2 and 24 hours.
On 1/27/2017 10:16 AM, Mathieu Levi wrote:

Update: the CPU spike issue has been identified. It turns out that disabling this rule:

"Do hardware/software Inventory"

reduced the CPU usage of the Check_MK server by 50%, literally: it was at a sustained 70% and dropped to 20%. The funny thing is, we had the dependent plugins for that rule installed experimentally on only a couple of monitored hosts, so it must be some sort of server-side bug (again, CMK 1.2.8p9).

We're waiting to see whether the lower CPU utilization/contention fixes the other issue with random missing data in some of the graphs.

Matt
On Wed, Jan 18, 2017 at 4:32 PM, Mathieu Levi wrote:

-- The CPU usage is absolutely coming from Check_MK (the top processes right now just happen to be
-- It's CRE, installed via omd (off of Mathias' main download page)
-- Load average is around 30 (on a 12-core, hyperthreaded box). It's relatively high.
-- tmpfs is there, but it's only using about 122M right now.
On Wed, Jan 18, 2017 at 4:26 PM, Jam Mulch wrote:

Have you tried rebooting the server?

Is the CPU usage coming from cmk? (omd stop, then see if the CPU utilization drops to near 0 using top.) I'd use top to locate the processes using the most CPU, memory, etc.

CRE, Enterprise, OMD, or stand-alone check_mk?

If you lost your tmpfs storage, it slows the heck out of cmk. BTW, you are not even near 100% CPU utilization, so something else must be the real problem. (What's the load average, %wa, etc.? It could be blocked on I/O.)

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        29G   15G   13G  55% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           3.9G  404K  3.9G   1% /opt/omd/sites/site1/tmp
tmpfs           3.9G  280K  3.9G   1% /opt/omd/sites/site2/tmp
tmpfs           3.9G  240K  3.9G   1% /opt/omd/sites/site3/tmp
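The triage steps above can be sketched as a quick shell session. These are standard Linux tools (no Check_MK-specific commands are executed here); the "~30 on 12 cores" reference point is from this thread, and the `omd stop` step is shown only as a comment since the site name varies:

```shell
# Quick CPU triage on the monitoring host.

# 1. Load average vs. core count: a load of ~30 on 12 cores means the
#    box is heavily saturated.
cat /proc/loadavg
nproc

# 2. Check the mounts: if a site's tmpfs (/opt/omd/sites/<site>/tmp)
#    has disappeared, cmk slows down badly.
df -h

# 3. To prove the load comes from Check_MK, stop the site and watch
#    whether CPU utilization drops to near 0:
#       omd stop <sitename>   # then re-check with top or /proc/loadavg
```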
On 01/18/2017 03:20 PM, Mathieu Levi wrote:

Hello,

Does anyone have insight into the types of rules or configuration items that cause a massive, permanent CPU spike in Check_MK 1.2.8pX?

My situation is that I have a monitoring server with 500 or so monitored hosts and about 20,000 services. Sometime around mid-December, the average CPU utilization of the monitoring server jumped from 20% to over 70%, sustained, and stayed that way (attaching screenshot). I'm trying to figure that out, as the jump was massive and immediate.

Here's what did and did not change:

*** No new hosts have been added to monitoring since before the spike! Maybe a few custom (local) checks.
*** The hardware is still the same 12-core box as before the spike.
*** No other non-Check_MK processes or omd sites are running on the server, before or now.
*** I've tried restarting the entire monitoring server (via Mathias' "omd" installation, btw) numerous times.

I tried looking at the audit log, and all I saw of interest was one item on that day, which changed the "Maximum number of check attempts for service" rule to 16 for one service. I removed that rule with no real effect on load.

I also looked at the top running processes; it's mostly short-lived Python processes doing a wide variety of things.

One of the reasons I care about this issue is that time gaps are starting to appear in the RRD graphs for services, and I suspect high load on the monitoring server could be to blame.

My theory is that either:

a) I'm hitting a bug in this version of Check_MK;
b) I have a rule that's causing headaches in the processing, or an errant option someplace in the monitoring config;
c) the RRD databases have gotten large (although I doubt that's the cause); or
d) the Python version used by my installation, 2.6.6, has some issue that is only being seen now.

Are there any devs viewing this who might suggest how to troubleshoot? Is there a mode or a log that can clue me in here?

Thanks!
_______________________________________________
checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en
WATO → Configuration → Host & Service Parameters → Monitoring Configuration → Service Checks → Normal check interval for service checks

Create a rule in folder "Main directory":

Normal check interval for service checks: (days / hours / mins / secs)

Conditions:
Explicit hosts…
Services: Check_MK HW/SW Inventory
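For anyone configuring without WATO, the same effect can be approximated directly in `main.mk`. This is a sketch assuming the 1.2.8-era rule syntax; the interval value is in minutes (1440 = once a day), and the service-name pattern must match your inventory service exactly:

```python
# Sketch of a main.mk rule (assumption: Check_MK 1.2.8-era config syntax).
# Run the HW/SW inventory service only once per day (1440 minutes)
# instead of on every host check interval.
extra_service_conf["normal_check_interval"] = [
    ("1440", ALL_HOSTS, ["Check_MK HW/SW Inventory$"]),
]
```

As with the WATO rule, combining this with a host tag in the host condition lets you restrict which hosts get inventoried at all.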