Summing the omd cores, I'm at around 50 service checks/sec and 4 host checks/sec.
I have hundreds of CPUs and nearly 1 TB of RAM to spare, so "adding" resources is not an issue per se.
If I saw constantly high CPU usage, it would be easier to answer the question "do I need more resources?", but the simple facts are:
a) only a few TYPES of checks, all of them PASSIVE, show up as stale
b) checks are not spread across the interval: almost all of them run within a 90-second window, and 3 full minutes are then spent doing nothing (with the configured 5-minute interval)
c) I can't make sense of the RAM figures, but if the numbers from other installations *do* match, I know that to use check_mk I need 6 GB of RAM for every 12k services
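To put point (b) in numbers, here is a rough sketch. It assumes, purely for illustration, that all 12250 services from my original post run inside the 90-second burst; the function name `burst_vs_even_rate` is made up and is not something Check_MK provides.

```python
def burst_vs_even_rate(services, interval_s, burst_s):
    """Compare the check rate when checks are spread evenly over the
    interval with the rate when they are all squeezed into a burst."""
    if interval_s <= 0 or burst_s <= 0:
        raise ValueError("interval_s and burst_s must be positive")
    even = services / interval_s   # checks/sec if perfectly spread
    burst = services / burst_s     # checks/sec during the busy window
    return even, burst


# Assumption: every one of the 12250 services runs in the 90 s window
# of a 300 s (5-minute) interval.
even, burst = burst_vs_even_rate(12250, 300, 90)
print(f"even: {even:.1f}/s, burst: {burst:.1f}/s, ratio: {burst / even:.2f}x")
# prints: even: 40.8/s, burst: 136.1/s, ratio: 3.33x
```

Under that assumption the machine has to sustain roughly three times the average rate during the burst, which would be consistent with the spiky CPU usage.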
Simone Bizzotto
-----Original Message-----
From: checkmk-en [mailto:checkmk-en-bounces@lists.mathias-kettner.de] On Behalf Of Christopher Cox
Sent: Thursday, June 21, 2018 18:35
To: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] check_mk performance
Ours is more out of band (which is recommended for monitoring).
We're running the older 1.2.8p6, but will be testing 1.4 soon (in an in-band VM, but just for QA and Dev).
You've got a ton of services for one Check_MK. I can speak for 1.2.8p6, and remember, we have dedicated physical hardware with 2 x X5650s and 32GB of memory on a 10Gbit network. We have about 4,800 services and that's pushing scale. With that said, people tell me that 1.4 scales much better (not sure if it will scale to the levels you're talking about though, and of course it all depends on the "plugins" and such that you have... some services are "slow" and "heavy" to monitor).
So... I can't say for sure about 1.4 (we took down our initial test for various reasons). Our 1.4, when it comes back up, will be done as a VM because it will be just for QA and Dev. IMHO, you may need more than one Check_MK if you have a lot to monitor.
Oh, with the exception of a few services that we know are heavy, our normal checks happen at the default 1-minute interval for almost all of those 4000+ services across 127 monitored servers (again, IMHO, we're about at the limit of what I like to see on a single Check_MK server).
Also remember that individual client load can also affect Check_MK passes.
(so we get occasional "stale" passes for some services, you just don't want to see them all the time)
Your CPU loads are a LOT higher than ours, but we have a much bigger machine.
Is your Check_MK monitored (are you monitoring yourself)?
Look at your checks per second (host checks/service checks).
Ours does about 80 service checks per second (average over 1 week), which puts us at the 1-minute interval.
If you have 12250 checks... of course, you also relaxed your interval to 5 minutes, which is pretty huge. So by my math, you had better be doing at least 40 checks per second just to maintain that 5-minute interval.
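The arithmetic behind those figures can be sketched as follows. This is only a back-of-the-envelope calculation; the function name `required_checks_per_second` is made up for illustration and is not part of Check_MK.

```python
def required_checks_per_second(services, interval_seconds):
    """Minimum sustained rate needed to run every service check
    once per check interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return services / interval_seconds


# Figures quoted in this thread.
print(round(required_checks_per_second(12250, 5 * 60), 1))  # 40.8
print(round(required_checks_per_second(4800, 60), 1))       # 80.0
```

A measured service check rate well below the first number would mean the site cannot keep up with a 5-minute interval at all.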
On 06/20/2018 07:58 AM, Simone Bizzotto wrote:
Hello all,

we’re currently evaluating Check_MK but we’re in a bit of a pickle.
We installed the CRE edition (started with 1.4.0p26, recently upgraded to 1.4.0p31) and are observing a steady number of “stale” checks.

90% of our hosts are Windows, checked via the Check_MK agent. Stale service checks are always related to:
* Disk IO Summary
* Processor Queue
* Web Service

From what I understand, those are passive checks, so if the problem were “Nagios-related” I’d also get staleness on the Check_MK service itself. Instead, the Check_MK service is always “not-stale”, while those (again, AFAIK) dependent passive checks are getting stale.

Besides an explanation of why this happens, what are the expected figures on a “fresh” install?

These are the details, ask away if you need more:
* OS is Debian 8 on a VM, 4 CPU, 4GB RAM
* Load is 6.40, 3.87, 4.28
* CPU usage is spiky, Check_MK interval has been relaxed to 5 minutes
* RAM shows WARN - RAM used: 1.86 GB of 3.86 GB, Swap used: 0.00 B of 1.86 GB, Total virtual memory used: 1.86 GB of 5.72 GB (32.5%), Committed: 6.97 GB (121.8% of RAM + Swap, warn/crit at 100.0%/150.0%)
* Hosts are 464
* Services are 12250
* Speed-O-Meter keeps flipping from 50 to 150-180-200%

We created two additional monitoring sites (same host) and tried to move some hosts to them (hoping for more linear resource usage), but that didn’t help either.

Summing up, I have Check_MK:
* with a generally low CPU load, with spikes that seem to point to a “not fair” scheduling algorithm
* starting to have too many stale checks
* using loads of committed RAM, and I really don’t know why

Not having a reference (or, at least, a modern one based on CRE and not CEE or anything else), I’m not able to tell whether this is all perfectly normal or not.

Can you guys help?
Thanks
Thanks
Simone Bizzotto
ABB Group in Italy adopts a Compliance Programme under the Italian Law
(D.Lgs.231/2001). According to this ABB Compliance Programme, any
commitment of ABB Italian Companies is taken by the double signature
of ABB Representatives granted by a proper Power of Attorney with the
only exception of Managing Director or General Manager. The information
included in this e-mail and any attachments are confidential and may
also be privileged. If you are not the correct recipient, you are
kindly requested to notify the sender immediately, to cancel it and
not to disclose the contents to any other person.
_______________________________________________
checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
Manage your subscription or unsubscribe
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en