We have been running distributed monitoring with check-mk-raw for over 10 years
now, with ~2000 hosts and ~60000 services spread over 8 slaves. We are
planning to go enterprise.
Two features are very important for us:
1. service dependency discovery with root cause detection
2. predictive monitoring
I know that predictive monitoring is only possible in CMC if the
respective checks have implemented it, like CPU load. But what about checks
that don't? Ceph, for instance?
The performance improvement with CMC and the other features are very nice,
but they are not a reason for the migration at the moment.
Will root cause detection and service dependency discovery
ever be implemented in Check_MK?
The parent-child relationships and Business Intelligence are very tedious to
set up and continuously changing. As you can imagine, when replacing some
compute nodes or introducing new tenants with new VMs, all of this needs to
be set up from scratch or modified.
Predictive monitoring is possible on any metric if you write a custom Nagios plugin that uses RRDtool's PREDICTSIGMA function and run it as an active check.
As for root cause detection… I dunno… do association rule learning on anomaly sets extracted from the data generated by the aforementioned predictsigma check?
I’d do it myself, but I’m overworked and underpaid.
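To make the idea of a sigma-based predictive check more concrete, here is a minimal sketch in Python. It does not call RRDtool at all: it assumes the plugin has already fetched a list of past samples at a fixed step, and it only loosely mirrors what PREDICT/PREDICTSIGMA do by averaging the samples that sit whole periods before the current slot. The names `predict`, `check`, `period`, `periods_back`, `warn_sigmas` and `crit_sigmas` are hypothetical and chosen for illustration.

```python
import math

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def predict(history, period, periods_back):
    """Return (mean, sigma) of the samples 1..periods_back whole periods ago.

    history holds past samples at a fixed step, oldest first. The value
    being checked would be the next sample, i.e. index len(history).
    """
    n = len(history)
    samples = [
        history[n - k * period]
        for k in range(1, periods_back + 1)
        if n - k * period >= 0
    ]
    if len(samples) < 2:
        return None  # not enough history to estimate a spread
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, math.sqrt(variance)


def check(current, history, period, periods_back=4,
          warn_sigmas=2.0, crit_sigmas=3.0):
    """Compare the current value with the predicted band, Nagios style."""
    prediction = predict(history, period, periods_back)
    if prediction is None:
        return UNKNOWN, "UNKNOWN - not enough history"
    mean, sigma = prediction
    deviation = abs(current - mean)
    if deviation > crit_sigmas * sigma:
        state, label = CRITICAL, "CRITICAL"
    elif deviation > warn_sigmas * sigma:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    message = "%s - value %.2f, predicted %.2f +/- %.2f" % (
        label, current, mean, sigma)
    return state, message
```

A real plugin would print the message and exit with the returned state, and would read its history from the RRD file of the service rather than from a Python list.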
Sounds very interesting. I wanted to avoid writing all these kinds of predictive monitoring scripts, but I think there is no way around it.
I am still wondering what an implementation of service dependency discovery and root cause detection would look like. I am not talking about a specific service, but rather about all the service dependencies in our application and infrastructure landscape.
For example:
We have had situations where some apps go critical, stop providing data, or hang. As admins, we can easily spend 10-20 minutes trying to find out why, and in the end it is a DB query that takes too long and blocks all other operations. So we wrote local checks for such incidents. We have a big set of such local checks, but unfortunately only the developers and the DB admins know the dependencies. Sometimes it is just a network issue, but identifying the bottleneck can also take some time.
So we are thinking of implementing some kind of service dependency discovery with root cause detection. I know it is not trivial, but I would like to know whether it is possible with check_mk and what the best approach would be.
Thanks
On 12/14/19 1:47 AM, Patrick Gavin wrote:
Predictive monitoring is possible on any metric if you create a custom nagios plugin that uses rrd predictsigma and call it as an active check.
As for root cause detection… I dunno… do association rule learning on anomaly sets extracted from the data generated by the aforementioned predictsigma check?
I’d do it myself, but I’m overworked and underpaid.
I’m in the process of reworking my predictive monitoring check to run under check-mk/omd. It was originally written for plain old nagios/pnp4nagios. It’s not currently ready for release, but you can get the gist of it here: [https://github.com/wezelboy/check_predicted](https://github.com/wezelboy/check_predicted)
The root cause analysis stuff is just an idea. I don’t really have time to implement it.
-P
This looks pretty good, I'll go through it. As for root cause detection, maybe it would be possible to make use of the Apriori algorithm to compute the metrics. I've never worked with it before, but I am really interested. I first need to refresh my machine learning knowledge.
Thanks a lot
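As a rough illustration of the association rule idea discussed above, the sketch below mines pairwise rules from sets of services that were anomalous in the same time window. It is a simplified stand-in, not a full Apriori implementation: it only covers single items and pairs, using the Apriori pruning step of dropping infrequent items before counting pairs. The function name `pair_rules`, the thresholds, and the service names in the usage example are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(anomaly_sets, min_support=0.3, min_confidence=0.8):
    """Return rules (x, y, support, confidence) meaning 'x anomalous -> y anomalous'.

    anomaly_sets: one collection of anomalous service names per time window.
    support:      fraction of windows in which both x and y were anomalous.
    confidence:   fraction of the windows with x in which y was anomalous too.
    """
    n = len(anomaly_sets)
    item_counts = Counter(item for s in anomaly_sets for item in set(s))
    # Apriori pruning: a pair can only be frequent if both members are.
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}

    pair_counts = Counter()
    for s in anomaly_sets:
        items = sorted(set(s) & frequent)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    # Strongest rules first, names as tie-breakers for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


if __name__ == "__main__":
    windows = [
        {"db_slow_query", "app_timeout", "app_queue"},
        {"db_slow_query", "app_timeout"},
        {"app_timeout"},
        {"net_latency", "app_timeout"},
        {"db_slow_query", "app_timeout"},
    ]
    for x, y, support, confidence in pair_rules(windows):
        print("%s -> %s (support %.2f, confidence %.2f)" % (x, y, support, confidence))
```

A rule like this only says that two anomalies tend to appear together. It does not prove which one is the cause, so the output would at best be a list of candidate dependencies for the developers and DB admins to confirm.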
On 12/17/19 10:14 PM, Patrick Gavin wrote:
I’m in the process of reworking my predictive monitoring check to run under check-mk/omd. It was originally written for plain old nagios/pnp4nagios. It’s not currently ready for release, but you can get the gist of it here: [https://github.com/wezelboy/check_predicted](https://github.com/wezelboy/check_predicted)
The root cause analysis stuff is just an idea. I don’t really have time to implement it.