Artificial Intelligence (AI) Anomaly Detection and Correlation

Hello all,

Since there was nothing in the Roadmap about this, I figured I’d probably get more engagement by approaching it as a plugin :jigsaw: possibility.

I was wondering whether anyone has started looking into using AI for some of the analysis Checkmk currently does, to get more intelligent anomaly detection and correlation :thinking:

I’ve seen Dynatrace’s results, and they’re pretty impressive :muscle: Of course, there’s just something about that product that rubs me the wrong way, but I can definitely see how it’s beneficial to have so many data sources being intelligently analyzed in the aggregate.

With the popularity of the Hugging Face NLP Course and LangChain (both Python libraries, BTW), it seems like the time is ripe to experiment with integrations. :brain: But being a lowly sysadmin, I don’t have the developer mindset or tooling to put a robust solution together… but I can hack the heck out of a Python script.

Has anyone worked on something like dynamically adjusting WARN and CRIT thresholds? :thinking: Or exposing service metrics? Or working with TCP connection data? I figure those areas are going to be at least some of the fundamental building blocks of how any AI would gather essential information about the functioning of the whole environment.

Anyways, sorry for not being able to provide anything concrete :person_shrugging: I was hoping that maybe thinking aloud could be of use to someone other than myself.

1 Like

hey @acziryak

great discussion starter! A lot to unpack here.

Checkmk has been able to do this for years (since before it was cool to slap ‘AI’ on everything that goes beyond simple linear regression :stuck_out_tongue_winking_eye:). See more here: Predictive monitoring
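
To make the idea concrete, here is a rough sketch of the concept behind predictive levels (just the concept, not Checkmk’s actual implementation, which is configured via the ‘Predictive Levels’ ruleset; the `history` structure mapping a weekday/hour slot to past samples is hypothetical):

```python
# Conceptual sketch of predictive levels -- NOT Checkmk's actual code.
# Assumption: `history` maps a (weekday, hour) slot to past metric samples.
from statistics import mean, stdev

def predicted_levels(history, weekday, hour, warn_sigma=2.0, crit_sigma=3.0):
    """Derive WARN/CRIT from samples seen in the same time slot in the past."""
    samples = history[(weekday, hour)]
    reference = mean(samples)          # the "predicted" value for this slot
    spread = stdev(samples)
    return reference + warn_sigma * spread, reference + crit_sigma * spread

# Toy usage: CPU load samples from previous Mondays at 09:00
history = {(0, 9): [1.2, 1.4, 1.1, 1.6, 1.3]}
warn, crit = predicted_levels(history, weekday=0, hour=9)
print(f"WARN at {warn:.2f}, CRIT at {crit:.2f}")
```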

As for event correlation and anomaly detection: There is work going into this in different corners of the community. Of course, we do some R&D there, but some of our partners and customers are also doing some fancy stuff. Take a look at this presentation from our partner Comnet: https://www.youtube.com/watch?v=FUNsKH4Pla4 (@fabian.binder )

And last but not least, there are integrations with ‘AIOps’ tools like Moogsoft that are being built by people like @pauloadriano

So lots going on here!

5 Likes

Great stuff! :slight_smile: And I super appreciate the link to the talk recording. That promises to be some good watching.

1 Like

@acziryak

Be aware that there’s no magic in that area. I believe that working towards a solution that consolidates current alerts to help identify the root cause, or even groups related alerts along a timeline, would help a lot.
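
As a toy illustration of what I mean by grouping alerts along a timeline (the alert tuples are hypothetical, not any real Checkmk or Moogsoft API):

```python
# Toy time-window consolidation -- hypothetical data, not a real integration.
from datetime import datetime, timedelta

def consolidate(alerts, window=timedelta(minutes=5)):
    """Group alerts whose timestamps fall within `window` of the previous one."""
    incidents, current = [], []
    for ts, host, text in sorted(alerts):
        if current and ts - current[-1][0] > window:
            incidents.append(current)
            current = []
        current.append((ts, host, text))
    if current:
        incidents.append(current)
    return incidents

alerts = [
    (datetime(2023, 8, 1, 10, 0), "db01", "CPU load CRIT"),
    (datetime(2023, 8, 1, 10, 2), "app01", "Response time WARN"),
    (datetime(2023, 8, 1, 14, 30), "dns01", "Service down"),
]
for i, incident in enumerate(consolidate(alerts), 1):
    print(f"Incident {i}: {len(incident)} alert(s)")
```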

From my perspective, I would prefer to start by monitoring IT services (Application A, ERP B, DHCP, DNS, etc.) using the Business Intelligence module and generating incidents based on that. Usually people don’t care that a server is at 100% CPU utilization, or even has high memory utilization, until the application starts becoming slow, and that is what needs to be monitored and correlated with a root cause.

Moogsoft can assist in consolidating alerts from various sources into a single incident.

Now, about your questions:

“Has anyone worked on something like dynamically adjusting WARN and CRIT thresholds?”

Adjusting WARN and CRIT thresholds can’t be done automatically, BUT:

  • You can define thresholds based on labels, so you can have different values for production and test servers, for instance.
  • You can define thresholds based on time-specific parameters, so you can apply different thresholds according to business hours.
  • For CPU and memory you can use averaged values (not supported for all host types); in that case you will not get alerts for peak usage, but you will get alerts for constantly high utilization.
  • For filesystems you can enable the magic factor, which is a huge thing that I highly recommend using. Based on the size of your filesystem, Checkmk calculates different WARN and CRIT values: a 10 TB filesystem might need 98% utilization to reach CRIT, while a 100 GB filesystem reaches CRIT at around 91%. That calculation is done automatically from the filesystem size (see the sketch right after this list).
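
To make the magic factor a bit less magic, here is my rough reconstruction of the formula from Checkmk’s filesystem check (treat the defaults I use, factor 0.8 and reference size 20 GB, as assumptions to verify against your version):

```python
# Rough reconstruction of Checkmk's "magic factor" level adjustment.
# Assumptions: magic factor 0.8 and reference size 20 GB (the documented
# defaults); verify against your Checkmk version before relying on this.
def adjusted_level(level_pct, size_gb, magic=0.8, normsize_gb=20.0):
    """Scale a percentage threshold with the filesystem size."""
    hgb = size_gb / normsize_gb       # size relative to the reference size
    felt = hgb ** magic               # "felt" size after applying the factor
    scale = felt / hgb                # shrinks toward 0 for huge filesystems
    return 100 - (100 - level_pct) * scale

for size in (100, 1024, 10240):      # 100 GB, 1 TB, 10 TB
    print(f"{size:>6} GB -> CRIT at {adjusted_level(90, size):.1f}%")
```

The exact percentages depend on the base levels and the factor you configure, so take the printed numbers as illustrative.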

“Or exposing service metrics?”

You can export your data to InfluxDB or Graphite and then start your analysis.
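
For example, once the metrics are in InfluxDB 2.x, a first analysis pass could look like this (a sketch using the official influxdb-client package; URL, token, org, bucket, and measurement names are all placeholders):

```python
# Sketch: pull exported Checkmk metrics out of InfluxDB 2.x for analysis.
# All names (URL, token, org, bucket, measurement, host) are placeholders.
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="MY_TOKEN", org="my-org")
query = '''
from(bucket: "checkmk")
  |> range(start: -7d)
  |> filter(fn: (r) => r._measurement == "cpu_load" and r.host == "db01")
'''
tables = client.query_api().query(query)
samples = [record.get_value() for table in tables for record in table.records]
print(f"{len(samples)} samples; max={max(samples):.2f}" if samples else "no data")
client.close()
```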

“Or working with TCP connection data?”

That part is not clear to me, but you can also query data using the Livestatus API or the REST API (recommended).
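
A minimal Livestatus sketch, assuming a local OMD site called mysite (adjust the socket path for your site):

```python
# Minimal Livestatus query over the site's Unix socket.
# Assumption: local site "mysite"; the socket path follows the OMD layout.
import socket

SOCKET_PATH = "/omd/sites/mysite/tmp/run/live"
QUERY = "GET services\nColumns: host_name description state\nFilter: state > 0\n\n"

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect(SOCKET_PATH)
    s.sendall(QUERY.encode())
    s.shutdown(socket.SHUT_WR)          # signal that the query is complete
    data = b"".join(iter(lambda: s.recv(4096), b""))

for line in data.decode().splitlines():
    print(line)                         # host;service;state, one per line
```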

About Dynatrace: I tested it, but it’s a black box. You can use automatic thresholds or static thresholds. With automatic thresholds, you just enable them and there’s nothing more to do; if a server is more sensitive, you can’t tune that.

Please share updates about this project, wishing you the best!