Demystifying AI in Zabbix: Can AI Really Correlate Events?

Good morning, everyone! Dimitri Bellini here, back with you on Quadrata, my YouTube channel dedicated to the open-source world and the IT topics I'm passionate about. This week, I wanted to tackle a question that I, and many members of the Zabbix community, get asked all the time: Why doesn't Zabbix have more built-in AI?

It seems like every monitoring product out there is touting its AI capabilities, promising to solve all your problems with a touch of magic. But is it all hype? My colleagues and I have been digging deep into this, exploring whether an AI engine can truly correlate events within Zabbix and make our lives easier. This blog post, based on my recent video, will walk you through our thought process.

The AI Conundrum: Monitoring Tools and Artificial Intelligence

Let's be honest: integrating AI into a monitoring tool isn't a walk in the park. It requires time, patience, and a willingness to experiment with different technologies. More importantly, it demands a good dose of introspection to understand how all the pieces of your monitoring setup fit together. But why even bother?

Anyone who's managed a complex IT environment knows the struggle. You can be bombarded with hundreds, even thousands, of alerts every single day. Identifying the root cause and prioritizing issues becomes a monumental task, even for seasoned experts. Severity levels help, but they often fall short.

Understanding the Challenges

Zabbix gives us a wealth of metrics – CPU usage, memory consumption, disk space, and more. We typically use these to create triggers and set alarm thresholds. However, these metrics, on their own, often don't provide enough context when a problem arises. Here are some key challenges we face:

  - Alert volume: hundreds or even thousands of events per day make manual triage impractical.
  - Lack of context: a single metric crossing a threshold says little about what else is affected.
  - Inconsistent timing: update intervals and trigger configurations mean related events don't always fire close together.
  - Cause vs. symptom: severity levels alone don't tell you which alert is the root cause and which are consequences.

Our Goals: Event Correlation and Noise Reduction

Our primary goal is to improve event correlation using AI. We want to:

  - Correlate related events so that one underlying issue produces one actionable incident instead of many alerts.
  - Reduce noise by grouping symptoms together with their probable cause.
  - Help operators prioritize by surfacing the most likely root cause.

Possible AI Solutions for Zabbix

So, what tools can we leverage? Here are the solutions we considered:

  - Classical machine learning with scikit-learn, clustering events by similarity.
  - Large Language Models (LLMs), reasoning over event descriptions.
  - Existing Zabbix features, used more deliberately, as a complement to both.

Leveraging scikit-learn

One promising solution we explored involves using scikit-learn, an open-source machine learning library. Our proposed pipeline looks like this:





  1. Event Processing: Collect events from our Zabbix server using streaming capabilities.


  2. Encoding Events: Use machine learning techniques to vectorize and transform events into a usable format.


  3. Cluster Creation: Apply algorithms like DBSCAN to create clusters of related events (e.g., network problems, operating system problems).


  4. Merging Clusters: Merge clusters based on identified correlations.
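The encoding and clustering steps (2 and 3) can be sketched as follows. This is a minimal illustration, not our production pipeline: the event names, `eps`, and `min_samples` values below are made-up assumptions that would need tuning against real Zabbix data.

```python
# Minimal sketch: vectorize Zabbix problem names with TF-IDF, then cluster
# them with DBSCAN. Sample events and parameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

events = [
    "Interface eth0 down on router-01",
    "Host host-b unreachable by ICMP ping",
    "Interface eth1 down on router-01",
    "High CPU utilization on db-01",
    "High CPU utilization on app-01",
]

# Encode: turn each problem name into a sparse TF-IDF vector.
vectors = TfidfVectorizer().fit_transform(events)

# Cluster: DBSCAN with cosine distance; eps and min_samples need tuning
# per environment. Events that fit no cluster get the noise label -1.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(vectors)

for event, label in zip(events, labels):
    print(label, event)
```

With these toy inputs, the two interface alarms group together, the two CPU alarms group together, and the unreachable host ends up as noise – which already hints at the limitation discussed later: lexical similarity alone doesn't connect a downed router to an unreachable host.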

A Simple Example

Imagine a scenario where a router interface goes down and host B becomes unreachable. It's highly likely that the router issue is the root cause, and host B's unreachability is a consequence.

Implementation Steps

To implement this solution, we suggest a phased approach:





  1. Temporal Regrouping: Start by grouping events based on their timing.


  2. Host and Group Context: Add context by incorporating host and host group information.


  3. Semantic Analysis: Include semantic analysis of problem names to identify connections.


  4. Tagging: Enrich events with tags to define roles and provide additional information.


  5. Iterative Feedback: Gather feedback from users to fine-tune the system and improve its accuracy.


  6. Scaling Considerations: Optimize data ingestion and temporal window size based on Zabbix load.

Improvements Using Existing Zabbix Features

We can also leverage existing Zabbix features:

  - Tags: enrich events with tags (service, role, environment) to give any correlation logic more context to work with.
  - Trigger dependencies: natively suppress downstream alerts, such as hosts behind a failed router.
  - Event correlation rules: Zabbix's built-in, tag-based correlation can close symptom events when the cause is resolved.

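As one concrete example of leaning on Zabbix itself, current problems and their tags can be pulled over the JSON-RPC API and fed into any correlation logic. This is a hedged sketch using only the standard library; the URL and token are placeholders, and authentication details vary between Zabbix versions.

```python
# Sketch: fetch current problems with their tags via the Zabbix JSON-RPC API.
# ZABBIX_URL and API_TOKEN are placeholders, not real credentials.
import json
import urllib.request

ZABBIX_URL = "http://zabbix.example.com/api_jsonrpc.php"  # placeholder
API_TOKEN = "..."  # placeholder

request_body = {
    "jsonrpc": "2.0",
    "method": "problem.get",
    "params": {"output": "extend", "selectTags": "extend", "recent": True},
    "auth": API_TOKEN,
    "id": 1,
}

def fetch_problems():
    """POST the problem.get request and return the list of problems."""
    req = urllib.request.Request(
        ZABBIX_URL,
        data=json.dumps(request_body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]
```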
The Reality Check: It's Not So Simple

While the theory sounds great, real-world testing revealed significant challenges. The timing of events in Zabbix can be inconsistent due to update intervals and threshold configurations. This can create temporary discrepancies and make accurate correlation difficult.

Consider this scenario:

  - A file system on a database server fills up.
  - The database instance on that server fails.
  - The CRM application that depends on the database goes down.

A human might intuitively understand that a full file system could cause a database instance to fail, which in turn could bring down a CRM application. However, a machine learning algorithm might struggle to make these connections without additional context.
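This gap is easy to demonstrate. The three problem names below (illustrative, not from a real system) share almost no vocabulary, so their pairwise TF-IDF cosine similarity is near zero – a text-based clusterer sees three unrelated events where a human sees one causal chain.

```python
# Illustrating why lexical clustering misses causal chains: these problem
# names share almost no tokens, so their similarity is (near) zero.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chain = [
    "Filesystem /var/lib/pgsql is over 95% full",
    "PostgreSQL instance is not running",
    "CRM application returns HTTP 503",
]

sim = cosine_similarity(TfidfVectorizer().fit_transform(chain))
print(sim.round(2))  # off-diagonal entries are near zero
```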

Exploring Large Language Models (LLMs)

To address these limitations, we explored using Large Language Models (LLMs). LLMs have the potential to understand event descriptions and make connections based on their inherent knowledge. For example, an LLM might know that a CRM system typically relies on a database, which in turn requires a file system.

However, even with LLMs, challenges remain. Identifying the root cause versus the symptoms can be tricky, and LLMs might not always accurately correlate events. Additionally, using high-end LLMs in the cloud can be expensive, while local models might not provide sufficient accuracy.
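To make the LLM idea concrete, here is a hedged sketch of how a batch of problems could be turned into a root-cause prompt. The model name is a placeholder and no request is actually sent; the payload simply follows the common chat-style message format.

```python
# Hypothetical sketch: build a root-cause triage prompt from Zabbix problems.
# "local-model" is a placeholder; nothing is sent to any API here.
import json

problems = [
    "Filesystem /var/lib/pgsql is over 95% full on db-01",
    "PostgreSQL instance is not running on db-01",
    "CRM application returns HTTP 503 on app-01",
]

prompt = (
    "You are assisting with IT monitoring triage.\n"
    "Given these Zabbix problems, group the ones that are causally related\n"
    "and name the single most likely root cause:\n"
    + "\n".join(f"- {p}" for p in problems)
)

payload = {
    "model": "local-model",  # placeholder: local or hosted LLM
    "messages": [{"role": "user", "content": prompt}],
}
print(json.dumps(payload, indent=2))
```

Whether the answer that comes back is reliable – and affordable at Zabbix event volumes – is exactly the open question discussed above.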

Conclusion: The Complex Reality of AI in Monitoring

In conclusion, integrating AI into Zabbix for event correlation is a complex challenge. A one-size-fits-all solution is unlikely to be effective. Tailoring the solution to the specific needs of each client is crucial. While LLMs offer promise, the cost and complexity of using them effectively remain significant concerns.

We're continuing to explore this topic and welcome your thoughts and ideas!

Let's Discuss!

What are your thoughts on using AI in monitoring? Have you had any success with similar approaches? Share your insights in the comments below or join the conversation on the ZabbixItalia Telegram Channel! Let's collaborate and find new directions for our reasoning.

Thanks for watching! See you next week!

Bye from Dimitri!

Watch the original video: Quadrata Youtube Channel