Revolutionizing Zabbix Monitoring with AI: Introducing Zabbix MCP Next
Hello everyone, Dimitri Bellini here! If you work in IT monitoring, you already know the daily struggle. Our consoles are absolutely packed with information. While having data is great, making sense of it when your infrastructure generates hundreds of alerts every minute is a completely different beast.
Yes, the APIs are there to help, but assembling that raw data into something meaningful is difficult. You have to connect the dots yourself to build the insights you actually care about. Today, I want to share my vision for bridging the gap between Artificial Intelligence and Zabbix, and show you exactly how we can turn monitoring chaos into absolute clarity.
The Problem: Why the "Naive" AI Approach Fails
We’ve talked before about using Large Language Models (LLMs) with the Model Context Protocol (MCP). The standard approach is to simply provide the AI with the methods to pull data directly from your tools. Since Zabbix 7.0 exposes a fantastic and comprehensive API, this sounds perfect on paper, right?
In reality, it introduces a massive level of complexity. The Zabbix API is powerful, but a single call can require 20 to 30 parameters. When you feed this raw JSON data directly to an AI, you run into several critical roadblocks:
- Data Overload & Noise: The AI receives a massive wall of unstructured background noise.
- Context Limits: LLMs have a specific context window. If you flood it with raw JSON, you leave no room for actual analysis or follow-up questions.
- High Costs: If you use a cloud-based AI, you pay per token. Sending massive Zabbix payloads will drain your budget incredibly fast.
- Hallucinations: You are essentially asking the AI to be both a Zabbix specialist and a Business Intelligence (BI) analyst. When overwhelmed with data, the AI will start to wander and hallucinate facts.
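To make that parameter sprawl concrete, here is a minimal sketch of what a single `problem.get` call can look like once you add the filters a real question requires. The parameter names follow the public Zabbix JSON-RPC API, but the filter values and timestamp are made up for illustration:

```python
import json

# One Zabbix JSON-RPC call quickly accumulates parameters.
# Authentication (the API token header) is omitted for brevity.
request = {
    "jsonrpc": "2.0",
    "method": "problem.get",
    "params": {
        "output": "extend",
        "selectAcknowledges": "extend",
        "selectTags": "extend",
        "selectSuppressionData": "extend",
        "severities": [3, 4, 5],   # average and above
        "acknowledged": False,
        "suppressed": False,
        "recent": True,
        "time_from": 1713300000,   # hypothetical Unix timestamp
        "sortfield": ["eventid"],
        "sortorder": "DESC",
        "limit": 100,
    },
    "id": 1,
}

print(f"{len(request['params'])} parameters in a single call")
print(json.dumps(request, indent=2)[:120], "...")
```

And this is still a modest query: the "extend" outputs alone can balloon the response into thousands of JSON fields that the model has to wade through.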
The Solution: The Intelligence Layer (Zabbix MCP Next)
To solve this, I developed a custom backend solution I call Zabbix MCP Next (or Zabbix-mcp-nxt). Instead of letting the AI blindly query the Zabbix API, this backend acts as an "Intelligence Layer" or a project manager.
When the AI needs information, it talks to Zabbix MCP Next. This backend contains pre-configured logic designed by human experts. It optimizes the research, heavily filters the background noise, and only sends the exact subset of data the AI actually needs. This drastically reduces token usage, speeds up inference, and eliminates hallucinations.
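As an illustration of the idea (not the actual Zabbix MCP Next code), a tool exposed to the model can pre-digest the raw API response and hand back only a compact, token-cheap summary. Everything here, from the function name to the field selection, is a hypothetical sketch:

```python
from datetime import datetime, timezone

SEVERITY_NAMES = {0: "not classified", 1: "info", 2: "warning",
                  3: "average", 4: "high", 5: "disaster"}

def summarize_problems(raw_problems, min_severity=3, limit=10):
    """Condense a raw problem.get response into short one-line summaries.

    Keeps only problems at or above min_severity, newest first, and
    drops every field the model does not need for its analysis."""
    kept = [p for p in raw_problems if int(p["severity"]) >= min_severity]
    kept.sort(key=lambda p: int(p["clock"]), reverse=True)
    lines = []
    for p in kept[:limit]:
        started = datetime.fromtimestamp(int(p["clock"]), tz=timezone.utc)
        age_days = (datetime.now(tz=timezone.utc) - started).days
        lines.append(f"[{SEVERITY_NAMES[int(p['severity'])]}] "
                     f"{p['name']} (active {age_days}d)")
    return "\n".join(lines)

# Tiny fake payload standing in for a real API response.
sample = [
    {"name": "Interface eth0 down", "severity": "4", "clock": "1700000000"},
    {"name": "Free disk space < 20%", "severity": "2", "clock": "1700100000"},
]
print(summarize_problems(sample))
```

The model never sees the low-severity noise or the dozens of raw JSON fields, only a handful of lines it can actually reason about.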
My Local AI Tech Stack: Privacy, Speed, and Performance
To prove this works without racking up massive cloud bills, I deployed the entire solution locally. A huge shoutout and thank you to Raffaele for setting up the host machine in record time!
Here is the stack we used to keep our data secure and entirely on-premises:
- GPU Acceleration: NVIDIA RTX 8000
- LLM: Qwen 3.5
- Backend Engine: vLLM. (Quick tip: Ollama is great for playing around, but if you want to do something serious and super-optimized in production, vLLM is the engine you need.)
- Frontend Interface: Open WebUI
- Monitoring Platform: Zabbix 7.0
By running Qwen 3.5 locally via vLLM, we bypass connectivity limits, ensure complete data privacy, and avoid the per-token costs of cloud providers.
Real-World Examples: Seeing Zabbix MCP Next in Action
Theory is great, but what are the actual results? I tested this on both a demo environment and a production machine. Here is what the AI was able to accomplish.
1. Summarizing Critical Infrastructure Chaos
I simply asked the AI, "Give me the critical problems of the day." Instead of a messy list, it reorganized everything to focus my attention. It provided a clean summary noting that a specific network interface had been down for 64 days, a Windows machine had been unreachable for 115 days, and a datacenter temperature sensor was in an alarm state. Out of 18 active problems, it highlighted the most important ones and provided actionable recommendations, even suggesting which project owner to contact!
2. Deep Dive into Specific Hosts
I asked for information on a host named "UniFi". Using MCP tool calling, the AI intelligently triggered host.get, problem.get, and history.get. Because of our optimized backend, it only pulled the data for that specific host. The entire chat used about 14,000 tokens, well within my local 32,000-token limit, giving me a perfect, concise health check.
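To see why the filtering matters for that token budget, a very rough rule of thumb is about four characters of JSON per token (real tokenizers vary, and the payloads below are invented). Compare an unfiltered history dump with a pre-digested summary:

```python
import json

def rough_tokens(payload) -> int:
    """Very rough token estimate: ~4 characters per token for JSON text.
    Real tokenizers differ; this is only for budgeting intuition."""
    return len(json.dumps(payload)) // 4

# Imagine 500 raw history.get rows versus a ten-line pre-digested summary.
raw = [{"itemid": "44444", "clock": 1700000000 + i * 60,
        "value": "0.012", "ns": "0"} for i in range(500)]
summary = ["icmp ping avg 12ms, stable over the last 24h"] * 10

print("raw history:", rough_tokens(raw), "tokens")
print("summary:   ", rough_tokens(summary), "tokens")
```

Even this toy comparison shows an order-of-magnitude difference, which is exactly the headroom that lets a whole multi-turn chat stay inside a 32,000-token context.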
3. Performance Troubleshooting (Zabbix Server & PostgreSQL)
I asked the AI to inspect the Zabbix Server host for anomalies over the last 7 days. It didn't just guess; it checked the assigned templates, identified the relevant metrics (CPU, memory, ICMP ping), and scanned the history. It successfully identified a specific ICMP ping spike on April 17th and correlated it with the Zabbix Housekeeper process cleaning up the database!
I did the same for a PostgreSQL database. The AI extracted active connections, query backends, transactions, and background writer metrics to build a comprehensive, DBA-level report. Reading database graphs and identifying bottlenecks isn't a trivial operation, but the AI handled it beautifully.
Conclusion
While generating these deep analytical responses takes a few minutes, the value is undeniable. Imagine scheduling this system to automatically generate and send a comprehensive, easy-to-read infrastructure status report to your management team every six hours. It truly changes the way we interact with monitoring data.
I plan to carry this idea forward and potentially bring it to production for a few customers. It is a massive step up from the complex, manual dashboards we are used to.
What do you think of this approach to using AI in the world of IT monitoring? Let me know your thoughts in the comments below—your opinion is incredibly important to me!
If you found this content interesting, please subscribe to my channel and join our community:
- 📺 YouTube: Quadrata
- 💬 Telegram: ZabbixItalia
That's all for today. A big greeting from Dimitri, bye everyone!