Overcoming 'Alert Fatigue' With AIOps
“Alert fatigue” is a real thing for IT admins. You need to monitor your environment at all times, watching for current—as well as potential—issues. But too many alerts can quickly turn into a “boy who cried wolf” scenario: so many alerts come in that you start to tune them out.
At that point, you may raise your thresholds to filter out alerts and cut down on “false positives.” But that comes with huge risks: you may miss truly crucial alerts, which could harm your data center, not to mention your career. Finding the happy medium between alert fatigue and missed notifications is a bit of an art form.
Add to that the fact that monitoring has gotten more complicated in the cloud era. If you’ve moved anything into the cloud, whether it’s data, infrastructure, or applications, you have a whole other realm to monitor. Most companies these days are hybrid, with workloads in both on-premises and cloud environments. That’s another layer of responsibility, and another set of alerts to keep you busy. It can seem almost impossible to keep up.
That’s why AIOps is becoming a thing. Gartner defines AIOps—Artificial Intelligence for IT Operations—this way:
AIOps refers to multi-layered technology platforms that automate and enhance IT operations by 1) using analytics and machine learning to analyze big data collected from various IT operations tools and devices, in order to 2) automatically spot and react to issues in real time.
Harnessing the power of AI is crucial, as humans just can’t process the information fast enough. AI in this use case separates signal from noise, making alerts more insightful and actionable while ensuring you’re not flooded with so many notifications that you eventually turn your phone off.
LogicMonitor is a well-known SaaS-based monitoring platform that works with both on-prem and hybrid setups. The company has incorporated AI into its product, and earlier this week added a new feature that takes monitoring to the next level.
It’s called “anomaly detection,” and it uses historical analysis to spot potential trouble spots. LogicMonitor described how it works in a press release:
“Anomaly detection allows customers to see deviations that occur for monitored resources and compare these anomalies to key historical signals. This additional layer of intelligence helps customers better understand the expected performance of their monitored resources and identify when performance is deviating from expectations.”
Gadi Oren, VP of Products at LogicMonitor, told me recently that anomaly detection helps customers better understand what monitored behaviors are normal or expected, after establishing a baseline. What’s typical behavior, for example, for your database? For your ecommerce site?
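To make the baseline idea concrete, here is a minimal sketch in Python. It is not LogicMonitor's algorithm, which the press release does not describe; it assumes a simple trailing-window z-score, and the function name, window size, threshold, and sample latencies are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_threshold=3.0):
    """Return (index, value) pairs that deviate sharply from a trailing baseline.

    Assumption: "normal" is the mean of the previous `window` samples, and a
    point is anomalous when it sits more than `z_threshold` standard
    deviations away from that mean.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append((i, samples[i]))
            continue
        if abs(samples[i] - mu) / sigma > z_threshold:
            anomalies.append((i, samples[i]))
    return anomalies


# Hypothetical database latencies in milliseconds, with one spike.
latencies_ms = [20, 22, 21, 19, 20, 21, 95, 22, 20]
print(find_anomalies(latencies_ms))  # [(6, 95)]
```

Unlike a fixed threshold, the cutoff here moves with recent behavior, so a resource that normally runs hot does not page anyone just for being itself. A production system would likely also account for daily or weekly seasonality, which this sketch ignores.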
“AIOps anomaly detection is about a bigger point of view,” Oren said. It helps explain “much better what’s happening when something breaks: why it could be happening and how it could be happening.”
A key benefit of anomaly detection for admins is quick identification of the problem. Rather than spending hours tracking down the source of an issue, they get pointed to it by the AI much faster than they could find it themselves. They can then spend their time solving the problem. That’s AIOps in action.
LogicMonitor works at both the physical and logical layers, so it can monitor your hardware switches as well as your VMware (or other) virtual machines, for instance. This is part of its “dynamic topology mapping” process, which discovers the relationships between devices in an environment and provides the context needed for anomaly detection.
That topology, for example, could be a cable that connects a server with a switch, which is connected by another cable to a firewall. The topology can represent any kind of dependency, and each one is expressed as a physical connection.
It’s interesting to put the power of AI to use for something as seemingly mundane as infrastructure monitoring. We tend to think of AI being used for cutting-edge stuff like augmented and virtual reality, robotics, and other high-end applications. But making life easier for an admin is just as important.