IT Alerts: From Operational Trenches to AIOps

Patrick Campbell

Most enterprises are undergoing rapid digital transformation to keep up with their competition, embracing new ways to deliver value at greater speed and lower cost. For example, retail stores and banks are trading brick-and-mortar locations for online digital experiences that delight customers with convenience and with customized services delivered faster.

Behind the scenes, IT operations supports infrastructure and apps in these fast-paced digital environments. IT at its core is more relevant now than ever—likely defining the success of business in today’s digital economy.

Delivering performance and availability for the latest digital experience initiatives requires IT operations to take a strategic approach, one that helps find and fix issues and prevent costly downtime or slowdowns better than ever.

Operational trenches for IT

Day to day, IT deals with hundreds or thousands of incident tickets, often associated with alerts about the performance of apps and infrastructure in the environment. If a ticket were generated for every alert, and IT operations had to address everything related to each one, the volume of alerting information would easily overwhelm both IT operations and IT service management.

Here are some common issues that occur in the operational trenches:

False positive alerts from static thresholds: Static thresholds are set for key metrics using a best-practice approach, but in some cases they are difficult to set accurately for every scenario. For example, a memory consumption threshold applied across a set of servers could trigger an alert when memory spikes only temporarily, or for a server that can handle higher memory consumption without issue.

Unprioritized and duplicate events from the same source: Many enterprises run multiple monitoring solutions, or receive multiple related alerts for the same issue. Either can create an event storm that ties up resources when a single notification and incident ticket would be more practical.

Alerting information not routed to the teams who can fix issues: Once an issue is determined to be critical, getting it to the right team, based on who is responsible for the infrastructure, app, or service, can be another nightmare if the alert and incident ticket describe only the issue and give no indication of which team can address it most effectively.
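The static-threshold problem in the first item above can be illustrated with a small sketch. It is not taken from any monitoring product: the function names, the 80% static limit, the three-sigma baseline, and the sample memory readings are all assumptions chosen for illustration. It compares a fixed threshold with a per-server baseline that also requires the breach to be sustained.

```python
from statistics import mean, stdev

def static_alerts(samples, threshold=80.0):
    """Indices of every sample above a fixed, one-size-fits-all threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]

def baseline_threshold(history, sigmas=3.0):
    """Per-server limit derived from that server's own normal behavior."""
    return mean(history) + sigmas * stdev(history)

def sustained_alerts(samples, threshold, sustain=3):
    """Alert only when `sustain` consecutive samples exceed the threshold."""
    alerts, streak = [], 0
    for i, value in enumerate(samples):
        streak = streak + 1 if value > threshold else 0
        if streak == sustain:
            alerts.append(i)
    return alerts

# Hypothetical memory readings (%) for a server that normally runs hot.
history = [84, 86, 85, 87, 85, 86, 84, 85]
current = [85, 86, 95, 85, 86, 97, 98, 99, 85]

noisy = static_alerts(current)                                # fires on every sample
quiet = sustained_alerts(current, baseline_threshold(history))  # fires once, at index 7
```

With these readings, the static threshold flags all nine samples, while the baseline approach ignores the one-sample spike and raises a single alert for the sustained climb.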

Top three best practices for operational success

Reduce noise

Keep the number of tickets from alerts and event management systems from getting out of hand by reducing the amount of irrelevant information that ties up resources unnecessarily.

Here are some ways to achieve this goal:

  • Aggregate storms of identical events: when an alarm is first triggered, create one event, then update a counter for each repeat instead of having each alarm trigger another event in the system
  • Use identifiers to correlate related events into one single event that provides all the relevant information to address the issue
  • Plan “blackout” periods for maintenance windows, and suppress events that are no longer relevant
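A minimal sketch of the aggregation and blackout ideas above, assuming events can be identified by a (source, check) pair. The class names, fields, host names, and dates are hypothetical and do not come from any particular event management tool.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AggregatedEvent:
    source: str
    check: str
    first_seen: datetime
    last_seen: datetime
    count: int = 1

class EventAggregator:
    def __init__(self, blackouts=None):
        # Each blackout is a (source, start, end) maintenance window.
        self.blackouts = blackouts or []
        self.events = {}

    def in_blackout(self, source, when):
        return any(src == source and start <= when < end
                   for src, start, end in self.blackouts)

    def ingest(self, source, check, when):
        """Create one event per (source, check); repeats only bump the counter."""
        if self.in_blackout(source, when):
            return None  # suppressed during planned maintenance
        key = (source, check)
        event = self.events.get(key)
        if event is None:
            event = AggregatedEvent(source, check, when, when)
            self.events[key] = event
        else:
            event.count += 1
            event.last_seen = when
        return event

# Hypothetical storm: five identical alarms, plus one alarm under maintenance.
maintenance = [("db01", datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 4, 0))]
aggregator = EventAggregator(maintenance)
for minute in range(5):
    aggregator.ingest("web01", "cpu_high", datetime(2024, 1, 1, 3, minute))
suppressed = aggregator.ingest("db01", "disk_full", datetime(2024, 1, 1, 3, 0))
```

Six alarms arrive, but the system holds a single event for web01 with a count of 5, and the db01 alarm raised inside its maintenance window is dropped.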

Simplify workflows

Most enterprises manage IT with a range of vendor tools accumulated from legacy monitoring, acquisitions, and their own innovation. As a result, IT operations might have a slew of event management options that address issues differently and integrate differently with the environment.

Here are some ways to simplify workflows in this environment:

  • Have visibility of events and data from multiple sources in a single console instead of having to log in to multiple consoles from different vendor tools
  • Associate service models from configuration management database (CMDB) information with your event management system to add service impact to events associated with the configuration items (CIs), which helps prioritize events and builds more intelligence into them
  • Use dynamic grouping for events for specific roles so that teams responsible for addressing the events in the system have access to relevant views instead of having to sort through a bunch of events not related to their responsibility
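The CMDB enrichment and role-based grouping described above could look roughly like the following. This is a sketch under stated assumptions: the CMDB is reduced to a dictionary, and the CI names, services, and the criticality scale (1 is most critical) are invented for the example.

```python
# Hypothetical CMDB extract: configuration item -> service and criticality.
CMDB = {
    "web01": {"service": "online-banking", "criticality": 1},
    "batch07": {"service": "nightly-reports", "criticality": 3},
}

def enrich(event, cmdb=CMDB):
    """Attach service impact from the CMDB; unknown CIs get the lowest priority."""
    ci = cmdb.get(event["ci"], {})
    return {
        **event,
        "service": ci.get("service", "unknown"),
        "criticality": ci.get("criticality", 4),
    }

def prioritize(events, cmdb=CMDB):
    """Order by service criticality first, then by event severity (higher first)."""
    enriched = [enrich(event, cmdb) for event in events]
    return sorted(enriched, key=lambda e: (e["criticality"], -e["severity"]))

def view_for(events, services):
    """Role-based view: only the events for services a team is responsible for."""
    return [e for e in events if e["service"] in services]

raw_events = [
    {"ci": "batch07", "severity": 5},
    {"ci": "web01", "severity": 3},
    {"ci": "lab99", "severity": 5},
]
ranked = prioritize(raw_events)
banking_view = view_for(ranked, {"online-banking"})
```

Here the lower-severity event on web01 is ranked first because it affects the most critical service, and the banking team's view contains only that event.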

Integrate with ticketing

Integrating IT operations with IT service management for the generation and assignment of tickets can greatly reduce the mean time to repair (MTTR) for issues in any digital environment.

Here are some examples of integration:

  • Convert known actionable events for device availability and performance to service desk incidents
  • Route specific event and incident data to the responsible teams who can address issues quickly, and identify patterns over time so that you can configure automated remediation instead of addressing issues manually
  • Automatically update both related events and service desk incidents with remediation status so that when these events occur the end-to-end flow from event to closure is more efficient
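As an illustration of the integration points above, the sketch below converts known actionable events into incidents, routes them by category, and keeps the event and incident status in sync. The event types, routing table, team names, and the `ServiceDesk` class are all hypothetical stand-ins, not the API of any ticketing product.

```python
from dataclasses import dataclass
from itertools import count

ACTIONABLE = {"device_down", "cpu_high"}  # assumed list of known actionable events
ROUTING = {"network": "network-ops", "app": "app-support"}  # category -> team

@dataclass
class Incident:
    id: int
    event_id: str
    team: str
    status: str = "open"

class ServiceDesk:
    def __init__(self):
        self._ids = count(1)
        self.incidents = {}
        self._events = {}  # incident id -> originating event

    def open_from_event(self, event):
        """Create an incident only for actionable events, routed by category."""
        if event["type"] not in ACTIONABLE:
            return None
        team = ROUTING.get(event["category"], "service-desk")
        incident = Incident(next(self._ids), event["id"], team)
        self.incidents[incident.id] = incident
        self._events[incident.id] = event
        event["status"] = "ticketed"
        return incident

    def resolve(self, incident_id):
        """Update the incident and its related event together."""
        self.incidents[incident_id].status = "resolved"
        self._events[incident_id]["status"] = "closed"

desk = ServiceDesk()
outage = {"id": "e1", "type": "device_down", "category": "network"}
notice = {"id": "e2", "type": "config_changed", "category": "app"}
incident = desk.open_from_event(outage)   # routed to network-ops
ignored = desk.open_from_event(notice)    # not actionable, no ticket
desk.resolve(incident.id)
```

Keeping the event-to-incident link inside the service desk object is one simple way to make status updates flow in both directions; a real integration would typically do this through the ticketing system's own interfaces.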

Going beyond operational event management to machine-assisted learning with AIOps

Whenever possible, you’ll want to correlate and prioritize events from all areas of your on-premises and public cloud infrastructure, automatically generate incident tickets that notify the service desk before users become aware of the problem, and integrate and analyze events from third-party monitoring solutions.

Operational alerting and event management must now be coupled with advanced machine learning and analytics from an artificial intelligence for IT operations (AIOps) platform. Enterprises need advanced ways to triage and automate IT to deliver business value competitively.

With TrueSight from BMC, you can go beyond the basics of event management and make intelligent decisions based on the volume and velocity of service ticket and event management data using AIOps approaches. You can leverage machine learning on a big data platform to find root cause issues and business impact issues that can be addressed to further reduce the noise of event management.

For example, with this approach, you can correlate the time to resolution for specific events with business value to determine if you might need more training for some technology issues in your organization or to justify converting some manual activity to automation.
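A rough sketch of that kind of analysis, using plain aggregation instead of machine learning: it groups resolved tickets by event type, computes the mean time to resolution, and ranks types by total hours spent as a stand-in for business cost. The ticket fields, the sample data, and the use of total hours as a cost proxy are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

def mttr_by_type(tickets):
    """Mean hours to resolution for each event type."""
    durations = defaultdict(list)
    for ticket in tickets:
        durations[ticket["type"]].append(ticket["hours_to_resolve"])
    return {event_type: mean(hours) for event_type, hours in durations.items()}

def total_effort(tickets):
    """Event types ranked by total hours spent, highest first."""
    totals = defaultdict(float)
    for ticket in tickets:
        totals[ticket["type"]] += ticket["hours_to_resolve"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical resolved tickets.
tickets = [
    {"type": "disk_full", "hours_to_resolve": 4.0},
    {"type": "disk_full", "hours_to_resolve": 5.0},
    {"type": "disk_full", "hours_to_resolve": 3.0},
    {"type": "password_reset", "hours_to_resolve": 0.5},
    {"type": "password_reset", "hours_to_resolve": 0.5},
]

averages = mttr_by_type(tickets)
ranking = total_effort(tickets)
```

In this sample, disk_full tickets average 4 hours each and account for 12 of the 13 hours spent, which would make them the first candidate for extra training or automation.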

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

About the author

Patrick Campbell

Patrick T. Campbell has spent his 20+ year career equally between Application and Network Performance Management and K-12 Education. As a Technical Marketing Engineer, he began his career in IT at InfoVista as a Technical Trainer, followed by Raytheon Solipsys, OPNET Technologies (Riverbed Technology), and now BMC Software. In K-12 Education, he taught mathematics at Drew College Preparatory School for seven years and then worked at the University of Maryland Baltimore County (UMBC) as a Mathematics and Science Professional Development Program Co-Director for International Teacher-Scholars from Egypt for another two. Passionate about learning, he has presented at OPNETWORK and at NAIS Teacher Conferences. Patrick received a B.S. in Industrial and Management Systems Engineering from Penn State, and has a Master’s Degree in Human Resource and Behavioral Science from Johns Hopkins University.