In technology and software development, you are always trying something new, then testing it again. Engineers learn from their mistakes and use them to grow their skill sets and improve processes. But some mistakes, like a major network or infrastructure failure, are less forgiving, and their consequences can be the stuff of nightmares.
Fortunately, there is a systematic approach that helps engineers and developers trace a problem back to where it began and discover what went wrong: root cause analysis. In this article, we’ll look at RCA in IT environments.
Root cause analysis (RCA) is a systematic process for finding and identifying the root cause of a problem or event.
RCA rests on the idea that a truly effective system means more than just putting out fires all day. That’s why RCA starts with figuring out how, where, and why the issue appeared. Then it goes further: RCA acts on that answer to prevent the issue from happening again.
Originating in the field of aeronautical engineering, this method is now applied in virtually every industry, and it is particularly useful in software development. Finding the root cause of a software or infrastructure problem is a highly effective quality engineering technique that many industries already mandate in their governance.
Root cause analysis is considered a reactive management approach. In the ITIL® framework for service management, for instance, incident management is a reactive move where you’re responding to a critical incident. Problem management, on the other hand, is a proactive approach wherein you’re seeking out problems to address. (Learn more in Incident Management vs Problem Management.)
RCA has a wide range of advantages (detailed below), but it is dramatically beneficial in the continuous atmosphere of software development and information technology for two main reasons:
Even though performing root cause analysis might feel time-consuming, the opportunity to eliminate or mitigate risks and root causes makes it worthwhile.
Some of the basic principles of RCA can help organizations ensure they are following the correct methodology:
The specific map of root cause analysis may look slightly different across organizations and industries. But here are the most common steps, in order, to perform RCA:
Let’s look at these steps in detail.
Even if you don’t expect the problem to occur again, plan as if it will.
Remember, for RCA to be effective, the team must recognize that processes cause the problems, not people. Pointing fingers and placing blame on specific workers will not solve anything.
(Learn more about the importance of a blameless culture when performing an incident postmortem—the final step of your root cause analysis.)
You can perform RCA using a variety of techniques. We highlight four well-known RCA techniques below—use the technique that meets your specific situation. Here’s a simple distinction:
Take a look at these options and consider which might be best for your situation:
One of the simplest and most commonly utilized tools in conducting an RCA is the 5 Whys method. Mimicking curious children, the 5 Whys method literally suggests that you ask “Why?” five times in a row in order to identify the root cause of basically any process or problem.
5 Whys analysis is easy to use, and it works best for problems that have a single root cause.
Even though the method seems explicit enough, this approach is still meant to be flexible depending on the scenario. Sometimes five whys will be enough. Other times, you’ll need to ask “Why?” a few more times. Or, you could use additional techniques to identify the root cause.
To begin this method, follow this outline:
(See the 5 Whys in action with a simple RCA example, below.)
Pareto charts identify the most significant factor among a large set of factors causing a problem or event. A Pareto chart is a combined bar and line chart, where the factors are plotted as bars arranged in descending order. The chart is accompanied by a line graph showing the cumulative totals of each factor, left to right.
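The ordering and cumulative line described above can be sketched in a few lines of Python. This is a minimal illustration, not a full charting solution: the `pareto_table` helper and the incident log are hypothetical, invented only to show how the bars (counts in descending order) and the line (running cumulative percentage) are derived.

```python
from collections import Counter


def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows, largest count first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    # most_common() yields causes in descending order of frequency,
    # which is the left-to-right order of the bars in a Pareto chart.
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Hypothetical incident log: one entry per incident, labeled with its cause.
incidents = (
    ["config change"] * 12
    + ["disk full"] * 5
    + ["expired certificate"] * 2
    + ["hardware fault"] * 1
)

for cause, count, cumulative in pareto_table(incidents):
    print(f"{cause:<20} {count:>3} {cumulative:>6}%")
```

In this made-up data, the first two causes account for 85% of incidents, which is the kind of concentration a Pareto chart is meant to expose.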
You might know the Ishikawa diagram by other names: the fishbone, the herringbone, the cause-and-effect, and, our favorite, the Fishikawa diagram.
The Ishikawa diagram is a great visualization tool for brainstorming and discovering multiple root causes. It is shaped like a fish skeleton, with the head on the right and the possible causes shown as fishbones to the left.
Scatter diagrams, or scatter plots, graph pairs of numerical data to reveal the relationship between two variables, often with regression analysis to quantify the trend. This is helpful for identifying problems and events that occur because of fluctuating measurements, such as capacity issues that happen when server traffic increases.
(Learn how to create your own scatter plots using Matplotlib.)
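Before plotting, it can help to put a number on the relationship a scatter diagram is meant to show. The sketch below computes the Pearson correlation coefficient using only the standard library. The `pearson_r` helper and the traffic and latency figures are hypothetical, chosen to mirror the server-traffic example above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements: requests per second and response time in ms.
requests_per_second = [100, 200, 300, 400, 500, 600]
response_time_ms = [110, 125, 150, 210, 340, 520]

r = pearson_r(requests_per_second, response_time_ms)
print(f"correlation between traffic and latency: {r:.2f}")
```

A value near 1 or -1 suggests a strong linear relationship worth investigating, while a value near 0 suggests looking elsewhere. The same pairs can be passed to Matplotlib's `pyplot.scatter` to draw the diagram itself.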
Here is a simple 5 Whys analysis in which we try to determine why a computer is not turning on. At each step, we ask why and gather data as we follow the power flow, until we finally determine that the power strip the computer is plugged into is turned off.
Here’s what the user has reported: Their desktop computer is not turning on. The monitor is turned on, but the user does not hear the computer fan running, and there are no power lights.
Root cause analysis using the 5 Whys to troubleshoot a computer that won't turn on
Resolution: The technician turned on the power strip, and the computer powered on.
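One way to record a 5 Whys session so it can be documented and reviewed later is as an ordered chain of questions and findings. The sketch below is only an illustration: the `Why` class and the helper functions are hypothetical, and the intermediate findings are assumed for the example. Only the reported symptom and the final cause (the power strip being off) come from the scenario above.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    finding: str


def root_cause(chain):
    """The finding from the last 'why' is treated as the root cause."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].finding


def render(problem, chain):
    """Format the chain as a short report for the RCA documentation."""
    lines = [f"Problem: {problem}"]
    for number, step in enumerate(chain, start=1):
        lines.append(f"Why {number}: {step.question} -> {step.finding}")
    lines.append(f"Root cause: {root_cause(chain)}")
    return "\n".join(lines)


# Intermediate findings are assumed for illustration; the chain may be
# shorter or longer than five steps depending on the problem.
chain = [
    Why("Why is the computer not turning on?",
        "The computer is not receiving power"),
    Why("Why is the computer not receiving power?",
        "The power strip it is plugged into is delivering no power"),
    Why("Why is the power strip delivering no power?",
        "The power strip is switched off"),
]

print(render("Desktop computer is not turning on", chain))
```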
The main benefit of root cause analysis is obvious: identifying problems so you can solve them. RCA offers plenty more benefits that help to solidify its usefulness and importance in the tech environment.
When the right employees get the right RCA and resolution training, you’ll execute correct processes and solve common business problems.
When you catch problems quickly, you reduce the likelihood that those problems will turn into major incidents, especially when RCA is used to support an agile environment. RCA saves valuable employee time and helps ensure the organization doesn’t incur fines or other compromises.
Employee safety is vital, and root cause analysis provides added peace of mind. By quickly and effectively investigating any safety incidents, you can put solutions in place to prevent anything similar from happening again down the line.
When you follow RCA all the way through to final documentation, you focus on long-term prevention. It also shows that your organization prioritizes solutions, not speedy workarounds.
This forward thinking enables companies to become proactive and productive.
An RCA may show that the problem is broken code due to technical debt. If the problem occurred due to changed business requirements, code development compromises, poor coding practices, or software entropy, the real solution may be refactoring rather than patching. Refactoring realigns your code with desired business outcomes, eliminates technical debt, and brings the code up to current standards for future agile deployments.
Creating a robust root cause analysis process takes time and effort in the initial stages, but it is an investment whose returns extend far beyond the expense. The skills learned during the RCA process carry over to almost every other problem or field and foster an attitude of continuous improvement, and even innovation.
This culture will surely permeate your organization for the better.