DevOps teams often spend far too much time treating recurring symptoms instead of making the extra effort to solve the underlying software and IT problems at their source.
For a system administrator, developer, or QA professional, taking a root cause analysis approach can prevent a lot of unnecessary suffering.
Although employed as a deductive problem-solving methodology in almost every industry—from aeronautical engineering to book publishing—root cause analysis (RCA) is especially useful in the arena of software development and IT where complex systems of cause-and-effect relationships are the norm.
Versatility of Root Cause Analysis
Whether you’re maintaining an MMORPG video game consisting of millions of lines of code or monitoring a cloud-hosting solution backed by multiple SANs, understanding how to trace undesirable effects back to their primary cause is essential to keeping your end-users happy.
Fortunately, modern diagnostic software tools are making it easier than ever to perform a thorough root-cause analysis on Web-based applications without breaking a sweat. These tools also give those who deal in APM and website monitoring an advantage over most other industries when it comes to employing RCA—with fewer flowcharts, Excel sheets, or interdepartmental brainstorming sessions required.
One important approach to RCA, known as root-cause failure analysis (RCFA), emphasises that problems in complex systems can rarely be attributed to a single specific cause. Rather, they are often the result of a series of interlinked “causal factors.”
From poorly trained personnel to design issues to flawed engineering methods, the causal factors behind any problematic event can be ranked by how much each contributed, while still acknowledging that all of them were conditions that, together, led to the incident.
Understanding Root Cause Analysis
One of the simplest and most common approaches to root cause analysis, as practiced across many fields and industries, is the 5-Why approach developed by Sakichi Toyoda, the founder of Toyota Industries, and later adopted within Toyota Motor Corporation.
As the name implies, the 5-Why method consists of acting like an annoyingly inquisitive child, simply asking the question “Why?” five times in succession, or as many times as might be needed to get to a satisfactory conclusion.
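As a minimal sketch of the idea, the 5-Why chain can be modelled as a lookup from each observed effect to the cause uncovered by asking "Why?". The incident data, the `CAUSES` mapping, and the `ask_why` helper below are hypothetical and exist only for illustration. In practice each answer comes from investigation, not from a table.

```python
from typing import Dict, List

# Hypothetical incident: each effect maps to the cause found by asking "Why?".
CAUSES: Dict[str, str] = {
    "Checkout page returns HTTP 500": "Order service timed out",
    "Order service timed out": "Database connection pool was exhausted",
    "Database connection pool was exhausted": "A report query held connections for minutes",
    "A report query held connections for minutes": "The query scans a table with no index",
    "The query scans a table with no index": "Schema changes are not reviewed for query impact",
}


def ask_why(problem: str, causes: Dict[str, str], max_depth: int = 5) -> List[str]:
    """Follow the causal chain from a problem, asking "Why?" up to max_depth times."""
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        cause = causes.get(current)
        # Stop when no deeper cause is known, or when the chain would loop.
        if cause is None or cause in chain:
            break
        chain.append(cause)
        current = cause
    return chain


chain = ask_why("Checkout page returns HTTP 500", CAUSES)
for depth, step in enumerate(chain):
    label = "Problem" if depth == 0 else f"Why #{depth}"
    print(f"{label}: {step}")

# The last link reached is the candidate root cause.
root_cause = chain[-1]
```

The `max_depth` default of five mirrors the method's name, but as the text notes, the right number of questions is however many it takes to reach a satisfactory answer.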
The 5-Why method and other RCA methods of “causal mapping” are typically illustrated in visual form as cause-effect graphs, with the fishbone diagram, or Ishikawa diagram, popularized by Kaoru Ishikawa in 1968, being among the best known.
Taking into account a range of causal factors, from processes to people to materials and equipment, these diagrams begin with a problem and work back to its causes (or vice versa).
Not all problems are created equal, and some may require incredibly extensive causal-factor maps to arrive at one or multiple root causes. One should “avoid over-applying root-cause analysis,” advises James Shore at The Art of Agile. “Balance the risk of error against the cost of more process overhead.” Extensive RCA is not a necessary tactic for every problem, but simple versions of it can be helpful in most cases.
Root cause analysis is simply about determining, very specifically, the when, the where, and the why of a problem at its source, before it can ripple out to affect the end-user of an application or website a second time.
And again, while developers and QA personnel may often need to engage in more traditional methods of RCA, getting together around notepads or whiteboards for extended brainstorming sessions, there are now sophisticated Web-based tools that can do a lot of the job for you, automatically diagnosing the root causes of errors—particularly where Web performance monitoring and real user monitoring are concerned.
The Future: Inductive, Intuitive, and Automated RCA
The traditional practice of RCA is a form of deductive analysis, Sherlock Holmes style, beginning with a known problem and working backward, sifting through the available evidence to identify the culprit.
But taking the opposite approach is also possible. Inductive analysis, such as the methodology known generally as FMEA (failure mode and effects analysis), reasons forward from potential failures to their effects. Software testers do something similar when they think about the kinds of bugs that might be present and write test cases to try to find them. This forward-thinking reasoning is useful for preventing problems from happening at all.
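One common convention in FMEA practice, which the discussion above does not itself prescribe, is to score each potential failure mode for severity, occurrence, and detectability, then multiply the three into a risk priority number (RPN). The sketch below assumes 1-to-10 scales. The failure modes and their scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (almost always caught) to 10 (almost never caught)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: List[FailureMode]) -> List[FailureMode]:
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes for a web checkout flow.
modes = [
    FailureMode("Session token expires mid-checkout", 7, 4, 6),
    FailureMode("Image CDN returns stale assets", 3, 5, 2),
    FailureMode("Payment callback is processed twice", 9, 2, 8),
]

for mode in prioritize(modes):
    print(f"RPN {mode.rpn:>3}: {mode.description}")
```

The highest-scoring items are the ones a tester might cover first when writing test cases.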
And while humans can get quite proficient at inductive analysis, certain website monitoring solutions, such as AlertSite UXM, do it better, automatically alerting sysadmins to potential problems before they happen by understanding desired performance indicators and intelligently anticipating future deviations from the norm.
These tools monitor your Web application in the background, 24/7, silently watching 100% of user interactions and system responses. They are increasingly necessary because split-second delays in site performance can mean success or failure for online vendors, and their ability to identify root causes quickly and efficiently translates into quicker fixes by developers and fewer future delays.
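The general idea of comparing a performance indicator against its recent norm can be sketched in a few lines. This is a deliberately simple illustration, not a description of how AlertSite UXM or any other product works. The `find_deviations` helper, the window size, the threshold, and the response times are all assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_deviations(
    samples: Sequence[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indexes of samples that sit more than `threshold` standard
    deviations above the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # A perfectly flat baseline has no spread to measure against, so skip it.
        if sigma > 0 and (samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical page response times in milliseconds.
response_ms = [210, 205, 198, 215, 202, 208, 211, 640, 207, 204]
print(find_deviations(response_ms))
```

A rolling baseline like this adapts to gradual change but reacts only once a slow sample arrives. Anticipating a deviation before it happens would need trend or forecasting logic on top.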
In a world of ever-more-agile development, continuous deployment, and mobile app-based businesses, it’s safe to say that using every tool at one’s disposal to stop problems at their causal roots has never been more essential.