Building a systematic root-cause analysis (RCA) pipeline

Image credit: designer491 | shutterstock.com

Fault / Alarm management tools have lots of strings to their functionality bows to help operators focus in on the target/s that matter most. ITU-T’s recommendation X.733 provided an early framework and common model for classification of alarms. This allowed OSS vendors to build a standardised set of filters (eg severity, probable cause, etc). ITU-T’s recommendation M.3703 then provided a set of guiding use cases for managing alarms. These recommendations have been around since the 1990’s (or possibly even before).

Despite these “noise reduction” tools being readily available, they’re still not “compressing” event lists enough in all cases.

I imagine, like me, you’ve heard many customer stories where so many new events are appearing in an event list each day that the NOC (network operations centre) just can’t keep up. Dozens of new events are appearing on the screen, then scrolling off the bottom of it before an operator has even had a chance to stop and think about a resolution.

So if humans can’t keep up with the volume, we need to empower machines with their faster processing capabilities to do the job. But to do so, we first have to take a step away from the noise and help build a systematic root-cause analysis (RCA) pipeline.

I call it a pipeline because there are generally a lot of RCA rules that are required. There are a few general RCA rules that can be applied “out of the box” on a generic network, but most need to be specifically crafted to each network.

So here’s a step-by-step guide to build your RCA pipeline:

  1. Scope – Identify your initial target / scope. For example, what are you seeking to prioritise:
    1. Event volume reduction to give the NOC breathing space to function better
    2. Identifying “most important” events (but defining what is most important)
    3. Minimising SLA breaches
    4. etc
  2. Gather Data – Gather incident and ticket data. Your OSS is probably already doing this, but you may need to pull data together from various sources (eg alarms/events, performance, tickets, external sources like weather data, etc)
  3. Pattern Identification – Pattern identification and categorisation of incidents. This generally requires a pattern identification tool, ideally supplied by your alarm management and/or analytics supplier
  4. Prioritise – Using a long-tail graph like below, prioritise pattern groups by the following (and in line with item #1 above):
    • Number of instances of the pattern / group (ie frequency)
    • Priority of instances (ie urgency of resolution)
    • Number of linked incidents (ie volume)
    • Other technique, such as a cumulative/blended metric

5. Gather Resolution Knowledge – Understand current NOC approaches to fault-identification and triage, as well as what’s important to them (noting that they may have biases such as managing to vanity metrics)

6. Note any Existing Resolutions – Identify and categorise any existing resolutions and/or RCA rules (if data supports this)

7. Short-list Remaining Patterns – Overlay resolution pattern on long-tail (to show which patterns are already solved for). then identify remaining priority patterns on the long-tail that don’t have a resolution yet.

8. Codify Patterns – Progressively set out to identify possible root-cause by analysing cause-effect such as:

  • Topology-based
  • Object hierarchy
  • Time-based ripple
  • Geo-based ripple
  • Other (as helped to be defined by NOC operators)

9. Knowledge base – Create a knowledge base that itemises root-causes and supporting information

10. Build Algorithm / Automation – Create an algorithm for identifying root-cause and related alarms. Identify level of complexity, risks, unknowns, likelihood, control/monitoring plan for post-install, etc. Then build pilot algorithm (and possibly roll-back technique??). This might not just be an RCA rule, but could also include other automations. Automations could include creating a common problem and linking all events (not just root cause event but all related events), escalations, triggering automated workflows, etc

11. Test pilot algorithm (with analytics??)

12. Introduce algorithm into production use – But continue to monitor what’s being suppressed to

13. Repeat – Then repeat from steps 7 to 12 to codify the next most important pattern

14. Leading metrics – Identify leading metrics and/or preventative measures that could precede the RCA rule. Establish closed-loop automated resolution

15. Improve – Manage and maintain process improvement

Be the first to comment

What do you think?

This site uses Akismet to reduce spam. Learn how your comment data is processed.