Anatomy of a Major-Incident Postmortem

A major system is down for the second time in as many weeks, customer orders aren’t being processed, and operations can’t provide an estimated recovery time. Sound familiar?

As a CIO, one of the most important steps you can take to prevent recurring outages or major incidents is to conduct rigorous, constructively focused postmortems.

A well-designed postmortem process can be used to develop comprehensive IT action plans. It can also serve as a powerful building block in launching an overall service improvement program, which may involve implementing a best-practice framework such as ITIL (Information Technology Infrastructure Library).

ITIL describes the process frameworks for Incident Management and Problem Management, both of which play key roles in minimizing user downtime. Although ITIL endorses an incident postmortem process, it does not provide a detailed framework for one.

The following is a proven strategy for developing and implementing a postmortem process:

Probing for Contributing Causes

IT organizations are generally effective at assessing a system failure to identify a single root cause, such as a hardware failure or a missing security patch. An action plan to implement the missing patch, for example, should reduce the future risk from this particular type of failure.

In this case, however, the true root cause may remain unresolved because the effectiveness and reliability of the patch management process have not been investigated for gaps. Delving deeper into all of the contributing causes of a major incident can uncover a great deal of additional, highly valuable information.

The postmortem review is designed to probe for those other factors that contribute to impact and downtime.

The process looks at the full chronology of events that make up the incident lifecycle, including factors such as change management processes, cross-group communications, training, documentation and human error.
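One way to make the chronology concrete is to record each milestone with a timestamp and measure how long each phase of the incident took. The following is a minimal Python sketch under stated assumptions: the milestone names, the timestamps and the phase_durations helper are all hypothetical illustrations, not part of ITIL or of any particular tool.

```python
from datetime import datetime

# Hypothetical incident timeline. The milestone names and timestamps are
# illustrative assumptions, not data from a real incident.
timeline = [
    ("failure_began", datetime(2024, 1, 8, 9, 0)),
    ("incident_detected", datetime(2024, 1, 8, 9, 25)),
    ("right_team_engaged", datetime(2024, 1, 8, 10, 10)),
    ("fix_identified", datetime(2024, 1, 8, 11, 0)),
    ("service_restored", datetime(2024, 1, 8, 12, 30)),
]


def phase_durations(events):
    """Return whole minutes elapsed between consecutive milestones."""
    # Sort by timestamp so the result does not depend on input order.
    ordered = sorted(events, key=lambda event: event[1])
    return {
        f"{name_a} -> {name_b}": int((time_b - time_a).total_seconds() // 60)
        for (name_a, time_a), (name_b, time_b) in zip(ordered, ordered[1:])
    }


durations = phase_durations(timeline)
for phase, minutes in durations.items():
    print(f"{phase}: {minutes} min")
```

In this made-up example the longest phase is the 90 minutes between identifying the fix and restoring service, which is the kind of gap a review meeting would want to examine.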

A good review meeting asks such questions as:

  • How effective were the incident diagnosis and response?
  • Did we engage the right people at the right time?
  • Did communication, vendor coordination and escalation work as planned?
  • How could the recovery process have been shortened once a fix was identified?
  • Was there a process failure that contributed to the likelihood of the incident occurring?

Writing off an outage to a single, high-level root cause without this further analysis is like a coroner skipping the autopsy and listing as cause of death: “hit by a bus.”

Like an autopsy, a thorough postmortem review should look at all likely causes of a failure, including the organizational behaviors that may contribute to the failure and to delays in resolution. Only this more comprehensive analysis will lead to an understanding of the often complex relationships among the people, processes and events that come into play before and during a system failure.