A major system is down for the second time in as many weeks, customer orders aren’t being processed, and operations can’t provide an estimated recovery time. Sound familiar?
As a CIO, one of the most important steps you can take to prevent recurring outages or major incidents is to conduct rigorous, constructively focused postmortems.
A well-designed postmortem process can be used to develop comprehensive IT action plans and serve as a powerful building block in launching an overall service improvement program, which may also involve implementation of a best practice framework such as ITIL (Information Technology Infrastructure Library).
ITIL describes the process framework for Incident Management and Problem Management, both of which play key roles in minimizing user downtime. Although the ITIL framework endorses an incident postmortem process, it does not provide a detailed framework for it.
The following is a proven strategy for developing and implementing a postmortem process:
Probing for Contributing Causes
IT organizations are generally effective at assessing a system failure to identify a single root cause, such as a hardware failure or a missing security patch. An action plan to implement the missing patch, for example, should reduce the future risk from this particular type of failure.
In this case, however, the true root cause may remain unresolved because the effectiveness and reliability of the patch management process have not been investigated for gaps. By delving deeper into all of the contributing causes of a major incident, we may uncover a great deal of additional, highly valuable information.
The postmortem review is designed to probe for those other factors that contribute to impact and downtime.
The process looks at the full chronology of events that make up the incident life cycle, including factors such as change management processes, cross-group communications, training, documentation and human error.
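One way to make that chronology concrete is to record it as a timestamped list of events, each optionally tagged with a contributing factor. The following is only a minimal sketch in Python: the `TimelineEvent` type, the `delays_by_factor` helper, and the factor labels are hypothetical names chosen for illustration, and the rule of attributing the gap before a tagged event to that event's factor is a simplifying assumption, not a prescribed method.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional

# Factor categories named in the text; the identifiers are illustrative.
FACTORS = {
    "change_management",
    "communication",
    "training",
    "documentation",
    "human_error",
}


@dataclass
class TimelineEvent:
    """One entry in the incident chronology (hypothetical structure)."""
    timestamp: datetime
    description: str
    factor: Optional[str] = None  # contributing factor, if one applies

    def __post_init__(self) -> None:
        if self.factor is not None and self.factor not in FACTORS:
            raise ValueError(f"unknown factor: {self.factor}")


def delays_by_factor(events: Iterable[TimelineEvent]) -> Dict[str, timedelta]:
    """Sum the time gap preceding each tagged event, grouped by factor.

    Assumption: the delay leading up to an event is attributed to that
    event's contributing factor. Untagged events contribute nothing.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    totals: Dict[str, timedelta] = defaultdict(timedelta)
    for previous, current in zip(ordered, ordered[1:]):
        if current.factor is not None:
            totals[current.factor] += current.timestamp - previous.timestamp
    return dict(totals)
```

Given a reconstructed timeline, a reviewer could call `delays_by_factor(events)` to see roughly how much of the outage each factor accounts for, and use that as a starting point for discussion in the review meeting rather than as a verdict.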
A good review meeting asks such questions as:
Writing off an outage to a single, high-level root cause without this further analysis is like a coroner skipping the autopsy and listing as cause of death: “hit by a bus.”
Like an autopsy, a thorough postmortem review should look at all likely causes of a failure, including the organizational behaviors that may contribute to the failure and delays in resolution. Only this more comprehensive analysis will lead to an understanding of the often complex relationship between people, processes and events that come into play before and during a system failure.