Anatomy of a Major-Incident Postmortem

Structuring the Incident Review Process

IT organizations are unlikely to conduct an effective, detailed analysis of contributing causes of a major incident without strong executive sponsorship and a dedicated process owner.

Due to the cross-organizational nature of the major incident review process and the required commitment of time and resources, successful implementation of a major incident postmortem process must start with sponsorship from the CIO and the senior IT management team.

Once sponsorship has been secured, the first step is to create a problem manager role and establish a problem review board to serve as a process development group. With the problem manager serving as the lead, the review board must first set the criteria for when an incident rises to the level of “major.”

ITIL defines major incidents as “those for which the degree of impact is extreme” and “for which the timescale of disruption — to even a relatively small percentage of users becomes excessive …”.

As a rule of thumb, the following definition works well as a starting point for many organizations: a major incident is any service impact on a critical business system that lasts more than an hour.
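To make the rule of thumb concrete, here is a minimal sketch of how it might be encoded in a ticketing workflow. The function name, the `system_is_critical` flag, and the default one-hour threshold are illustrative assumptions; the review board's own criteria should replace them.

```python
from datetime import datetime, timedelta

# Assumed default, taken from the rule of thumb above; the review board may pick another value.
MAJOR_INCIDENT_THRESHOLD = timedelta(hours=1)


def is_major_incident(system_is_critical, impact_start, impact_end,
                      threshold=MAJOR_INCIDENT_THRESHOLD):
    """Return True when a critical system was impacted for longer than the threshold."""
    duration = impact_end - impact_start
    return bool(system_is_critical) and duration > threshold


# Hypothetical example: a 90-minute outage on a critical system qualifies.
start = datetime(2024, 1, 4, 9, 0)
end = datetime(2024, 1, 4, 10, 30)
print(is_major_incident(True, start, end))
```

Keeping the threshold as a parameter lets the board tighten or relax the definition without changing the logic.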

The next step for the problem review board is to document the incident review process, meeting guidelines, and templates for capturing incident chronologies. Charter the problem review board to meet within three to five business days following a major incident.
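The meeting window can be computed rather than counted by hand. The sketch below assumes that "business days" means Monday through Friday and ignores holidays, and it reads the charter as a window running from the third to the fifth business day after the incident. The helper names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start, days):
    """Advance `start` by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def review_window(incident_date, earliest=3, latest=5):
    """Return the (earliest, latest) dates on which the review board should meet."""
    return (add_business_days(incident_date, earliest),
            add_business_days(incident_date, latest))


# Hypothetical example: an incident on Thursday, 4 January 2024.
first_day, last_day = review_window(date(2024, 1, 4))
print(first_day, last_day)
```

An organization that observes holidays would need to extend `add_business_days` with its own calendar.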

The problem manager is responsible for scheduling and managing the review meetings, capturing action items, and developing and executing service improvement plans.

It sends a powerful message when the CIO or another senior IT staff member volunteers to serve as executive chairperson and attends review meetings whenever possible. Executive involvement serves as a behavior model for the organization and reinforces the importance of the board’s role and the organization’s commitment to improving service.

Other attendees should include those technical managers and staff who were involved in the chronology of the outage.

Focusing the Incident Review Process

The postmortem process will fail if the problem review board is used as a forum to identify the person or organization at fault. Although it’s tempting to place blame, especially if software vendors, contractors, or outsource vendors are involved, it will be impossible to gather all the pertinent facts when the people involved are focused on covering their tracks.

The problem manager, serving as chairperson, must ensure that fact-finding is objective, positive, and focused on the offending processes. To be successful, the meeting must stay focused on actions that could or should be taken going forward to reduce risk and recurrence.

Capturing the Incident Chronology

Ideally, an incident ticketing system serves as the repository for capturing information about the incident, and the ticket serves as the outage history. For smaller organizations that may not have a ticketing system, an email summary of events built from shift logs and participant notes can work just as well to facilitate the review.
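For teams working from shift logs and participant notes, the chronology can be assembled by merging the timestamped entries into one time-ordered list. The following is only a sketch: the `(timestamp, note)` tuple layout, the function names, and the sample entries are assumptions made for illustration, not a prescribed template.

```python
from datetime import datetime


def build_chronology(*sources):
    """Merge (timestamp, note) entries from several sources into one time-ordered list."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry[0])


def format_chronology(entries):
    """Render the chronology as plain text, one event per line, ready to paste into an email."""
    return "\n".join(f"{ts:%Y-%m-%d %H:%M}  {note}" for ts, note in entries)


# Hypothetical sample data.
shift_log = [
    (datetime(2024, 1, 4, 9, 5), "Monitoring alert: order system unreachable"),
    (datetime(2024, 1, 4, 10, 20), "Service restored after database restart"),
]
participant_notes = [
    (datetime(2024, 1, 4, 9, 30), "DBA paged; connection pool found exhausted"),
]

print(format_chronology(build_chronology(shift_log, participant_notes)))
```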

Regardless of the method used, the importance of a good, high-level summary cannot be overstated. It serves as the instrument for zeroing in on the key questions underlying each review meeting:

  • Is there anything that could have been done to prevent or avoid this event?
  • How could we have shortened the duration?
  • What action can be taken to avoid or shorten its impact in the future?

Building the Action Plan

By analyzing the chronology, the board documents a comprehensive action plan for follow-up. While some of the underlying causes may remain unknown at the time of the meeting, these can be captured as open action items to be closed when final research is completed.

An action item matrix that captures each action, the person assigned, and a due date for follow-up serves to reduce future risk.
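One way to represent such a matrix is sketched below. The `ActionItem` fields mirror the three columns named above plus a `closed` flag; the class name, field names, and the `overdue_items` helper are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str           # what needs to be done
    owner: str            # person assigned
    due: date             # due date for follow-up
    closed: bool = False  # set to True once the follow-up is complete


def overdue_items(items, today):
    """Return the open action items whose due date has already passed."""
    return [item for item in items if not item.closed and item.due < today]


# Hypothetical sample matrix.
matrix = [
    ActionItem("Add connection pool alerting", "J. Smith", date(2024, 1, 12)),
    ActionItem("Complete root cause research", "A. Lee", date(2024, 1, 19)),
    ActionItem("Update restart runbook", "J. Smith", date(2024, 1, 10), closed=True),
]

for item in overdue_items(matrix, today=date(2024, 1, 15)):
    print(item.owner, item.action, item.due)
```

Causes that are still unknown at meeting time fit the same structure as open items with a research owner and a due date.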

The Postmortem in a Service Management Culture

In rolling out the process, reinforce with the organization from the outset that a postmortem process has only one goal: to drive service improvement.

An overview of the process, management expectations, and the goal of the reviews should be discussed with all staff members before the first meeting. Restate that goal to participants at each meeting.

In IT environments today, where systems typically underlie the organization’s most critical business functions, and where a system failure can mean revenue impact, missed business commitments, and customer dissatisfaction, instituting a postmortem review process can be a positive early step on the road to establishing a service management culture.

Brian Corrington is the President and CEO of Codesic Consulting, an IT consultancy headquartered in Kirkland, Washington. Corrington has over twenty years of experience managing enterprise-scale IT projects, building technology practices, and developing strategic customer relationships.