Simon couldn’t believe it. Once again, the new, expensive system that had been put in place over a year ago was on the fritz. The reporting engine, which normally produced reports automatically in about 20 seconds, was taking more than 60 minutes per report. Even worse, some reports never came out at all.
Nothing had changed, and no patches had been installed. Everything had been working fine, and all of a sudden performance slowed to a crawl for no apparent reason. Worse still, the reactive measures that had worked before no longer helped: verifying memory consumption, recycling services, even rebooting servers.
A problem escalation team (PET) was set up to address the issue. Simon, a project manager for this multibillion-dollar global company, headed the team. System administrators and DBAs were brought on board to monitor SQL queries, CPU usage, memory usage and so on. After two days the problem had still not been resolved, so vendors were brought in as well: the database manufacturer, the vendor who delivered the application, and the vendor of the reporting engine. Various methodical diagnostic procedures began.
One of the procedures involved producing a report interactively instead of automatically, to see how much time it took to run. During one of those tests, Simon made a seemingly innocuous comment: “I wonder why we aren’t seeing the X’s in the red squares for that report.” There was a brief moment of silence, but the comment was dismissed because everything else looked correct, and the report was produced in less than 20 seconds.
On the fourth day, an external consultant with expertise in the reporting engine was called in. While looking over the test results, someone on the PET recalled Simon’s comment. The consultant looked deeper at the results and, sure enough, the missing X was the root cause of the problem. It was fixed quickly, the fix was uploaded to the production servers, and everything has been working smoothly since.
From the moment the incident began until it was resolved, many members of the PET were on conference calls and Live Meeting sessions for more than ten hours at a time. Yet during all this time nobody complained loudly, and in the end the problem was resolved relatively quickly.
Crises such as these are good for IT because they allow you to test the quality and strength of your teams and procedures. With proper resolution, they also help improve your systems infrastructure.
To get through a crisis, the following guidelines can help:
Focus on the business: During the system outage, there was little time and energy spent dealing with personal issues. According to Simon, “The most important thing is to get the system up and running and service the business.” The reporting system oversees operations for over 70 different sites across the U.S. “Our job is to keep the business running. We impact a $1 billion business.”
Expertise comes first: People were included on the PET based on their expertise and skills, not on their job descriptions or project assignments. Some of the people on the calls would not normally have worked on resolving the issue. People who had moved on to other projects were asked to drop what they were doing in order to help out. They did so gracefully.
Everybody contributes: There is always a person on-call who expects to be interrupted at any time to resolve system outages. However, that person is not responsible for fixing everything. If there are people with more appropriate skills, they can be brought on as needed.