When Mary Delhaye reported yet another issue with the application she owned, she was lucky enough to walk across the office and talk to the service management team directly. At the headquarters of the French financial services company where she works, the developers, service team and business owners are all in close proximity — although the applications they manage are used by 60,000 people worldwide.
As Delhaye told her colleagues that her application had crashed again, she was surprised by their response: “That’s not an incident. That’s a problem.” It was a problem, but the difference between incidents and problems is not something people find easy to define, or to manage in service terms.
“If you’re going down a road full of potholes and you get a flat tire, that’s an incident,” explained Phil Peplow, head of IT Service Delivery at a UK healthcare company. “You can get ’round it by carrying a spare. If you want to make sure it never happens again, you can apply problem management, and work out why it happened in the first place and eradicate the cause; fill in the potholes, or never drive down that road again. That’s what problem management does.”
Neil Almond, consultant at service management company Privitas Solutions, believes people find it difficult to distinguish problems from incidents partly because of our tendency to want to know why things happen.
“I think the main issue people have with the distinction between a problem and an incident is that they try to over-think incidents and start doing root cause analysis, rather than just thinking, ‘How can I get ’round this?’,” he said. “Sometimes the workaround can be almost too obvious, like re-routing output to another printer, for example.”
Fixing the incident, however, doesn’t mean that it won’t happen again. “The only time at which we can give a guarantee that this issue will never happen again is after potentially lengthy and detailed analysis, and most likely once a change has been implemented that fundamentally alters the infrastructure in some way,” said Almond. “My favorite analogy for this is taking two aspirin when you get a headache; this in no way guarantees that you will never get another headache, but it does address the symptoms, and you do not need to understand why you had the headache in the first place.
“Clearly, there will be a cause for every incident, and one of the biggest challenges we face is accepting the fact that we simply do not have the time, money, resources or, in some cases, skills to establish this cause in every case.”
Incidents are triggered by a deviation from the norm when something goes wrong and day-to-day service is impaired. “For the most part, knowing when to log an incident is easy: whenever there’s a service interruption, or a potential service interruption, log an incident,” said David Stucky, a partner at service management firm Taruu and one of the first people to gain ITIL v3 Expert certification.
“Knowing when to log a problem ticket gives many IT shops more trouble; unnecessarily in my view. The bomb-proof test for knowing when to log a problem or not is also actually very simple: Are we already aware that the flaw exists in our environment or not?”
Stucky calls this the ‘KEDB match’ approach.
“Where a known error data base (KEDB) is being used to track problems and to provide the service desk with workarounds for incidents, this question translates to: Is there a match for the incident in our KEDB?” he explained. “If not, we log a problem ticket and start the wheels turning towards investigation and resolution of the underlying cause of the incident.”
This approach, he believes, is far more robust than letting the incident manager decide whether the issue should be escalated and raised as a problem.
“The weakness of this approach should be obvious,” Stucky said. “How does the incident manager make the call? Amazingly, I’ve seen this suggestion offered up by so-called ITSM ‘experts’, apparently those fortunate enough to enjoy the services of incident managers possessed of better critical thinking skills than the experts themselves.”
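For readers who think in code, the ‘KEDB match’ test Stucky describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `KnownError` and `Triage` types, the `triage` function and the idea of matching on a single `signature` string are all hypothetical, and a real service desk tool would match incidents to known errors in its own way.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class KnownError:
    signature: str   # hypothetical key identifying the underlying flaw
    workaround: str  # what the service desk does to restore service


@dataclass
class Triage:
    incident_logged: bool
    workaround: Optional[str]
    problem_logged: bool


def triage(signature: str, kedb: Dict[str, KnownError]) -> Triage:
    """Always log the incident; log a problem only if the KEDB has no match."""
    known = kedb.get(signature)
    if known is not None:
        # The flaw is already known: apply the workaround, no new problem ticket.
        return Triage(incident_logged=True, workaround=known.workaround,
                      problem_logged=False)
    # No match: the flaw is new to us, so a problem ticket starts the investigation.
    return Triage(incident_logged=True, workaround=None, problem_logged=True)


# Illustrative KEDB with a single known error (made-up data).
kedb = {
    "printer-queue-stall": KnownError(
        signature="printer-queue-stall",
        workaround="Re-route output to another printer",
    )
}

known_case = triage("printer-queue-stall", kedb)
new_case = triage("application-crash", kedb)
```

In this sketch every service interruption produces an incident, while the decision to raise a problem depends only on the lookup, not on anyone’s judgment call.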
Whether the tickets are raised by an excellent incident manager or through cross-checking the KEDB, it’s common to have incident and problem records open at the same time, as they have more similarities than differences.
“By definition, at the root of every incident there is a problem,” said Almond, “so much of the information about the incident is likely to help in the diagnosis of the root cause. When it comes to the detail of the records themselves it is very likely that the category and possibly even the priority will be the same.”
That doesn’t automatically mean that once you’ve closed the problem and fixed the root cause, all the associated incidents are closed too.
“In many cases, while we may have developed a permanent solution to an underlying flaw which causes incidents, thereby allowing us to close the problem ticket, it takes us longer to implement the solution throughout our environment,” said Stucky. “So incident tickets remain open. In short, there’s absolutely no necessary relationship between the lifespan of an incident ticket and a problem ticket.”
“In real life you have to manage in the gap between problems and incidents, as it’s rare that you can completely ensure that something will never happen,” added Peplow. “The way you manage it is simple, it’s proactivity. Anticipate the problem before it happens. I asked a colleague once which washing machine he’d buy, and he said the one with the best service contract. I told him that was wrong — he should get the one that doesn’t break.”
Elizabeth Harrin is Computer Weekly’s IT Blogger of the Year 2010. She is also director of The Otobos Group, a business writing consultancy specializing in IT and project management. She’s the author of “Social Media for Project Managers” and “Project Management in the Real World”. She has a decade of experience in IT and business change functions in healthcare and financial services, and is ITIL v3 Foundation certified.