Thursday, January 15, 2015

The "Problem" with Problem Management

Few IT organizations are really good at problem management; it is often used only for managing the aftermath of major incidents. I think that one of the reasons for this is confusion in the way we distinguish incident management from problem management. We could do a much better job if we changed how we think about these concepts.

I see two big issues with the way we currently define incident and problem management.

1. Failures that have not yet impacted service to users are not well handled by either incident management or problem management.
The ITIL definition of an incident says it is:
"An unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident. For example failure of one disk from a mirror set."
Even though I wrote this ITIL definition, I really don’t agree with the part that treats a failure with no service impact as an incident. If there has been a component failure that has no impact on any users, then we don’t need to follow most steps of the incident management process, and we don’t need the intended outcome of incident management, which is to restore service to users as quickly as possible.
We can’t use problem management to manage these failures, because ITIL defines a problem as:
"A cause of one or more Incidents. The cause is not usually known at the time a Problem Record is created, and the Problem Management Process is responsible for further investigation."
Something that has not yet had an impact on any users is definitely not a problem under this definition, and it’s not helpful to call it an incident either. I think that we should separate these kinds of issues, call them faults, and manage them with a "fault management" process. Fault management is well known in engineering as the process that detects, isolates and corrects malfunctions, which is exactly what is needed for this kind of thing.
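As a rough sketch of this distinction, routing could look like the Python below. The `Event` record, its field names and the `route` function are all hypothetical, invented for illustration; they do not come from ITIL or from any real tool.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical failure record; the field names are illustrative assumptions."""
    component: str
    description: str
    users_affected: int  # 0 means service to users has not been impacted


def route(event: Event) -> str:
    """Return the process that would handle the event under the proposed split."""
    if event.users_affected > 0:
        # Service is degraded or interrupted: restore it as quickly as possible.
        return "incident management"
    # No user impact yet: detect, isolate and correct the malfunction.
    return "fault management"


# Invented example events, including the mirror-set disk failure from the ITIL text.
disk = Event("mirror-set-01", "one disk failed in a mirror set", users_affected=0)
outage = Event("mail-server", "mail service unavailable", users_affected=250)

print(route(disk))
print(route(outage))
```

The only decision point is whether users have been affected, which is the single criterion proposed above for separating faults from incidents.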

2. Analysis of incident trends belongs to a separate process (problem management), rather than being integrated with incident management itself, as trend analysis is in every other service management process.
The work that we define as proactive problem management has nothing to do with managing problems. Analysing incident records to spot trends and proposing changes to resolve the underlying causes of incidents is really continual improvement for incident management. Separating this activity from incident management, and from other continual improvement activity, can lead to the following consequences:
  • The register of improvement opportunities is split between problem management and continual improvement; this makes it harder to compare the costs and benefits of all the potential improvements and so ensure that the right ones are funded
  • We take away responsibility for incident management improvements from the owner of the incident management process; this makes accountability and governance very difficult
If we create a fault management process, as suggested above, and we then move responsibility for continual improvement of incident management to the owners of that process, then we end up with a much simpler approach.
Changes to incident management that we would need are:
  • Only dealing with service-affecting incidents; failures that have not affected service are no longer considered to be incidents
  • Taking responsibility for measuring, monitoring and improving all aspects of the incident management process – including detecting trends in the incident management data
The new fault management process would include:
  • Detection, correction and resolution of all infrastructure and application malfunctions and errors
  • Ensuring that information about faults is made available to other service management processes (such as incident management)
This gives a much simpler split of work than the incident / problem management split that we currently use.
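The trend detection that incident management would take on under this split can be sketched very simply. This is only an illustration: the record layout, the `category` field, the threshold of three and the sample data are all assumptions made up for the example, not part of any ITIL definition or tool.

```python
from collections import Counter


def recurring_categories(incidents, threshold=3):
    """Return (category, count) pairs with at least `threshold` incidents,
    most frequent first. Each incident is a dict with a 'category' key."""
    counts = Counter(incident["category"] for incident in incidents)
    return [(category, n) for category, n in counts.most_common() if n >= threshold]


# Invented sample of closed incident records.
incidents = [
    {"id": 1, "category": "email"},
    {"id": 2, "category": "vpn"},
    {"id": 3, "category": "email"},
    {"id": 4, "category": "printing"},
    {"id": 5, "category": "email"},
    {"id": 6, "category": "printing"},
]

for category, n in recurring_categories(incidents):
    print(f"{category}: {n} incidents, candidate for the improvement register")
```

Anything this kind of check flags would go into the single register of improvement opportunities, owned by the incident management process owner.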
