Operational Risk Management: Reaping the Benefits

Yet today’s savvy IT practitioners are showing increased interest in the best-practices frameworks provided by the IT Infrastructure Library (ITIL), and in its IT Service Management (ITSM) elements in particular. These help IT organizations achieve excellence in Service Delivery and Service Support, so that their business continuity and availability solution adequately supports the business’s needs even as those needs change rapidly.

Four Steps to Better Risk Management

Step 3: Design and implement solutions. This is the step where the rubber meets the road. The design of the solution should be guided by findings from the first two steps and encompass the entire IT environment, including storage, databases, applications, systems and networks.

Evaluate hardware, software and services needs in terms of established priorities to create an incremental implementation plan that addresses the most critical needs first. Include a continual service improvement plan.
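As a rough illustration of turning established priorities into an incremental plan, the Python sketch below sorts a list of needs by priority and splits them into phases so the most critical items come first. The Need type, the 1-to-4 priority scale, the phase size and the sample entries are all assumptions made for illustration, not part of any ITIL or vendor tooling.

```python
from typing import NamedTuple


class Need(NamedTuple):
    description: str
    priority: int  # hypothetical scale: 1 = most critical


def build_phases(needs, per_phase=2):
    """Order needs by priority and split them into implementation phases."""
    ordered = sorted(needs, key=lambda need: need.priority)
    return [ordered[i:i + per_phase] for i in range(0, len(ordered), per_phase)]


# Made-up needs for illustration only.
needs = [
    Need("Archive storage refresh", 4),
    Need("Database replication", 1),
    Need("Network redundancy", 2),
    Need("Application monitoring", 3),
]

phases = build_phases(needs)
for number, phase in enumerate(phases, start=1):
    print(f"Phase {number}: " + ", ".join(need.description for need in phase))
```

In practice the priorities would come from the findings of the earlier steps, and the phase size would depend on budget and staffing.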

Step 4: Monitor, manage and evolve. As the solution is implemented, ongoing monitoring and management are essential to maintaining a solution that meets the business’s needs.

Software management tools can aid in this process, but it is also critical to establish IT service management policies and training that align people and processes with best practices. This is where solid ITIL/ITSM practices can provide enormous value.

For example, a change management board can be used to trigger episodic or regular reviews and tests of continuity plans. Finally, as the business evolves, it’s important to reassess strategy, plans and solutions to ensure they are continuing to meet the business requirements and addressing new threats that may arise.

Business continuity and availability planning, data center and IT infrastructure operations, and the implementation of IT-supported business processes are typically three distinct and disconnected processes. The challenge of integrating these processes also represents an enormous opportunity.

Business continuity and availability planning is typically focused on identifying and managing business risks, covering people, process and technology. This is the process that establishes recovery time objectives (RTO) and recovery point objectives (RPO), and tests and keeps plans up to date. It does not focus on IT (operational) risks.

When processes become integrated, business continuity and availability planning can play a valuable role in defining requirements for new processes, aligning service level agreements (SLAs) with RTO/RPO, updating plans for new business processes and ensuring compliance.

Similarly, with the traditional disconnected approach, data center and IT infrastructure operations usually react to downtime, and most IT managers complain that they spend too much time on maintenance and too little on improvements.

An integrated, proactive approach will see data center and IT infrastructure operations connecting RTO/RPO to SLAs and adopting best practices to help keep business continuity and availability plans up to date and reduce operational costs.
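One way to picture the connection between RTO/RPO and SLAs is a simple consistency check that flags services whose recovery objectives are looser than what their SLA allows. The Python sketch below is a minimal example under stated assumptions: the ServiceTargets fields, the hour-based units and the sample services are hypothetical and do not come from any particular ITSM product.

```python
from dataclasses import dataclass


@dataclass
class ServiceTargets:
    name: str
    rto_hours: float                # recovery time objective from the continuity plan
    rpo_hours: float                # recovery point objective from the continuity plan
    sla_max_outage_hours: float     # longest outage the SLA tolerates (assumed field)
    sla_max_data_loss_hours: float  # largest data-loss window the SLA tolerates (assumed field)


def misaligned(services):
    """Return the names of services whose RTO or RPO exceeds the SLA limits."""
    return [
        service.name
        for service in services
        if service.rto_hours > service.sla_max_outage_hours
        or service.rpo_hours > service.sla_max_data_loss_hours
    ]


# Made-up services for illustration only.
services = [
    ServiceTargets("Order entry", rto_hours=4, rpo_hours=1,
                   sla_max_outage_hours=2, sla_max_data_loss_hours=1),
    ServiceTargets("Reporting", rto_hours=24, rpo_hours=12,
                   sla_max_outage_hours=48, sla_max_data_loss_hours=24),
]

flagged = misaligned(services)
print("Services needing review:", flagged)
```

A check like this could be run whenever an SLA or a continuity plan changes, so that the two do not drift apart.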

The implementation of IT-supported business processes undergoes the same kind of transformation when looked at with an eye for creating a holistic solution. This is the process that typically defines SLA criteria for availability and performance.

Usually, it lacks the institutional connection needed to integrate with and update business continuity and availability plans, has no link to RTO/RPO, and does not focus on IT operational practices and their impact on these processes. When integrated, however, it can play a critical role in delivering on SLA and RTO/RPO requirements while reducing operational costs to support new business processes.

The Resiliency Spectrum

The majority of companies do have some sort of business continuity/disaster recovery plan and solution in place, and with good reason. In the May 2006 survey, 90% reported they had a disaster recovery plan in place and 65% reported they had experienced outages of an hour or more, with the average being 10 hours. At an average cost of $90,000 per hour of downtime, this translates to a loss of about $900,000, or nearly $1M, per outage.
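The per-outage figure follows directly from the two survey averages. The short Python sketch below reproduces the arithmetic, assuming that cost scales linearly with outage duration; the function and constant names are illustrative only.

```python
# Survey averages quoted in the text above.
AVERAGE_OUTAGE_HOURS = 10
AVERAGE_COST_PER_HOUR_USD = 90_000


def outage_cost(hours, cost_per_hour):
    """Estimate the cost of one outage, assuming cost grows linearly with duration."""
    return hours * cost_per_hour


estimated_loss = outage_cost(AVERAGE_OUTAGE_HOURS, AVERAGE_COST_PER_HOUR_USD)
print(f"Estimated loss per outage: ${estimated_loss:,.0f}")  # prints $900,000
```

Substituting an organization's own outage history and hourly cost gives a first-order estimate of its exposure.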