Disaster Recovery Made Easy (Well, Sort of)

In a recent survey conducted by the Storage Networking Industry Association, 85 percent of participants reported that recovery or business continuity was the most important issue for them.

On Aug. 14, 2003, the largest blackout in American history struck the northeastern U.S. and eastern Canada as a result of a generator failure at FirstEnergy Corp. in Akron, Ohio.

About 10 million people in Ontario, Canada, were affected, as were about 40 million in the U.S. Experts estimate that outage-related losses were between $4 billion and $10 billion, and they cite several contributing factors, including inadequate disaster preparedness and software deficiencies.

Imagine how much money would have been saved if FirstEnergy's systems had all been in place, with hardware, applications and the network infrastructure all aligned.

Anatomy of a Disaster

A disaster is any unplanned event that disrupts “business as usual.” That makes a disaster a business issue rather than a technology issue, and managing one successfully requires a preparedness plan to be in place beforehand.

According to a survey conducted by Gartner, two out of five companies that experience a catastrophic event or prolonged outage never resume operations.

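Of those that do, one of three goes out of business within two years as a direct result of that outage or event. The conclusion: The 40 percent that never reopen, plus one-third of the remaining 60 percent (another 20 percent), means that 60 percent of businesses affected by major disasters are out of business within two years.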

Rather than hypothesize about potential disasters individually, an overall plan should address two major factors.

The first is the recovery time objective (RTO): How quickly must lost data be recovered after a disaster? Some systems might not need to be recovered immediately, while others must be brought back online as soon as possible.

The second is the recovery point objective (RPO): To what point do the systems and information need to be recovered? Can a loss of time or a loss of data be tolerated? If the last transaction was lost, could it be recovered in another way?

For each aspect of a business, the RTO and the RPO must be identified. Once these two factors are determined, planning will fall easily into place.
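
As a minimal sketch of what identifying these two factors might look like in practice, the Python snippet below records an RTO and an RPO for a few business functions and derives a recovery order from them. The business functions, the hour values and the helper names (RecoveryObjective, recovery_order, backup_meets_rpo) are illustrative assumptions, not figures from any real plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjective:
    """Recovery targets for one business function (all values hypothetical)."""
    name: str
    rto_hours: float  # how quickly the function must be back online
    rpo_hours: float  # how much recent data, measured in time, may be lost


def recovery_order(objectives):
    """Return function names sorted so the tightest RTO is recovered first."""
    return [o.name for o in sorted(objectives, key=lambda o: o.rto_hours)]


def backup_meets_rpo(objective, backup_interval_hours):
    """Worst-case data loss equals the time between backups,
    so the backup interval must not exceed the RPO."""
    return backup_interval_hours <= objective.rpo_hours


# Example values are assumptions chosen only to illustrate the idea.
objectives = [
    RecoveryObjective("accounting", rto_hours=24, rpo_hours=4),
    RecoveryObjective("customer communications", rto_hours=2, rpo_hours=1),
    RecoveryObjective("intranet forms", rto_hours=72, rpo_hours=24),
]

order = recovery_order(objectives)
nightly_ok = backup_meets_rpo(objectives[0], backup_interval_hours=24)
```

With these assumed numbers, customer communications would be recovered first, and a nightly backup would not satisfy a four-hour RPO for accounting, which is the kind of gap this exercise is meant to expose.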

The Plan

The first element that must be identified is what needs to be recovered in the event of a disaster.

Accounting applications, customer relationship management, financial systems and production management systems must all be brought back online in the event of a disaster.

Although restoring and running key operations is imperative, other services and data must also be accessible, including email, voice mail, intranet and internet access, forms, licenses and other business information.

The next question is when lost data needs to be recovered.

The answer will determine recovery priorities. In most businesses, customer-facing functions and communications are imperative. In the event of a disaster, businesses must be able to communicate with their customers and employees.

Without this capability, recovering from the disaster or crisis becomes far more difficult. Personnel must know whom to turn to in a crisis, and they must also know what is expected of them.

The next question on the list is who will conduct the recovery.

Every individual in the company must be aware of his or her responsibilities for disaster recovery and of the time in which those tasks should be completed.

The last element is how recovery will take place.

Unfortunately, most businesses cannot justify the cost of a fail-over hot site, in which data instantaneously flips over and becomes available at an alternate location. Therefore, they must identify what level of solution their budget will allow, given the RTO and RPO they have set.

Each level will reflect the time needed to recover certain data and how the data will be stored. The cost will be higher and the technology more advanced for data that must be recovered in a short period of time. Conversely, it will cost less to recover data over a longer period of time.
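
The trade-off between recovery time and cost can be sketched as a simple lookup. In the Python example below, the tier names other than the hot site, the recovery times and the relative costs are all assumptions made up for illustration; a real plan would substitute quotes from actual vendors.

```python
from collections import namedtuple

# Hypothetical solution levels: recovery times and costs are assumed values.
Tier = namedtuple("Tier", ["name", "recovery_hours", "relative_cost"])

TIERS = [
    Tier("hot site with fail-over", 0.1, 10),
    Tier("warm standby", 12, 4),
    Tier("offsite backup restore", 48, 1),
]


def cheapest_tier(rto_hours):
    """Return the least expensive tier that can recover within the RTO,
    or None if no tier is fast enough."""
    qualifying = [t for t in TIERS if t.recovery_hours <= rto_hours]
    if not qualifying:
        return None
    return min(qualifying, key=lambda t: t.relative_cost).name


choice_for_2h = cheapest_tier(2)
choice_for_24h = cheapest_tier(24)
choice_for_72h = cheapest_tier(72)
```

Under these assumptions, only a two-hour RTO forces the expensive hot site; relaxing the RTO to 24 or 72 hours allows progressively cheaper levels.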

To successfully implement a solution, a plan must be drafted. The RTO and RPO of each business application must then be identified. The next step is the installation of the technology, procedures and documentation of the plan.

Following that, a test run is required for each application and business function, during which staff members are cross-trained in performing recovery operations.

Finally, tests must be performed at least once per year to ensure that the backup procedures are functional, that all the technologies are still compatible and that the employees are still familiar with the proper procedures.
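
One way to keep the once-per-year rule from slipping is to track the date of each function's last recovery test and flag anything older than a year. The Python sketch below assumes a simple dictionary of test dates; the function names and dates are hypothetical.

```python
from datetime import date, timedelta

# "At least once per year", expressed as a maximum age for the last test.
MAX_TEST_AGE = timedelta(days=365)


def overdue_tests(last_tested, today):
    """Return, in alphabetical order, the business functions whose last
    recovery test is more than a year old."""
    return sorted(
        name
        for name, tested_on in last_tested.items()
        if today - tested_on > MAX_TEST_AGE
    )


# Hypothetical test log used only for illustration.
last_tested = {
    "accounting": date(2023, 1, 10),
    "customer communications": date(2023, 11, 2),
}

overdue = overdue_tests(last_tested, today=date(2024, 3, 1))
```

In this example only the accounting test is flagged, because it was last run more than a year before the assumed review date.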

Lessons Learned

Having data recovery processes in place before a disaster strikes is imperative, but testing those processes is just as important.

Just as the military runs drills to maximize its effectiveness, a business should rehearse its data recovery effort so that it runs smoothly. The procedures should therefore be tested at least once per year to ensure that everyone knows his or her responsibilities and whom to communicate with.

Drills are also necessary to ensure that if any systems or procedures have changed, all components of a recovery effort will still be successful and error-free.

Although risk management can certainly be a costly venture, more and more companies are taking the proper precautions to ensure that, in the event of a disaster, they are prepared for seamless data recovery.

If recent headlines have taught us anything, it is that companies can never be too safe when it comes to protecting their data. Ultimately, whether communication, training and procedures are in place will determine whether a company fails or succeeds at disaster recovery.

Bill Abram is president and founder of Pragmatix, a diversified IT company that builds custom database and web-enabled applications. He can be reached at 914-345-9444.