Prevent Disaster in the Data Center

There are many reasons to create segregated physical locations for servers and other critical infrastructure equipment.

First, access is controlled, thus limiting security threats. Second, the controlled access limits human error arising from accidents and “curiosity.” Third, the concentration allows for efficient oversight and administration. Fourth, and the focus of this article, the relative consolidation of assets enables a controlled environment to better manage the risks associated with air-conditioning, fire and flooding.

Air-Conditioning/System Cooling

Today’s IT systems generate a tremendous amount of heat and need dedicated air-conditioning systems to be properly cooled. Some systems are even implementing cooling systems that dissipate heat via mechanisms other than the traditional muffin fans and are reminiscent of the days of mainframes and water cooling (at least to those of us who have been around long enough to have seen the water pipes and raised floors).

Systems that run hotter than recommended are more likely to suffer component failure than ones in a cooler setting. Some years ago, I was involved with a small server room that didn’t have a dedicated AC unit, but did have a dedicated duct. It worked well during the week, when people in the office kept the AC unit running, because the thermostat was in the office area rather than the server room. On weekends, the office area would cool off quickly and the AC unit would shut down while the server room baked. We knew something odd was going on when RAID drives and other components started failing far too often.

The climax came when a Dell-hosted clustered SQL Server system announced at the console that it had reached a critical internal temperature and was shutting down immediately to protect itself. This made several production departments grind to a complete halt. The first step was to put in a temperature probe with an IP address that could be SNMP-polled every few minutes. The data was logged and trended graphically, and the resulting report to senior management, complete with graphs, led to a dedicated AC unit being approved as a capital expense and installed in record time.
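The logging and trending step can be sketched in a few lines of Python. This is a minimal illustration, not the setup used in the story: it assumes the readings have already been collected (for example by an SNMP poller) as (hours elapsed, degrees Celsius) pairs, and the function names and the 27 and 32 degree thresholds are hypothetical values chosen only for the example.

```python
from statistics import mean


def temperature_trend(readings):
    """Return the least-squares slope, in degrees per hour.

    `readings` is a list of (hours_elapsed, temp_c) pairs that some
    poller has already logged; how they are collected is out of scope.
    """
    xs = [hours for hours, _ in readings]
    ys = [temp for _, temp in readings]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        # A single sample (or identical timestamps) has no trend.
        return 0.0
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return num / denom


def classify(temp_c, warn_c=27.0, critical_c=32.0):
    """Label a reading; the thresholds are illustrative assumptions."""
    if temp_c >= critical_c:
        return "critical"
    if temp_c >= warn_c:
        return "warning"
    return "ok"


# Example: a room warming steadily over a weekend afternoon.
weekend = [(0, 22.0), (1, 24.0), (2, 26.0), (3, 28.0)]
slope = temperature_trend(weekend)
status = classify(weekend[-1][1])
print(f"trend: {slope:+.1f} C/hour, latest reading is {status}")
```

A steady upward slope on a weekend is the kind of pattern that makes the case to management far better than a single alarming number.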

A second benefit of air-conditioning relates to filtered air. Manufacturing environments are often very dusty places. In uncontrolled environments, systems with cooling fans that draw or push air through a cabinet wind up coating all components with dust over time. Depending on the thickness and type of dust, overheating and/or short circuits can happen. Air-conditioning feeds to data centers should filter out dust and keep humidity at proper levels.

When planning for cooling systems in a data center, take power failure into consideration. Frequently, groups plan to keep the equipment and lights on, but overlook cooling. In the event of power failure, air-conditioning (or whatever the cooling system is) may very well be needed to protect sensitive electronics.

Lastly, don’t guess on the requirements. Consultants and vendors have formulas to determine the size of cooling systems based on current needs as well as future growth.
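As a rough sanity check before talking to a vendor, the heat load from the IT equipment alone can be estimated from its power draw. The sketch below uses two standard conversions (1 watt is about 3.412 BTU per hour, and 1 ton of cooling is 12,000 BTU per hour) and assumes that essentially all electrical power drawn by the equipment ends up as heat. The function name and the growth allowance are hypothetical, and the result ignores lighting, people, the building envelope and redundancy, all of which a proper sizing calculation would include.

```python
BTU_PER_HOUR_PER_WATT = 3.412   # standard conversion factor
BTU_PER_HOUR_PER_TON = 12_000   # one ton of refrigeration


def cooling_estimate(it_load_watts, growth_fraction=0.0):
    """Estimate cooling needed for IT equipment only.

    Returns (btu_per_hour, tons). `growth_fraction` is an assumed
    allowance for future equipment, e.g. 0.25 for 25 percent growth.
    """
    design_watts = it_load_watts * (1 + growth_fraction)
    btu_per_hour = design_watts * BTU_PER_HOUR_PER_WATT
    return btu_per_hour, btu_per_hour / BTU_PER_HOUR_PER_TON


# Example: 10 kW of equipment today with 25 percent growth allowed for.
btu, tons = cooling_estimate(10_000, growth_fraction=0.25)
print(f"{btu:,.0f} BTU/hr, about {tons:.1f} tons of cooling")
```

A figure like this is only a starting point for the conversation; the consultant's or vendor's formula should drive the actual purchase.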

Conditioned Power

IT systems need stable, reliable power. It is not cost-effective to buy dozens of good UPSes, so all too often IT buys dozens of cheap units to protect distributed systems. It is more economical to buy several good units that can each protect dozens, if not hundreds, of devices than to buy one-off power fixes.

First, lightning strikes need to be dealt with. Second, fluctuations in voltage, harmonics, EMI/RFI and other problems need to be removed. Third, in the event of an outage, there must be a solution that keeps the systems on-line long enough for a controlled shutdown, which may mean UPSes or a mixture of UPSes and generators. These types of solutions are very economical when applied to a large collection of systems, but become less so as the number of protected systems shrinks.
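Whether a UPS can carry the load long enough for a controlled shutdown can be checked with simple arithmetic. The sketch below is a first-order estimate only: it assumes usable battery energy in watt-hours, a constant load and a fixed inverter efficiency, whereas real battery runtime falls off nonlinearly at high loads and degrades with age, so the vendor's runtime charts should take precedence. All names, the 0.9 efficiency and the 1.5 safety factor are hypothetical values for illustration.

```python
def ups_runtime_minutes(battery_wh, load_watts, inverter_efficiency=0.9):
    """First-order runtime estimate: usable energy divided by load."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return battery_wh * inverter_efficiency / load_watts * 60


def covers_shutdown(runtime_min, shutdown_min, safety_factor=1.5):
    """True if the runtime exceeds the shutdown time with some margin."""
    return runtime_min >= shutdown_min * safety_factor


# Example: 1,000 Wh of battery carrying a 600 W load,
# with servers that need 20 minutes to shut down cleanly.
runtime = ups_runtime_minutes(1_000, 600)
print(f"estimated runtime: {runtime:.0f} minutes")
print("enough for shutdown:", covers_shutdown(runtime, shutdown_min=20))
```

If the estimate falls short, the options are the ones already named: more UPS capacity, or a generator that takes over before the batteries run down.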

Moreover, all these systems need maintenance and the fewer the better. Monitoring and swapping batteries in a handful of enterprise UPSes is better than trying to keep track of dozens of small UPSes spread all over. In the end, business needs and associated risks must drive the solution and thus the investment. IT must architect with centralization and/or consolidation in mind.

Fire Management

The best time to deal with a fire in a data center is when it is just starting. There are fire detection systems so sensitive that they can detect the increase in particulates and temperature as a group of people moves through a data center. These sensors go far beyond traditional smoke detectors and can send alerts via the network as well as by backup means. These systems can be deployed in a controlled environment such as a data center with much success. The whole idea is to detect a problem and react while the fire is still small and manageable.

By layering early detection with a corrective control, namely suppression, the risk of damage from fire can be further mitigated. Take the time to investigate fire suppression technologies that can put out fires without damaging electronics or leaving particulates. Examples include Inergen, which is a combination of gases, and Sapphire, which is a very interesting liquid that changes to a gas with very little additional energy. There are many options; the trick is to pick the one best suited to your needs, and expert guidance should be sought.

Using the threat of fire as an example, always think about how to compensate in layers. How can the risk be prevented? How can it be detected early on when the impact is minimal? How can the problem be corrected? Most times, a layered approach is more effective and reliable than any single method.

Water

For some data centers, flooding is a very real concern. In dedicated data centers, it is possible to elevate equipment, re-route water pipes, disconnect water sprinklers in favor of alternative fire suppression systems, protect key wiring, and install sump pumps, alarms and so on, all with the aim of reducing the risk of water damage in a particular location.

Summary

Environmental issues need to be addressed to ensure availability. The mixture of elements to consider depends on the data center, geographic location and so on. It is not the intent of this article to argue for total centralization, but rather pragmatic consolidation. Some systems must be located relatively near the user community and need to be protected regardless. In all cases, a balance must be struck between costs, risks and benefits.

In the end, it’s all about meeting the needs of the business. Today, when IT systems fail for whatever reason, it’s not just old-fashioned report printing that stops; it is the business that stops.

George Spafford is an IT consultant and a long-time IT professional. He focuses on compliance, management and process improvement.