Administrators constantly assess data center and network capacity needs in preparation for a potential disaster but are often at a loss to respond in an optimal way to balance capacity and performance during an actual disaster. Disasters may be natural, or caused by a failure of power, cooling, or IT equipment. Mitigation plans must integrate data and application priorities with the available backup resources.

LESS DEPENDABLE ELECTRICITY

The average data center in the U.S. experiences about five power outages over a two-year period, with each lasting an average of 106 min, according to a detailed study by the Ponemon Institute. A full 88 percent of respondents experienced at least one power outage in the last two years. Although 69 percent of respondents experienced between one and five outages over the past two years, a majority continue to have confidence in the local electric utility. Recognizing the inevitability of the occasional power outage, most data centers (100 percent of those ranked best-in-class according to Aberdeen Group) have a UPS.

Data center operators should expect to experience more power problems in the future for a new reason: the electrical grid is rapidly nearing a crisis. According to the U.S. Energy Information Administration (EIA), the world’s power plants were capable of generating around 4,500 gigawatts (GW) of electricity in 2007, and with demand increasing by 1.7 percent annually, total demand is expected to exceed 7,000 GW by 2035. Safety concerns over nuclear facilities and CO2 emissions from coal- and gas-fired power plants will prevent sufficient additional capacity from being added to meet the growing demand. And renewables, such as wind and solar, will be unable to make up the difference.
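
The growth projection above is simple compound arithmetic. The sketch below reproduces it under one stated assumption: the 2007 figure of 4,500 GW is treated as the baseline and grown at a constant 1.7 percent per year. The function name is illustrative and not drawn from the EIA.

```python
def project_demand(base_gw: float, annual_growth: float, years: int) -> float:
    """Compound a starting value at a fixed annual growth rate."""
    return base_gw * (1.0 + annual_growth) ** years


# Assumption: 4,500 GW in 2007 is the baseline; growth is a constant 1.7% per year.
projected_2035 = project_demand(4500.0, 0.017, 2035 - 2007)
print(f"Projected 2035 figure: {projected_2035:,.0f} GW")
```

Under these assumptions the result lands somewhat above 7,000 GW, which is consistent with the figure cited.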

As the crisis worsens, businesses will be asked to cut back on their power consumption during peak periods, and data centers (as major users) will likely be asked to participate to some extent. Utilities call these programs demand response (DR) or demand-side management, and they can be expected to become both more frequent and more severe. Should these programs fall short of balancing supply with demand, utilities would be forced to implement rolling brownouts and blackouts.

The impending crisis is already forcing utilities and their regulators to take action. The Electric Reliability Council of Texas (ERCOT), for example, institutes reductions from 3 to 7 p.m. during periods of extreme heat when air conditioners increase the total load on the grid, and if demand is not reduced sufficiently, local utilities will be permitted to implement rolling blackouts lasting from 15 to 45 min.

PREPARE FOR POWER PROBLEMS

The best disaster recovery and business continuity plans address multiple scenarios, ranging from the brief brownout to the extended blackout, including weeklong outages when fuel supply services for the standby generator(s) may no longer be available. Ideally, these plans are created by a cross-functional team that makes continuous improvements based on regularly scheduled tests. Such best practices are difficult to achieve, however, owing in large part to the difficulty of testing the plan to ensure its effectiveness. According to the AT&T 2010 Business Continuity Study, fewer than half of U.S. companies have tested their plans in the past year, and more than a quarter have never tested them. The Aberdeen Group found that even among best-in-class organizations, only 71 percent test disaster recovery plans regularly.

Part of the problem is a lack of tools needed to formulate, test, and implement effective response plans for various power and/or cooling failure scenarios. Gartner has begun to track two such new tools. One is the data center energy management (DCEM) system that monitors, measures, manages, and/or controls energy and the data center environment. The other is the data center infrastructure management (DCIM) system that monitors, measures, manages, and/or controls data center performance, utilization, and energy consumption of IT-related equipment in single or multiple data centers. Given the increasing importance of power management in data centers, Gartner expects some 60 percent of businesses will be using a DCEM and/or DCIM system by 2014 (see the sidebar).

The ability of the DCIM system to perform real-time monitoring, trend analysis, and forecasting can help data center operators formulate more realistic responses to different outage events. An in-depth understanding of power utilization and application tiers, patterns, and behaviors is needed to respond optimally to different failure scenarios and to better satisfy critical needs during any disaster. Figure 1 shows a sample of the information available.

Some DCIM tools also support runbook automation. Runbooks automate the many steps involved in responding to outages and demand response events, thereby making it far easier to test and tune the plans, and minimizing human error during an actual event. Special runbooks can even be created to periodically test redundancy provisions while the systems are under different loads in preparation for a potential failure.

Consider, for example, the failure of the computer room air conditioner. Without any response, as shown in figure 2, the temperature would begin to rise in the data center. At some point, operators would need to begin capping power or even shutting down some servers and/or storage systems to prevent hot spots from forming. Without automation, errors in selecting the right systems and taking the right steps at the right time and in the right order are inevitable. By contrast, runbook automation enables an immediate response by power-capping nonessential servers and storage tiers, thereby prolonging the operation of mission-critical applications.
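
To make the cooling-failure response concrete, here is a minimal sketch of how such a runbook might decide its steps. It does not represent any particular DCIM product. The temperature thresholds, server names, and the numeric tier encoding (4 for mission-critical down to 2 for nonessential) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    tier: int  # 4 = mission-critical, 3 = highly desirable, 2 = nonessential


# Hypothetical escalation thresholds for cold-aisle temperature, in deg C.
CAP_AT_C = 30.0
SHUTDOWN_AT_C = 35.0


def runbook_actions(servers, cold_aisle_c):
    """Return ordered (action, server name) steps for the current temperature."""
    actions = []
    # Visit the least critical tiers first so the order is deterministic.
    for server in sorted(servers, key=lambda s: (s.tier, s.name)):
        if server.tier >= 4:
            continue  # mission-critical systems are left untouched in this sketch
        if cold_aisle_c >= SHUTDOWN_AT_C and server.tier == 2:
            actions.append(("shutdown", server.name))
        elif cold_aisle_c >= CAP_AT_C:
            actions.append(("power_cap", server.name))
    return actions


fleet = [Server("erp-db", 4), Server("intranet", 3), Server("batch-01", 2)]
for temperature in (27.0, 31.0, 36.0):
    print(temperature, runbook_actions(fleet, temperature))
```

Because the decision logic is a pure function of the inventory and the temperature reading, the same plan can be rehearsed against simulated readings before an actual event.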

APPLICATION SERVICE LEVEL TRIAGE

In the cooling failure example, a prolonged outage escalates the situation, making it necessary to shed or shift additional loads, potentially reducing service levels even further for nonessential applications, and ultimately, affecting some fairly critical ones. Determining the optimal order to escalate the response requires a form of triage.

Service levels may apply to the entire data center, as is often the case for managed services. In other situations, each application must be assigned to an appropriate tier (IV for mission-critical, III for highly desirable applications, and II for nonessential). Tiers enable resource allocations to be made in a way that satisfies needs in priority order, regardless of the extent or duration of the outage or disaster.
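
One simple way to satisfy needs in priority order is a greedy allocation: grant power to the highest tier first and give lower tiers whatever remains. The sketch below illustrates that idea only. The application names, demand figures, and the 70 kW budget are invented for the example.

```python
def allocate_by_tier(apps, budget_kw):
    """Grant power in tier order (highest tier first) until the budget runs out.

    apps is a list of (name, tier, demand_kw) tuples; returns {name: granted_kw}.
    """
    allocation = {}
    remaining = budget_kw
    for name, tier, demand_kw in sorted(apps, key=lambda app: -app[1]):
        granted = min(demand_kw, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


# Hypothetical applications: (name, tier, power demand in kW).
apps = [
    ("reporting", 2, 30.0),
    ("order-entry", 4, 50.0),
    ("email", 3, 40.0),
]
plan = allocate_by_tier(apps, budget_kw=70.0)
print(plan)
```

With only 70 kW available, the tier IV application is fully served, the tier III application is partially served, and the tier II application receives nothing.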

The failure or limitations of common data center infrastructure components during any disaster also require a form of triage to ensure that available resources are being used to satisfy the most critical needs. For outages in the cooling system, it is also important to consider the cold-aisle temperature. A facility operating near the American Society of Heating, Refrigerating and Air-Conditioning Engineers’ (ASHRAE) recommended upper limit of about 80°F (27°C) might need to shed load fairly quickly. For server capacity triage, the transaction performance of available systems at different power capping settings will need to be factored into load shedding and shifting decisions to ensure the effectiveness and proper implementation of the disaster recovery plan, as shown in figure 3.

What is often missing from triage assessment is an accurate understanding of the actual power used by specific applications and servers. A DCIM or DCEM system that monitors power to the individual server level helps, but even greater granularity may be needed, especially in virtualized environments, to understand power consumption at the application level under various loads.

A new standard, UL2640, from Underwriters Laboratories (UL) gives IT managers just such an additional level of granularity. Intended to help organizations choose more energy-efficient systems, UL2640 employs the PAR4 Efficiency Rating system for determining both absolute and normalized (over time) energy efficiency of both new and existing equipment on a transactions per second per watt basis. To calculate server performance using the UL2640 standard, a series of standardized tests is performed, including a power-on spike test, a boot-cycle test, and a benchmark test. The test results can then be used to determine idle and peak power consumption, along with transactions/second/watt and other useful efficiency ratings. These metrics can then be used to determine how best to shed and shift loads, as well as the extent to which power should be capped for different service levels.
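
As a rough illustration of how a transactions-per-second-per-watt figure could feed a load-shedding decision, the sketch below divides measured throughput by measured power and ranks servers from least to most efficient. It is not the UL2640 or PAR4 test procedure. The server names and measurements are hypothetical.

```python
def transactions_per_second_per_watt(tps: float, watts: float) -> float:
    """Throughput delivered per watt of power drawn."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return tps / watts


# Hypothetical benchmark results: name -> (transactions per second, watts).
measurements = {
    "srv-a": (12000.0, 400.0),
    "srv-b": (9000.0, 250.0),
    "srv-c": (5000.0, 250.0),
}

efficiency = {
    name: transactions_per_second_per_watt(tps, watts)
    for name, (tps, watts) in measurements.items()
}

# Least efficient servers are the first candidates for shedding or capping.
shed_order = sorted(efficiency, key=efficiency.get)
print(shed_order)
```

In practice the ranking would be combined with each application's tier, since an inefficient server may still host a mission-critical workload.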

REDUNDANT DATA CENTERS

Many of these preparations can also prove valuable with the ultimate business continuity configuration: fully redundant data centers separated geographically. With such a configuration, it is possible for aggregate server capacity to be made sufficient to support all critical applications during a total failure at one data center. In these situations, the runbooks contain the detailed steps needed to shift the most critical applications (including the servers, storage, and networking) from/to available locations and resources with minimal service interruption, potentially including the powering down/up of servers as conditions change throughout the course of the disaster.

Multiple data centers can also be leveraged to form a sophisticated automated demand response (DR) capability. Peak demand occurs at around the same local time of the day, but with data centers in different time zones, the local load at one can be shifted to the other.

In addition, organizations can reduce costs even further using a “follow the moon” strategy, with load being shifted to the data center where the rates for electricity are the lowest. That is typically at night, when ambient temperatures are also at their lowest, thus saving on cooling costs.
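
A "follow the moon" decision can be sketched as choosing, for any given hour, the site with the lowest electricity rate. The example below assumes two hypothetical sites with fixed UTC offsets (ignoring daylight saving time), a night window of 10 p.m. to 6 a.m. local time, and invented peak and off-peak rates in dollars per kilowatt-hour.

```python
# Hypothetical sites: fixed UTC offset and invented rates in $/kWh.
SITES = {
    "us-east": {"utc_offset": -5, "peak_rate": 0.12, "offpeak_rate": 0.06},
    "asia-pacific": {"utc_offset": 8, "peak_rate": 0.14, "offpeak_rate": 0.07},
}


def local_hour(utc_hour: int, utc_offset: int) -> int:
    """Convert an hour of the day in UTC to local time at a fixed offset."""
    return (utc_hour + utc_offset) % 24


def rate_at(site: dict, utc_hour: int) -> float:
    """Return the site's rate, treating 22:00-06:00 local time as off-peak."""
    hour = local_hour(utc_hour, site["utc_offset"])
    is_night = hour >= 22 or hour < 6
    return site["offpeak_rate"] if is_night else site["peak_rate"]


def cheapest_site(utc_hour: int) -> str:
    """Pick the site with the lowest rate at the given UTC hour."""
    return min(SITES, key=lambda name: rate_at(SITES[name], utc_hour))


for utc_hour in (4, 16):
    print(utc_hour, cheapest_site(utc_hour))
```

A production scheduler would also weigh migration cost, latency, and available capacity at the receiving site, none of which is modeled here.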

Power is precious and becoming more so. The steps organizations take to manage power utilization more effectively in the data center will help improve business continuity during disasters and, if done right, can help pay their way with lower operating costs under normal conditions. 

 

Reprints of this article are available by contacting Jill DeVries at devriesj@bnpmedia.com or at 248-244-1726.

SIDEBAR

DCI/EM Tools

The basic DCI/EM systems help data centers improve overall power usage effectiveness (PUE), which is the ratio of the total power consumed by the facility to the power used by the IT equipment. Improving PUE from 2.3 (fairly typical today) to 1.3 (a target established by the U.S. Environmental Protection Agency), for example, nearly doubles (by a factor of about 1.8) the power available for IT equipment at the same total facility load.
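
The PUE arithmetic can be checked directly. Because PUE is total facility power divided by IT power, the IT share of a fixed facility load is the total divided by PUE. The 1,000 kW facility load below is an assumed figure chosen only to make the numbers easy to read.

```python
def it_power_kw(total_facility_kw: float, pue: float) -> float:
    """IT power available at a given total facility load and PUE."""
    return total_facility_kw / pue


TOTAL_KW = 1000.0  # assumed facility load, for illustration only

before = it_power_kw(TOTAL_KW, 2.3)
after = it_power_kw(TOTAL_KW, 1.3)
gain = after / before

print(f"IT power at PUE 2.3: {before:.0f} kW")
print(f"IT power at PUE 1.3: {after:.0f} kW")
print(f"Improvement factor: {gain:.2f}x")
```

The improvement factor is simply 2.3 divided by 1.3, so it does not depend on the facility load chosen.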

The greater efficiency helps during a power and/or cooling outage by enabling the standby generator to support more mission-critical applications for a longer period of time. The best DCI/EM systems will help during demand response (DR) events by automating the steps involved to shed non-critical loads, to shift critical loads to different virtualized servers or even different data centers, and/or to let the temperature rise while carefully monitoring ambient conditions to prevent hot spots from forming. The same automation capabilities also facilitate regular testing of disaster recovery plans.