Over the past decade, organizations have come to rely far more heavily on information technology (IT) systems to support business-critical applications. Banks, telecommunications companies, internet service providers, and cloud/co-location facilities depend heavily on the availability of their data centers, as many of their customers pay a premium for access to a variety of IT applications.

Because of this reliance, downtime can be both damaging and costly to an organization. In a recent white paper titled “Understanding the Cost of Data Center Downtime,” Emerson Network Power analyzed the financial impact of infrastructure vulnerability and found that the average cost of data center downtime was approximately $5,600 per minute. Based on an average reported incident length of 90 minutes, the average cost of a single downtime event was approximately $505,500. These costs stem from several factors, including data loss or corruption, productivity losses, equipment damage, root-cause detection and recovery actions, legal and regulatory repercussions, revenue loss, and long-term damage to reputation and trust among key stakeholders. While the white paper cited business disruption and lost revenue as the most significant cost consequences of downtime, less obvious costs such as losses in end-user and IT productivity also contributed significantly to the cost of an average downtime event (see figure 1).
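The per-event figure is essentially the per-minute rate multiplied by the average incident length. A quick back-of-the-envelope check with the rounded averages cited above (the small gap from the reported ~$505,500 simply reflects rounding in the published figures):

```python
# Back-of-the-envelope check on the Emerson averages cited above.
cost_per_minute = 5_600        # average downtime cost, USD per minute
avg_incident_minutes = 90      # average reported incident length, minutes

per_event = cost_per_minute * avg_incident_minutes
print(f"${per_event:,}")  # about $504,000, close to the reported ~$505,500
```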


Figure 1. Cost of downtime attributed to its causes.

When it comes to avoiding data center downtime, data center managers have a lot of bullets to dodge: equipment or system failure, natural catastrophes, and the excessive heat experienced by large parts of the country in recent months, to name a few.

Surprising as it may seem, human error is more often the cause of data center downtime than a technological glitch or natural disaster. According to the Emerson Network Power white paper “Addressing the Leading Root Causes of Downtime,” human error was the second leading root cause of reported unplanned outages (see figure 2). A study by The Uptime Institute also found that 70 percent of data center downtime is caused by human error. Today, it’s not just the systems that put uptime at risk, but also the people responsible for maintaining and operating them.

According to Emerson Network Power’s white paper “Understanding the Cost of Data Center Downtime,” downtime caused by human error accounts for nearly $300,000 in costs per incident. Over a period of ten years, downtime events related to human error can easily cost an organization in excess of $600,000. There is no doubt that human error in the data center can cause a great deal of downtime and financial loss, but how can data center managers keep these errors from happening? Below are ten simple steps that can help:

• Communicate often. Interdepartmental communication is a great starting point. Many types of professionals are linked to the data center, all with varying technical backgrounds, from the IT staff who handle email and Exchange services, to those managing networking and switches, to server administrators. Sometimes these groups are broken down into even smaller factions, making communication more difficult. Because of these separate functions, the groups often do not cross-pollinate or communicate, making human error more likely.

• Implement training. Ongoing training and precautionary policy development are also essential to preventing human error. Since data centers are highly complex, interconnected systems, training programs and exercises among the different IT groups that emphasize a holistic approach to data center management can help address the problem.

• Develop secure access policies. Implementing and adhering to secure access policies is critical. Organizations without data center sign-in policies run the risk of security breaches. A sign-in policy that requires an escort for visitors, such as vendors, lets data center managers know who is entering and exiting the facility at all times.
     All individuals with access to the data center, including IT, emergency, security, and facility personnel, should know the equipment well enough not to shut it down by mistake. Ongoing updates can keep this information top of mind with IT personnel.

• Dust off procedures and adhere to them. It might be a no-brainer, but consistent system operation is not always a given. Sometimes data center managers get too comfortable with the systems: they stop following procedures, forget or skip steps, or perform a procedure from memory and inadvertently shut down the wrong equipment. It is critical to keep all operational procedures up-to-date and to follow them as written.
    Pace University in New York needed emergency service when it experienced a complete load loss and its data center shut down at 4 a.m. on a Saturday. The cause was an old capacitor on a UPS that had not been replaced before reaching the end of its life expectancy. Basic maintenance on the unit was otherwise up-to-date; the failure was simply a result of age. Because every university department was fully automated, it was important that the data center be back online by Monday morning when classes began. Failures like this do not happen often, but when capacitors do fail, they fail forcefully.

    This is why a documented method of procedure (MOP) is crucial. A standard MOP, a step-by-step, task-oriented procedure, mitigates or eliminates the risk associated with performing maintenance and can head off many otherwise unforeseen human errors. Ensure backup plans are included in case of unanticipated events. The document itself is crucial, and so is auditing it regularly for accuracy.

• Update one-line diagrams. Updated one-line diagrams should be standard in every data center, yet more often than not, even large Fortune 500 data centers do not have an accurate one-line diagram available. With design upgrades taking place all the time, it is common to have multiple drawings and difficult to determine which is the most accurate. If the wrong diagram is used when making adjustments to the data center, human error is inevitable. One-line diagrams should be updated whenever the data center is upgraded, because an accurate diagram is crucial to knowing the facility’s power and cooling capabilities. At minimum, an annual review of one-line diagrams and procedures is recommended, especially prior to equipment maintenance.

• Cover emergency off buttons. One of the most common human errors in the data center involves emergency power off (EPO) buttons, which are generally located near doorways. Many times these buttons are not covered or labeled and are mistakenly pushed, shutting down power to the entire data center. Unintentional shutdowns can easily be avoided by labeling and covering emergency off buttons. The covering can be as simple as a small cage that leaves the button accessible but makes it harder to push inadvertently.

About a month after opening a new facility, the director of data center services for a Michigan-based health-care system got a call. It was Easter morning, and a contractor had accidentally activated the EPO switch as he tried to replace a module connecting the button to the fire alarm system. According to the director, the fiasco "took the data center out." The health system is large with more than 45,000 employees, almost 400 outpatient clinics and facilities, and $6 billion in annual revenue. So when a data center goes down, it's a big deal.

Fortunately, the health-care center didn't have any major clinical systems online in the 12,000-square-foot data center because it was so new. Also, the organization had tried to discharge as many patients as possible so they could spend the holiday with family and friends rather than in the hospital.

“We went out at 8:30 that morning,” the director said. “By 11:30 that night, we were probably 95 percent up and going, so we were pretty lucky.”

• Ensure correct labeling. Correct labeling is also a growing concern. If protection devices such as circuit breakers are labeled incorrectly, they can directly jeopardize the data center’s ability to keep the load up. To operate a power system correctly and safely, all switching devices and the facility one-line diagram must be labeled accurately to ensure the proper sequence of operation. Procedures should be in place to double-check device labeling.

• Avoid food, drink, and other contaminants in the data center. Food and drinks should never be permitted in the data center. Liquids pose the greatest risk because they can short out critical computer components. Data center managers should post a food/drink policy outside the data center door and strictly enforce it.


Figure 2. Causes of downtime.

The same goes for contaminants. Poor indoor air quality allows unwanted dust particles and debris to enter servers and other IT infrastructure. This can be alleviated by having all personnel who access the data center wear antistatic booties or by placing a mat outside the data center. Moving equipment inside the data center increases the chances that fibers from boxes and pallets will end up in server racks and other IT equipment, so equipment should always be packed and unpacked outside the data center.

Contaminants can also be introduced by new equipment, such as air economizers that bring in outside air containing particles and other impurities. Often this air does not meet humidity requirements either. Policies should be in place to ensure new equipment meets all standards and requirements for the data center.

• Secure cages and racks to protect equipment. Enclosing equipment in lockable wire security cages and racks prevents unwanted contaminants and unauthorized persons from affecting the availability of the data center. These racks should be locked, with key access controlled. Wire mesh partitions work well because they fit nearly any configuration, from a continuous long wall to doors between blade server racks, and they do not inhibit ventilation or heat discharge. With locked, secure cages, personnel such as janitorial staff, salespeople, or maintenance crews who need access to the room can perform their duties without disrupting servers.
  A case in point: dust bunnies collect everywhere, even in server rooms with rubber floors, sleek black racks, and loads of fans. Why this happens has eluded even the finest scientific minds. What’s crazy is why offices routinely allow their cleaning people access to critical server rooms during unsupervised off-hours.
  Sometimes the cleaners know to leave the humming, blinking boxes alone. Other times, as in one instance, they found dust bunnies not just around the server racks but inside them as well. They opened the racks and saw servers with their cases partially open and little dust bunnies bouncing around. So the cleaning people did what cleaning people do: they cleaned around the racks, inside the racks, and inside the servers ... with Windex. Needless to say, the equipment malfunctioned and caught fire.

• Perform preventive maintenance (PM). Human error can also be reduced by having PM performed by qualified, trained third-party companies. While end-users can provide basic preventive support, such as replacing dirty air filters, ensuring environmental specifications are met, and maintaining and monitoring the UPS for alarms, they may not have the training for more specialized PM work.

When choosing a service provider, seek out a group that offers a comprehensive portfolio of services. Service should be customized to satisfy data center requirements. In addition, a service provider should support the following to minimize your time to recovery should you experience a downtime event:

• 24x7 emergency services

• Guaranteed response times

• Local spare parts availability

• End-user training seminars detailing best practices and service tips

 

The service provider should also employ highly trained technicians who engage in ongoing industry training.

The frequency of PM visits depends on the type of UPS being utilized in the organization. Small UPS devices should be inspected annually to ensure alarms, filtering, and internal batteries are all operating within specifications.

For medium and large systems, which most likely include ancillary equipment, inspection and maintenance should take place at least twice a year to confirm proper function and verify that the system is operating within the manufacturer's specifications. Emerson enlisted a Ph.D.-level mathematician to help develop a mathematical model that takes the unit-related outages that occurred on these systems and projects the impact of PM on UPS reliability. The calculations indicate that the UPS mean time between failures (MTBF) for units receiving two PM service events per year is 23 times higher than for units receiving none. At the expected levels of service error attributed to a Liebert-trained and certified service engineer, UPS reliability continued to increase steadily up to 19 PM visits per year. The real-world analysis and mathematical model reaffirmed the long-held industry belief that increasing the number of PM visits substantially increases system reliability.
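As a rough illustration of how an MTBF figure is derived and how PM frequency can shift it, consider the standard definition: total operating hours divided by the number of failures observed. The fleet sizes and failure counts below are hypothetical placeholders, not Emerson's data, and the actual model described above is far more sophisticated:

```python
# Illustrative MTBF arithmetic: MTBF = total operating hours / number of failures.
# All fleet sizes and failure counts below are hypothetical placeholders.

HOURS_PER_YEAR = 8_760

def mtbf_hours(unit_count, years_observed, failures):
    """Estimate fleet-wide mean time between failures, in hours."""
    total_hours = unit_count * years_observed * HOURS_PER_YEAR
    return total_hours / failures

# Hypothetical fleets: 100 units observed for 5 years each.
no_pm  = mtbf_hours(100, 5, failures=50)   # units receiving no PM visits
two_pm = mtbf_hours(100, 5, failures=10)   # units receiving two PM visits per year

print(f"No PM:       {no_pm:,.0f} h MTBF")
print(f"Two PM/yr:   {two_pm:,.0f} h MTBF")
print(f"Improvement: {two_pm / no_pm:.0f}x")
```

The point of the sketch is only that fewer observed failures over the same operating hours translates directly into a proportionally higher MTBF, which is the quantity the Emerson model projects.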

The data center is like a living entity, constantly growing and changing, with many moving parts. Given the risk that man-made mistakes pose to the data center, those who enter these facilities must adhere to rules and policies to prevent disasters.

Human error is a fact of life. We should anticipate it, try to understand it, and learn from past mistakes so that we can minimize its occurrence and its impact on our critical systems.

By implementing these best practices, data center managers can greatly decrease the chances of data center downtime caused by human error.

 

 

Reprints of this article are available by contacting Jill DeVries at devriesj@bnpmedia.com or at 248-244-1726.