Data Centre Reliability Checklist

Planning and building a data centre can be one of the most expensive tasks an IT director faces. Reliability is key to maximizing cost effectiveness and achieving optimum performance.

A data centre can range in size from one room in an office to an entire building, but some basic requirements must be met in every case to ensure system reliability. Careful planning at the design stage is essential, and a number of areas must be addressed to produce a dependable, efficient system capable of continuous operation.

Understand the potential causes of failure

There are a number of areas cited as the most common causes of data centre failure:

• Environmental problems
• Software failure – for example, memory leaks
• Hardware failure – such as storage or processing problems
• Operator or procedural error
• Poor network reliability
• Security breaches – for example, a hacker attack

Environmental considerations

When planning a data centre, there are a number of physical and architectural design features which must be implemented to ensure reliability:

• Adequate air supply: temperature must be maintained between 20 and 25 °C and relative humidity between 40 and 60 %. Too much humidity can cause water to condense on internal components, while air that is too dry allows static electricity to build up and discharge. Failure to maintain these ranges is one of the prime causes of data centre malfunction. Adequate air conditioning, and an architectural design that allows air to circulate between units, are vital. Particular care needs to be taken to prevent “hotspots” from occurring.

• Safeguard against power loss: external events such as hurricanes or snowstorms can cause power blackouts. It is vital to have an uninterruptible power supply (UPS) for immediate emergency power, as well as a generator to ensure continued operation during longer outages. Both should be large enough to power the cooling systems as well as the IT equipment.

• Fire protection systems: the simplest form of fire protection is a smoke detector, which provides early warning of a fire. It is also vital to contain a fire so that it cannot spread to the entire data centre, for example with contained sprinkler systems or gaseous fire suppression.
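The temperature and humidity ranges above lend themselves to a simple automated check. The following is a minimal sketch only: the two ranges come from this checklist, while the function name `check_reading`, the rack labels, and the sample readings are hypothetical and stand in for whatever sensor data a real monitoring system would supply.

```python
# Ranges taken from the checklist above; everything else is illustrative.
TEMP_RANGE_C = (20.0, 25.0)
HUMIDITY_RANGE_PCT = (40.0, 60.0)


def check_reading(temp_c, humidity_pct):
    """Return a list of warnings for one sensor reading (empty if in range)."""
    warnings = []
    if temp_c < TEMP_RANGE_C[0]:
        warnings.append("temperature below range")
    elif temp_c > TEMP_RANGE_C[1]:
        warnings.append("temperature above range (possible hotspot)")
    if humidity_pct < HUMIDITY_RANGE_PCT[0]:
        warnings.append("humidity too low (static discharge risk)")
    elif humidity_pct > HUMIDITY_RANGE_PCT[1]:
        warnings.append("humidity too high (condensation risk)")
    return warnings


# Hypothetical readings keyed by rack position: (temperature in C, humidity in %)
readings = {"rack-A1": (22.5, 45.0), "rack-B3": (27.0, 38.0)}

for rack, (temp_c, humidity_pct) in readings.items():
    for warning in check_reading(temp_c, humidity_pct):
        print(f"{rack}: {warning}")
```

Checking each rack position separately, rather than one room-wide average, is what allows a localized hotspot to show up.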

Software, hardware or network failure

Tested, quality-assured hardware and software from reputable vendors can help increase reliability. A malfunction in one component, such as an internal fan or storage disk, can quickly lead to failure in another. Network performance and reliability also have a major impact on the performance of the system as a whole.

Operational procedures

It is impossible to rule out human error and operational issues completely. However, it is key to devise operating procedures that not only maximize performance but also track reliability and malfunctions. Conduct regular backups of each production server so that files can be restored quickly in the event of damage. Provide adequate operator training so that procedures are followed and the most basic errors are avoided, such as leaving discs in drives, which can prevent an automatic reboot after a system failure.
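One way to track whether regular backups are actually happening is to compare each server's last backup time against a maximum allowed age. This is a rough sketch under stated assumptions: the 24-hour limit, the server names, and the timestamps are invented for illustration, and in practice the timestamps would come from the backup tool's own logs.

```python
from datetime import datetime, timedelta

# Assumed policy for illustration; the checklist does not specify an interval.
MAX_BACKUP_AGE = timedelta(hours=24)


def stale_backups(last_backup, now, max_age=MAX_BACKUP_AGE):
    """Return sorted names of servers whose last backup is older than max_age."""
    return sorted(
        name for name, timestamp in last_backup.items() if now - timestamp > max_age
    )


# Hypothetical production servers and the time of their last completed backup
now = datetime(2024, 1, 10, 12, 0)
last_backup = {
    "web-01": datetime(2024, 1, 10, 2, 0),  # 10 hours old
    "db-01": datetime(2024, 1, 8, 2, 0),    # 58 hours old
}

print(stale_backups(last_backup, now))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check easy to test.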

Data security

Adequate physical security is particularly important in large data centres that hold sensitive information. Corporations may consider outsourcing their data centre to an off-site location with 24-hour security guards and video surveillance. System security also requires keeping security and anti-virus software up to date.

Avoid single point of failure

One final key consideration is to avoid having a single point of failure. Test the system before it goes into operation and ensure that, if one component fails, there is sufficient redundancy for the data centre to keep functioning. Regular backups help ensure that important data is not lost.
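A simple inventory review can flag single points of failure before testing begins. The sketch below assumes a hypothetical inventory that maps each role to the units that fulfil it; the role names, unit names, and the 99 % availability figure are illustrative only. The second function applies the standard formula for redundant units, 1 - (1 - a)^n, which holds only if the units fail independently of one another.

```python
def single_points_of_failure(inventory):
    """Return sorted roles that are served by fewer than two units."""
    return sorted(role for role, units in inventory.items() if len(units) < 2)


def combined_availability(unit_availability, n_units):
    """Availability of n redundant units, assuming independent failures."""
    return 1 - (1 - unit_availability) ** n_units


# Hypothetical inventory: role -> units able to perform that role
inventory = {
    "power feed": ["feed-1"],
    "cooling": ["crac-1", "crac-2"],
    "core switch": ["sw-1"],
    "generator": ["gen-1", "gen-2"],
}

print(single_points_of_failure(inventory))

# Effect of adding a second unit with an assumed 99 % availability
print(f"{combined_availability(0.99, 1):.4f}")
print(f"{combined_availability(0.99, 2):.4f}")
```

The independence assumption matters: two units that share a power feed or a cooling loop can fail together, so the real gain from duplication is often smaller than the formula suggests.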

Source by Amy Nutt