
Around the clock

Availability best practices enhance system uptime and minimize risks.

by Yuri Pinzon and Mary Pat Simmons

Heat in a data center rises to a critical temperature. The systems stop working, and important information is not delivered when and where it is needed.

A controller goes bad, and spare parts have to be ordered. While waiting for the parts, the system support team begins troubleshooting blindly, possibly creating a new set of problems. The outage lasts five days, and the disruption escalates.


The data warehouse environment is not immune to unforeseen risks and catastrophe. Whether planned or unplanned, downtime can impede 24x7 accessibility to an organization's decision-support system and prevent the right information from getting to decision makers at the right time.

Rather than put the organization's decision-making ability at risk, availability best practices should be implemented. Starting with the development and use of robust software and proceeding with operator certification training and well-documented procedures, availability best practices cover the gamut of a data processing operation. Using proven techniques can mitigate risks from an outage by eliminating or reducing the length of interruption.

The key to establishing and maintaining a high-availability system is to examine an organization's infrastructure and effectively manage tangible attributes such as people, tools, processes and IT assets so they can be supervised, selected, administered and budgeted. Adopting availability strategies and implementing proper tools and features can enable organizations to minimize repair time.

Crisis support team
Each person who interacts with the system—operations, support staff, third-party vendors and users—has a role and task that work in conjunction to help ensure system availability.

First and foremost, a crisis coordinator should be appointed by management. As the one in charge during a crisis, the coordinator will orchestrate tasks to ensure each team member focuses as intently on data integrity and system availability while the system is down as they do when it is operational. In addition, the coordinator should be aware of any service level agreements (SLAs), including those documented for each person, role or task.

7 attributes of effective availability management

1. Environment. The physical conditions surrounding the IT assets
2. Infrastructure. Which IT assets are deployed and how they work together
3. Technology. The feature/functionality of each IT asset
4. Support level. Maintenance services to keep all IT assets running
5. Operations. Administrative services to manage daily operations
6. Data protection. Prevention of data loss, corruption and intrusion
7. Recoverability. Ability to recover data and user access after an outage

When an outage occurs, any delays in contacting decision makers can prolong the outage, so contact information for critical staff must be readily available to operations and the crisis coordinator. It's then the role of operations and application support team members to get the system operational again.

Tools such as Teradata Manager and Teradata Viewpoint are available for system and application monitoring and help maintain system availability and ensure data accessibility. Teradata Manager serves as a comprehensive point of control for the Teradata Database and allows an administrator to easily identify performance inconsistencies that require attention. Teradata Viewpoint provides the administrator and business user simple portal-based access to status information on their servers and queries. In a typical, healthy environment, these tools work quietly and unobtrusively; what matters most is the action taken when an alert is escalated, an application misbehaves or a system resource runs low.

Operational staff should also have access to a searchable database that maintains historical records of prior failures with the resolution and, if available, the root cause. If the situation occurred previously in a test or development environment, the reason it occurred in a production environment should be investigated to lessen the chance that it will happen again.
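As a rough illustration of such a failure history, the following Python sketch keeps prior incidents in memory and searches them by keyword. The record fields, function names and sample entries are all hypothetical; a real deployment would more likely sit on top of a ticketing system or database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentRecord:
    """One prior failure, its resolution and (if known) its root cause."""
    summary: str
    environment: str                  # e.g. "production", "test", "development"
    resolution: str
    root_cause: Optional[str] = None  # None when no root cause was found


def search_incidents(records: List[IncidentRecord], keyword: str) -> List[IncidentRecord]:
    """Return records whose summary, resolution or root cause mentions keyword."""
    needle = keyword.lower()
    matches = []
    for record in records:
        text = " ".join([record.summary, record.resolution, record.root_cause or ""])
        if needle in text.lower():
            matches.append(record)
    return matches


def seen_before_production(records: List[IncidentRecord], keyword: str) -> bool:
    """True if a matching failure was already recorded outside production."""
    return any(r.environment != "production"
               for r in search_incidents(records, keyword))


# Hypothetical sample history, for illustration only.
history = [
    IncidentRecord("Disk controller fault on node 3", "test",
                   "Replaced controller", "Firmware mismatch"),
    IncidentRecord("Cabinet overheating", "production",
                   "Restored cooling unit"),
]
```

A call such as `seen_before_production(history, "controller")` flags failures that were first observed in test or development, which is the case the paragraph above says deserves follow-up investigation.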

Continuous availability processes
The organization's management and IT staff should combine the right technology with operational processes and strategies to ensure ongoing system operation and availability. The following recommendations help keep the system functioning at all times, even when it encounters unforeseen issues or when system upgrades are introduced:
Manage change. Documenting, reviewing, testing and monitoring any adjustments to the environment that may affect system users are precautionary tasks that save time and avoid aggravation. Before any modification is implemented, the proposed change should be reviewed or tested against established success criteria. Any change to the hardware, software, environment, network or facilities that does not meet the criteria should not move forward, and every change that does move forward should have a documented back-out plan.
Decrease software upgrade downtime. Once a change is tested and approved, expect some associated downtime for its implementation. Tools and strategies offered in the Teradata environment help upgrades run as quickly and smoothly as possible. The Parallel Upgrade Tool spools the necessary packages to the node before the actual change window.

For packages that require a reboot or a kernel rebuild, version migration and fallback enable upgrades to the alternate boot environment while the database is online. Once the maintenance window is entered, the change is a reboot away from the new environment. After the upgrade is completed, verification scripts are run to determine whether the change was successful. Unsuccessful changes can be resolved during the maintenance window or the system can be switched back to the original environment if fallback is enabled.

Multiple-system architectures can be used to altogether avoid planned outages during system upgrades. When multiple systems are synchronized, business-critical applications and users can be directed to an alternate system so the work is uninterrupted.

Minimize planned downtime. In a typical environment, multiple restarts may be required to change a memory module or adapter. Retaining a supply of stock parts on-site or having them readily available off-site is advisable so all work can be completed on a single trip. Node issues can be resolved with little impact on system performance or availability when hot standby nodes (HSNs) are used. Also, with remote virtual private network (VPN) connectivity in place, the Teradata Support Center can respond expeditiously.
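The review gate described under "Manage change" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a Teradata tool: the `ChangeRequest` structure, its field names and the `may_proceed` rule are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ChangeRequest:
    """A proposed change to hardware, software, environment, network or facilities."""
    description: str
    # Hypothetical success criteria: criterion name -> whether review/testing met it.
    criteria_results: Dict[str, bool] = field(default_factory=dict)
    back_out_plan: str = ""  # documented steps to undo the change


def may_proceed(change: ChangeRequest) -> bool:
    """Allow a change only if it has criteria, all are met, and a back-out plan exists."""
    has_criteria = bool(change.criteria_results)
    all_met = all(change.criteria_results.values())
    has_back_out = bool(change.back_out_plan.strip())
    return has_criteria and all_met and has_back_out
```

Requiring at least one criterion avoids approving a change simply because nobody defined what success looks like.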

High-availability technology
Technology and system architecture are by far the most critical components of availability best practices. The goal, of course, is a system that is available to users without interruption. Since some downtime is unavoidable, the focus is on isolating and then eliminating every single point of failure. For example, mirroring a disk helps maintain access to data if that disk fails. Replicating an access module processor (AMP) via fallback or duplication on another system helps expedite recovery from problems with a troublesome AMP. And using multiple power sources guards against a power failure.
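The effect of removing a single point of failure can be estimated with standard availability arithmetic. The Python sketch below assumes component failures are independent, which real systems only approximate, and the availability figures used with it are illustrative rather than Teradata measurements.

```python
import math

HOURS_PER_YEAR = 24 * 365


def series_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(components)


def redundant_availability(component: float, copies: int) -> float:
    """Availability of `copies` redundant components where any one suffices.

    Assumes failures are independent: the group is down only when all copies are down.
    """
    return 1.0 - (1.0 - component) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

Under these assumptions, a single disk that is 99 percent available is expected to be down roughly 87.6 hours a year, while a mirrored pair (`redundant_availability(0.99, 2)`) is down for less than one hour.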

The following are some innovative tools and recommendations that will help minimize the effects of a system failure:
Fault-tolerant hardware. Teradata system architecture is built with high fault tolerance and availability standards. Mirrored internal disks, replicated AMPs, even replication on alternate systems are precautionary elements of the architecture. As mentioned earlier, HSNs are the most versatile hardware component available to minimize system downtime. Hot swappable disks, fans and controllers are also available. When hardware faults occur, it is critical that hot swappable items are identified so they can quickly be replaced. If there is any doubt, your Teradata Customer Services support person can consult with the Teradata Support Center to help quickly make this determination.
Environment. The system is set up to continuously monitor the temperature and humidity on all nodes. If a cabinet reaches hot status, the system is designed to shut down in an orderly fashion to protect the data. Redundant power and cooling sources should come from different conduits in case of an outage, and backup power in the form of an uninterruptible power supply should be available and up to specification at all times.
Disaster recovery site. Systems can become unavailable because of a hardware failure, flooding, fire, theft, an extended power outage or countless other unexpected events. When continuous availability is an SLA requirement, an off-site backup location, separate from the primary site, is necessary to ensure the organization's data is protected and available.

In addition, a disaster recovery plan must be in place, updated annually and tested to confirm the system's recoverability. This testing is critical and can be performed either by resources within the organization or by contracted disaster recovery experts.
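The environmental monitoring described under "Environment" can be pictured as a simple threshold check. The Python sketch below is only an illustration: the threshold values, status names and data structure are assumptions, not the limits or interfaces the Teradata platform actually uses.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; real limits come from the hardware specification.
WARN_TEMP_C = 30.0
HOT_TEMP_C = 40.0
MIN_HUMIDITY_PCT = 20.0
MAX_HUMIDITY_PCT = 80.0


@dataclass
class CabinetReading:
    cabinet_id: str
    temperature_c: float
    humidity_pct: float


def classify(reading: CabinetReading) -> str:
    """Return "hot", "warn" or "ok" for one cabinet reading."""
    if reading.temperature_c >= HOT_TEMP_C:
        return "hot"
    humidity_out_of_range = not (
        MIN_HUMIDITY_PCT <= reading.humidity_pct <= MAX_HUMIDITY_PCT
    )
    if reading.temperature_c >= WARN_TEMP_C or humidity_out_of_range:
        return "warn"
    return "ok"


def cabinets_to_shut_down(readings: List[CabinetReading]) -> List[str]:
    """IDs of cabinets in hot status, which should be shut down in an orderly way."""
    return [r.cabinet_id for r in readings if classify(r) == "hot"]
```

Keeping a separate warning level gives operations time to act before a cabinet reaches the point where an orderly shutdown is the only safe option.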

Greater availability
Understanding and implementing availability best practices in the decision-support environment is crucial to mitigating risk. Establishing a benchmark and then targeting consistent, best-in-class processes, proper personnel alignment and innovative technology will contribute to greater availability and increased efficiency of the Teradata system.

Methodology for mitigating availability risk

Teradata has developed a proven methodology for understanding and mitigating availability risk, based on the IT Infrastructure Library (ITIL) framework. The methodology includes tools for identifying specific availability management gaps and a portfolio of products and services to match availability needs. For example, Teradata's Parallel Upgrade Tool can spool packages so a software upgrade can be reduced to a five-minute restart. If a reboot is necessary, version migration and fallback can prepare the alternate boot environment.


Yuri Pinzon, a solutions architect, joined Teradata eight years ago and has been in the IT field for more than 15 years.

Mary Pat Simmons, a customer service marketing manager, has worked for Teradata over the last 20 years.

Photography by Getty Images

Teradata Magazine-December 2008

Copyright © 2008 Teradata Corporation. All rights reserved.