Minimize risk, maximize availability

A practical approach to availability management.

by Mary Pat Simmons and Mark Hancock

Downtime is a fundamental metric for measuring productivity in a data warehouse, but this number does little to help you understand the basis of a system's availability. Focusing too much on the end-of-month number can perpetuate a bias toward a reactive view of availability. Root-cause analysis is important for preventing specific past problems from recurring, but it doesn't prevent new issues from causing future downtime.

Potentially more dangerous is the false sense of security encouraged by historically high availability. Even perfect availability in the past provides no assurance that you are prepared to handle the risks that may lie just ahead or to keep pace with the changing needs of your system and users.

So how can you shift your perspective to a progressive view of providing for availability needs on a continual basis? The answer is availability management—a proactive approach to availability that applies risk management concepts to minimize the chance of downtime and prolonged outages. Teradata recommends four steps for successful availability management.

#1: Understand the risks
Effective availability management begins with understanding the nature of risk. "There are a variety of occurrences that negatively impact the site, system or data, which can reduce the availability experienced by end users. We refer to these as risk events," explains Kevin Lewis, director of Teradata Customer Services Offer Management.

The features of effective availability management:

> Improves system productivity and quality of support
> Encourages partnering to meet strategic and tactical availability needs
> Recognizes all sources and impacts of availability risk
> Applies a simple, holistic approach to risk mitigation
> Facilitates communication between operations and management
> Includes benchmarking using an objective, best-practice assessment
> Establishes a clear improvement roadmap to meet evolving needs

The more vulnerable a system is to risk events, the greater the potential for extended outages or reduced availability and, consequently, lost business productivity.

Data warehousing risk events can range from the barely detectable to the inconvenient to the catastrophic. Risk events can be sorted into three familiar categories of downtime based on their type of impact:
Planned downtime is a scheduled system outage, usually during low-usage or non-critical periods (e.g., upgrades/updates, planned maintenance, testing).
Unplanned downtime is an unanticipated loss of system, data or application access (e.g., utility outages, human error, planned downtime overruns).
Degraded downtime is "low quality" availability in which the system is available, but performance is slow and inefficient (e.g., poor workload management, capacity exhaustion).
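To make the three categories concrete, the arithmetic behind an end-of-month availability figure can be sketched as follows. This is a minimal illustration, not a Teradata tool: the hour values are invented sample data, the name `availability_pct` is hypothetical, and counting degraded hours as fully lost time is an assumption you may prefer to weight differently.

```python
# Illustrative only: all hour values below are made-up sample data.
HOURS_IN_PERIOD = 30 * 24  # a 30-day month = 720 hours

# Hypothetical downtime log, in hours, keyed by the three categories.
downtime_hours = {
    "planned": 6.0,
    "unplanned": 1.5,
    "degraded": 4.5,
}


def availability_pct(lost_hours, period_hours=HOURS_IN_PERIOD):
    """Return the percentage of the period that was not lost to downtime."""
    return 100.0 * (period_hours - lost_hours) / period_hours


# Narrow view: only unplanned outages count against availability.
unplanned_only = availability_pct(downtime_hours["unplanned"])

# Broader view (an assumption): every category counts as lost time.
all_categories = availability_pct(sum(downtime_hours.values()))

print(f"Unplanned only: {unplanned_only:.2f}%")  # prints 99.79%
print(f"All categories: {all_categories:.2f}%")  # prints 98.33%
```

With these sample numbers, the same month looks noticeably less healthy once planned and degraded downtime are included, which is why a single headline figure can hide where the exposure lies.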

Although unplanned downtime is usually the most painful, companies have a growing need to reduce degraded and planned downtime as well. Given the variety of risk causes and impacts, follow the next step to reduce your system's vulnerability to risk events.

#2: Assess and strategize
Although risk events that affect the Teradata system are often beyond your control, applying a sound availability management framework mitigates their impact. To meet strategic and tactical availability objectives, Teradata advocates a holistic system of seven attributes that address all areas affecting system availability. These availability management attributes are the tangible, real-world IT assets, tools, people and processes that can be budgeted, assigned, administered and supervised to support system availability. They are:
Environment. The equipment layout and physical conditions within the data center that houses the infrastructure, including temperature, airflow, power quality and data center cleanliness
Infrastructure. The IT assets, the network architecture and configuration connecting them, and their compatibility with one another. These assets include the production system; dual systems; backup, archive and restore (BAR) hardware and software; test and development systems; and disaster recovery systems
Technology. The design of each system, including hardware and software versions, enabled utilities and tools, and remote connectivity
Support level. Maintenance coverage hours, response times, proactive processes, support tools employed and the accompanying availability reports
Operations. Operational procedures and support personnel used in the daily administration of the system and database
Data protection. Processes and product features that minimize or eliminate data loss, corruption and theft; this includes system security, fallback, hot standby nodes, hot standby disks and large cliques
Recoverability. Strategies and processes to regularly back up and archive data and to restore data and functionality in case of data loss or disaster

As this list of attributes shows, supporting availability goes beyond maintenance service level agreements and downtime reporting. These attributes incorporate multiple technologies, service providers, support functions and management areas. This span necessitates an active partnership between Teradata and the customer to ensure all areas are adequately addressed. In addition to being comprehensive, these attributes provide a common language for communicating, identifying and addressing availability management needs.

Figure 1: How reliable is your availability management?
Answer the sample best-practice questions for each attribute. A "no" response to any yes/no question represents an availability management gap. Other questions will help you assess your system's overall availability management.
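The scoring rule described for figure 1, where each "no" counts as a gap, can be sketched in a few lines. This is only an illustration under stated assumptions: the answers below are invented placeholders rather than the actual figure 1 questions, and `count_gaps` is a hypothetical helper name. Only the seven attribute names come from the article.

```python
# The seven availability management attributes named in the article.
ATTRIBUTES = [
    "Environment",
    "Infrastructure",
    "Technology",
    "Support level",
    "Operations",
    "Data protection",
    "Recoverability",
]

# Hypothetical yes/no answers per attribute (True = "yes", False = "no").
# Real answers would come from the best-practice questions in figure 1.
sample_answers = {
    "Environment": [True, True],
    "Infrastructure": [True, False],
    "Technology": [True, True],
    "Support level": [False, False],
    "Operations": [True, True],
    "Data protection": [True, False],
    "Recoverability": [True, True],
}


def count_gaps(answers):
    """Map each attribute to its number of "no" responses (gaps)."""
    return {
        attr: sum(1 for answer in answers.get(attr, []) if not answer)
        for attr in ATTRIBUTES
    }


gaps = count_gaps(sample_answers)

# List attributes with at least one gap, most gaps first.
for attr, n in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    if n:
        print(f"{attr}: {n} gap(s)")
```

Grouping the tally by attribute keeps the result in the same common language the seven attributes are meant to provide.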

Dan Odette, Teradata Availability Center of Expertise leader, explains: "Discussing these attributes with customers makes it easier for them to understand their system availability gaps and plan an improvement roadmap. This approach helps customers who are unfamiliar with the technical details of the Teradata system or IT support best practices such as the Information Technology Infrastructure Library [ITIL]."

#3: Weigh the odds
To reduce the risk of downtime and/or prolonged outages, your availability management capabilities must be sufficient to meet your usage needs. (See figure 1.)

According to Chris Bowman, Teradata Technical Solutions architect, "Teradata encourages customers to obtain a more holistic view of their system availability and take appropriate action based on benchmarking across all of the attributes." In order to help customers accomplish this, Teradata offers an Availability Assessment service. "We apply Teradata technological and ITIL service management best practices to examine the people, processes, tools and architectural solutions across the seven attributes to identify system availability risks," Bowman says.

A comprehensive availability management assessment should consist of three phases (see figure 2):
Collect. Data is collected across all attributes, including environmental measurements, current hardware/software configurations, historic incident data and best-practice conformity by all personnel who support and administer the Teradata system. This includes customer management and staff, Teradata support services, and possibly other external service providers. Much of this data can be collected remotely by Teradata, though the customer is asked to assign a liaison within its organization to facilitate access to the system and coordinate any personnel interviews.
Analyze. Data is consolidated and analyzed by an availability management expert who has a strong understanding of the technical details within each attribute and their collective impact on availability. During this stage, the goal is to uncover gaps that may not be apparent because of a lack of best-practice knowledge or organizational "silos." Silos are characterized by a lack of cross-functional coordination due to separate decision-making hierarchies or competing organizational objectives.
Recommend. The key deliverable of an assessment is a clear list of practical recommendations for availability management improvements. To have the maximum positive impact, recommendations must include:
  • An unbiased, expert perspective of the customer's specific availability management situation
  • Mitigation suggestions to prevent the recurrence of historical outages
  • Quantified benchmarking across all attributes to pinpoint the areas of greatest vulnerability to risk events
  • Corrective actions provided for every best-practice shortfall
  • Operations-level improvement actions with technical details to facilitate tactical implementation
  • Management-level guidance in the form of a less technical, executive scorecard to facilitate decision making and budget prioritization

Figure 2: Availability assessment process
Teradata collects data across all attributes and analyzes the current effectiveness of your availability management. The result is quantified benchmarking and actionable recommendations.

#4: Plan the next move
The recommendations from the assessment provide the basis for an availability management improvement roadmap.

"Cross-functional participation by both operations and management levels is crucial for maximizing the knowledge transfer of the assessment findings and ensuring follow-through," Odette says.

Typically, not all of the recommendations can be implemented at once because of resource and budget constraints, so it's common to take a phased approach. Priorities are based on the assessment benchmarks, the customer's business objectives, the planned evolution for use of the Teradata system and cost-to-benefit considerations.

Many improvements can be effectively cost-free to implement but still have a big impact. For example, adjusting equipment layout can improve airflow, which in turn can reduce heat-related equipment failures. Or, having the system/database administrators leverage the full capabilities of tools already built into Teradata can prevent or reduce outages. Lewis adds, "More significant improvements such as a disaster recovery capability or dual active systems may require greater investment and effort, but incremental steps can be planned and enacted over time to ensure availability management keeps pace with the customer's evolving needs."
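One way to picture a phased roadmap is to rank recommendations by benefit relative to cost and pack them into phases under a budget. The sketch below is a simplified assumption, not the method used in a Teradata assessment: the benefit and cost scores are invented, the `Recommendation` class and `plan_phases` function are hypothetical names, and a real plan would also weigh business objectives and the planned evolution of the system.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    name: str
    benefit: int  # invented relative score, higher is better
    cost: int     # invented relative score, 1 = effectively cost-free


# Example improvements mentioned in the article, with made-up scores.
recs = [
    Recommendation("Dual active systems", benefit=9, cost=9),
    Recommendation("Adjust equipment layout to improve airflow", benefit=5, cost=1),
    Recommendation("Disaster recovery capability", benefit=8, cost=6),
    Recommendation("Use tools already built into Teradata", benefit=6, cost=1),
]


def plan_phases(recommendations, budget_per_phase):
    """Greedily group recommendations into phases, best benefit/cost first.

    An item that costs more than the budget still gets a phase of its own.
    """
    ranked = sorted(recommendations, key=lambda r: r.benefit / r.cost, reverse=True)
    phases, current, spent = [], [], 0
    for rec in ranked:
        # Start a new phase when this item would exceed the phase budget.
        if current and spent + rec.cost > budget_per_phase:
            phases.append(current)
            current, spent = [], 0
        current.append(rec)
        spent += rec.cost
    if current:
        phases.append(current)
    return phases


phases = plan_phases(recs, budget_per_phase=8)
for number, phase in enumerate(phases, start=1):
    print(f"Phase {number}: " + "; ".join(rec.name for rec in phase))
```

With these sample scores, the low-cost items and disaster recovery land in the first phase, and dual active systems follows in a later one, mirroring the incremental approach Lewis describes.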

An effective availability management strategy requires a partnership between you, as the customer, and Teradata. Together, we can apply a comprehensive framework of best practices to proactively manage risk and provide for your ongoing availability needs.

Mary Pat Simmons has worked for NCR and Teradata for the last 20 years in a variety of sales and service marketing positions.

Mark Hancock is the offer manager for Teradata's Availability Management Services.

Teradata Magazine-June 2007

Copyright © 2008 Teradata Corporation. All rights reserved.