Applied Solutions

Combined forces

Combining the strengths of Teradata and SAS enables better, faster analyses.

by Mike Rote

SAS is renowned for its statistical analysis capability and its business intelligence (BI) software. Teradata is known foremost as the pre-eminent platform for decision-support-related applications enabled by the Teradata Database, a uniquely architected scalable and parallel relational database management system (RDBMS). Pairing these software products, SAS and the Teradata Database, provides users with extensive capability for developing and deploying analytic applications on a scalable data warehouse system.

Figure 1: The two approaches to advanced analytics
Use SQL to process data sets, perform fundamental analyses, compile aggregations and do joins in the database, moving only the reduced data to servers and the desktop for dramatic performance improvement.

In fact, major corporations worldwide recognize Teradata and SAS Institute as two of the best-known software companies. It is no coincidence, therefore, that many businesses use both software products: the Teradata Database to enable the effective organization and analysis of large amounts of data, and the SAS system to perform more specialized analyses. Many customers have said that the most desirable approach is to leverage SAS's application development capability directly on data housed in the Teradata Database.

Why integrate?
A major decision in application design is whether the data is stored and processed in the data warehouse, or extracted from the data warehouse, placed on a server in the form of a SAS data set and then processed on that server. Often, a combination of both is desired. The implication is that applications must be architected (and in some cases re-architected) so that the Teradata Database and the SAS system are utilized as efficiently as possible.

Optimize your environment
Bearing in mind the respective strengths of the SAS and Teradata technologies, the objective is to develop and deploy efficient applications so the right processing is done on the right platform without unnecessary data movement. This typically means that the processing and examination of large amounts of data should happen in the database. For smaller amounts of data where procedural language capability is required, the specific analysis should take place utilizing SAS on the server.

Reducing data movement is essential for lowering costs and gaining efficiencies. Such reduction can be accomplished by either lessening the amount of data being transferred over the network or decreasing the number of times transfers are performed. Aggregation, column reduction, sampling and report generation for BI or light analytic work should be done in the Teradata Database; that way, only the necessary data is brought over to SAS. Analysis, formatting and presentation of results should be done in SAS.

Many users currently leverage these techniques; they push SQL to Teradata through SAS Proc SQL and then run other SAS code against the resultant data sets. When the operations can be performed in SQL (especially where the data is large), pushing processing to the database often yields dramatic performance improvements. This approach is pictorially represented in the bottom half of figure 1, above.
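
For illustration, here is a minimal sketch of that pattern using explicit SQL pass-through; the server, credentials, tables and columns (tdprod, analyst, sales_detail, customer) are placeholders, not references to any particular system.

proc sql;
   /* Connect directly to the Teradata Database (placeholder connection values) */
   connect to teradata (server=tdprod user=analyst password=XXXXXX);

   /* Push the heavy lifting (join and aggregation) into Teradata;
      only the summarized rows cross the network into the SAS data set */
   create table work.cust_summary as
   select * from connection to teradata
      ( select c.region,
               count(*)      as n_customers,
               sum(s.amount) as total_sales
          from sales_detail s
          join customer c
            on s.cust_id = c.cust_id
         group by c.region );

   disconnect from teradata;
quit;

/* The reduced result is now small enough to analyze and report on in SAS */
proc means data=work.cust_summary;
   var total_sales;
run;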

Figure 2: Integrated data mining environment
With Teradata's ability to consume Predictive Model Markup Language (PMML) generated by SAS Enterprise Miner and other data mining tools, the scoring function can be accomplished in the database, regardless of where or how the model was developed.

Eliminate traffic jams
Decreasing data traffic in the advanced analytic space can produce significant results in reducing process time. For instance, a 500 million-row table of 100 numeric columns used as input to a regression analysis represents a significant amount of data: roughly 250GB even if the average column size is only 5 bytes. As an alternative to moving the data, a correlation matrix might be created in Teradata, reducing the data to a 100-by-100 matrix, and then that matrix could be provided to SAS for the regression analysis.
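
A minimal sketch of that reduction appears below, assuming a Teradata release that supports the ANSI CORR aggregate; it is shown for three columns with invented names, while the real case would enumerate all 100.

proc sql;
   connect to teradata (server=tdprod user=analyst password=XXXXXX);

   /* Compute the correlation structure inside Teradata so that only a
      handful of numbers, not 500 million rows, travel to SAS */
   create table work.corr_small as
   select * from connection to teradata
      ( select corr(col_a, col_b) as corr_ab,
               corr(col_a, col_c) as corr_ac,
               corr(col_b, col_c) as corr_bc
          from analytics.wide_table );

   disconnect from teradata;
quit;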

As another example, to obtain an analysis of the customers likely to respond to a new product offering, a response model could be developed in the SAS environment. The platform choice for deployment has definite implications. By pushing the production model scoring process into Teradata, it becomes unnecessary to move the data around the network each time this model is deployed to rescore the customers. This enables the process to execute in a fraction of the time it would otherwise take.

One important consideration in this scenario is whether traditional SAS (BASE SAS and SAS/STAT) or SAS Enterprise Miner was used to develop the model. In the traditional SAS environment, some manual effort is required to create the SQL scoring script from the model documentation. SAS Enterprise Miner, however, supports the emerging Predictive Model Markup Language (PMML) standard and can generate PMML automatically. Teradata's analytic data set generator product can read the PMML that SAS Enterprise Miner generates and programmatically produce the scoring code in SQL and user-defined functions.

Consider the case of a wireless provider that has developed a model in SAS Enterprise Miner to identify customers unlikely to renew their contracts. Each week, millions of customers need to be rescored. By generating the SQL scoring code from PMML, this process can run in the database in mere minutes. (See figure 2, above.)
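
The generated scoring code differs from model to model, but conceptually it is a SQL expression applied to every row in place. The sketch below gives a flavor of such an in-database scoring step; the coefficients, tables and columns are invented for illustration and are not the output of any actual PMML conversion.

proc sql;
   connect to teradata (server=tdprod user=analyst password=XXXXXX);

   /* Re-score every customer inside the database; no detail data leaves
      Teradata. Model coefficients and column names are illustrative only. */
   execute (
      create table mktg.churn_scores as (
         select cust_id,
                1.0 / (1.0 + exp(-( -2.4
                                    + 0.03 * calls_to_care
                                    - 0.55 * months_to_contract_end ))) as churn_score
           from mktg.customer_summary
      ) with data
   ) by teradata;

   disconnect from teradata;
quit;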

Some detailed considerations
Two primary options exist for getting data to SAS from Teradata: indirect and direct. With the indirect option, SAS reads a text file exported from Teradata. This method can be cumbersome to manage and is not generally recommended. The preferred option is to use one of the direct methods described below.

Figure 3: Best of both worlds
SAS Access to Teradata, the most efficient protocol for data movement, allows SQL to be passed from the SAS program to Teradata utilizing the appropriate Teradata load utilities.

Within direct connectivity between the SAS system and the Teradata Database, there are two protocol options: SAS Access to ODBC and SAS Access to Teradata. (For simplicity, these two options will be referred to as ODBC and SAS Access, respectively.) SAS Access is the recommended choice in almost all cases, unless the data involved is very small. It is faster, more flexible and most fully leverages the strengths of Teradata and SAS. (See figure 3, right.) SAS Access sits on the servers and workstations that are clients to Teradata, and it can be thought of as a pipe between the two technologies.
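
As a simple sketch, a direct SAS Access connection is assigned with a LIBNAME statement using the teradata engine; the server, credentials and table name below are placeholders.

/* Assign a SAS libref that points directly at the Teradata Database
   through the SAS Access engine (placeholder connection values) */
libname td teradata server=tdprod user=analyst password=XXXXXX;

/* Warehouse tables can now be referenced like members of a SAS library */
proc contents data=td.sales_detail;
run;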

When extracting data from Teradata into SAS, SAS Access, by default, uses the Teradata call-level interface (CLI) since multi-session CLI is a more efficient protocol than ODBC. Additionally, when a large amount of data needs to be moved, users can select options in SAS Access to leverage the FastExport utility or, if moving data in the other direction, the FastLoad and MultiLoad utilities.

A SAS program passes SQL to Teradata via either implicit or explicit SQL pass-through. Implicit pass-through is performed when traditional SAS code is translated to Teradata SQL without user intervention. When the translation happens as expected, this automatic method can be advantageous. However, since user control is relinquished, it also runs the risk of being problematic. For example, if the requested SAS functions cannot be translated into Teradata SQL, then SAS will extract the data, move it across the network and perform the requested commands on the server. This could cause an unexpected data-flow bottleneck if the data in question is large, rendering this option of limited value.
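
A short sketch of the distinction, again with placeholder names: the query below is one that SAS can translate and pass through implicitly, so only the summary rows return to the server.

libname td teradata server=tdprod user=analyst password=XXXXXX;

/* Implicit pass-through: SAS translates this query into Teradata SQL and
   runs it in the database, so only the summary rows cross the network.
   If the query relied on a SAS-specific function with no SQL equivalent,
   the detail rows would instead be extracted and processed on the server. */
proc sql;
   select region,
          count(*) as n_active
     from td.customer
    where status = 'ACTIVE'
    group by region;
quit;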

Alternatively, explicit SQL pass-through allows the user to "pass" Teradata SQL from the SAS program to Teradata. This ability is key to the efficiency of these pieces of software working together. With this full control (and responsibility) for the executed SQL, the user can obtain the precise, required data from Teradata.

To move data from SAS into Teradata, selecting the proper option in SAS Access will generate the FastLoad or MultiLoad code and will load data into Teradata with the chosen utility. If the amount of data is small, then the options do not need to be specified and row insert operations will be executed.
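
A brief sketch of the load path, assuming the FASTLOAD= data set option of the SAS Access engine (option names can vary by SAS release); table and connection values are placeholders.

libname td teradata server=tdprod user=analyst password=XXXXXX;

/* Bulk-load a large SAS data set into an empty Teradata table with FastLoad.
   For appending to an already-populated table, the MultiLoad option would be
   chosen instead; for small data sets, omit the option and ordinary row
   inserts are performed. */
data td.cust_scores (fastload=yes);
   set work.cust_scores;
run;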

Warner Home Video improves analytics

Shifting processing into Teradata has provided dramatic improvements to SAS analytics. Just ask Thomas Tileston, vice president of global forecasting at Warner Home Video. As highlighted in the article "Taking the sting out of forecasting," by Joaquim P. Menezes (Nov. 1, 2006), Tileston must accurately forecast the demand for new movie titles so the right movies, in the right quantity, are at the right stores at the right time. This is no easy task.

Tileston and his team combined the strengths of two key data mining and data warehousing technologies. After much experimentation, they split the required tasks between SAS and Teradata. Previously, when done in a SAS-only environment, new release forecasts for 300 titles took 36 hours. "By splitting tasks between Teradata (90 percent) and SAS (10 percent), the same release forecast now takes an hour and 15 minutes. That's a huge win," says Tileston.

—Arlene Zaima

Mike Rote is the director of Teradata's Advanced Analytics Center of Excellence (COE) for the Americas and is responsible for overall center operations.

Teradata Magazine, June 2007
