Optimizing the strengths of Teradata and SAS enables better, faster analyses.
by Mike Rote
SAS is renowned for its statistical analysis capability and its business intelligence (BI) software. Teradata is known foremost as the
pre-eminent platform for decision-support-related applications enabled by the Teradata Database, a uniquely architected scalable and parallel
relational database management system (RDBMS). Pairing these software products—SAS and the Teradata Database—provides users extensive
capability for developing and deploying analytic applications utilizing a scalable data warehouse system.
Use SQL to process data sets, perform fundamental analyses, compile aggregations and perform joins in the database, moving
only the reduced data to servers and the desktop for dramatic performance improvement.
In fact, major corporations worldwide recognize Teradata and SAS Institute as two of the best-known software companies. It is no coincidence,
therefore, that many businesses use both products: the Teradata Database to organize and analyze large amounts of data effectively, and the
SAS system to perform further, more specialized analyses. Many customers have said that the most desirable choice would be to leverage SAS's
application development capability directly on data housed in the Teradata Database.
A major decision in application design is whether the data is stored and processed in the data warehouse or extracted from the data warehouse,
placed on a server in the form of a SAS data set and then processed on that server. Often, some of both is desired. The implication is that
applications must be architected (and in some cases re-architected) so that the Teradata Database and the SAS system are utilized as efficiently
as possible.
Optimize your environment
Bearing in mind the respective strengths of the SAS and Teradata technologies, the objective is to develop and deploy efficient applications
so the right processing is done on the right platform without unnecessary data movement. This typically means that the processing and
examination of large amounts of data should happen in the database. For smaller amounts of data where procedural language capability is
required, the specific analysis should take place utilizing SAS on the server.
Reducing data movement is essential for lowering costs and gaining efficiencies. Such reduction can be accomplished by either lessening the
amount of data being transferred over the network or decreasing the number of times transfers are performed. Aggregation, column reduction,
sampling and generating reports for BI or light analytic space should be done in the Teradata Database; that way, only the necessary data is
brought over to SAS. Analyzing formats and presenting results should be done via SAS.
Many users currently leverage these techniques; they push SQL to Teradata through SAS Proc SQL and then run other SAS code against the
resultant data sets. When the operations can be performed in SQL (especially where the data is large), pushing processing to the database
often yields dramatic performance improvements. This approach is pictorially represented in the bottom half of figure 1, above.
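As a minimal sketch of this two-step pattern (the credentials, library and table names here are hypothetical), a PROC SQL pass-through step pushes an aggregation to Teradata, and subsequent SAS code then runs against the small result set:

```sas
proc sql;
   /* Hypothetical connection details; replace with your own. */
   connect to teradata (user=myuser password=mypass server=tdserv);
   /* The aggregation runs inside Teradata; only the summary rows
      cross the network into the SAS data set WORK.SALES_SUMMARY. */
   create table work.sales_summary as
   select * from connection to teradata (
      select region,
             product_id,
             sum(sales_amt) as total_sales,
             count(*)       as txn_count
      from   retail.sales_detail
      group by region, product_id
   );
   disconnect from teradata;
quit;

/* Further SAS processing operates on the reduced data only. */
proc means data=work.sales_summary;
   var total_sales;
run;
```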
With Teradata's ability to consume predictive model markup language (PMML) generated by SAS Enterprise Miner and other
data mining tools, the scoring function can be accomplished in the database, regardless of where or how the model was
developed.
Eliminate traffic jams
Decreasing data traffic in the advanced analytic space can significantly reduce process time. For instance, a 500
million-row table of 100 numeric columns used as input to a regression analysis represents a significant amount of data: even if the average
column size were only 5 bytes, that is roughly 250GB. As an alternative to moving the data, a correlation matrix might be created in Teradata,
reducing the data to a 100-by-100 matrix, and then that matrix could be provided to SAS for the regression analysis.
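As an illustration (assuming a Teradata release that provides the CORR aggregate; credentials, table and column names are hypothetical), a few of the pairwise correlations might be computed in the database like this, with the full 100-by-100 matrix generated programmatically in practice:

```sas
proc sql;
   connect to teradata (user=myuser password=mypass server=tdserv);
   /* Each CORR() aggregate is computed in parallel inside Teradata;
      only a single row of correlation values returns to SAS. */
   create table work.corr_pairs as
   select * from connection to teradata (
      select corr(x1, x2) as r_x1_x2,
             corr(x1, x3) as r_x1_x3,
             corr(x2, x3) as r_x2_x3
      from   analytics.model_input
   );
   disconnect from teradata;
quit;
```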
As another example, to obtain an analysis of the customers likely to respond to a new product offering, a response model could be developed in
the SAS environment. The platform choice for deployment has definite implications. By pushing the production model scoring process into
Teradata, it becomes unnecessary to move the data around the network each time this model is deployed to rescore the customers. This enables
the process to execute in a fraction of the time it would otherwise take.
One important consideration in this scenario is whether traditional SAS (BASE SAS and SAS/STAT) or SAS Enterprise Miner was used to develop
the model. In the traditional SAS environment, some manual effort is required to create the SQL scoring script from the model documentation.
SAS Enterprise Miner, however, can leverage the emerging predictive model markup language (PMML) standard so that it automatically generates
PMML. Teradata has the capability in its analytic data set generator product to read the PMML that SAS Enterprise Miner generates and
programmatically produce the scoring code in SQL and user-defined functions.
Consider the case of a wireless provider that has developed a model in SAS Enterprise Miner to identify customers unlikely to renew their
contracts. Each week, millions of customers need to be rescored. By generating the SQL scoring code from PMML, this process can run in the
database in mere minutes. (See figure 2, above.)
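To make the idea concrete (all names and coefficients below are hypothetical, and in practice the SQL would be generated from the PMML rather than written by hand), in-database scoring for a simple logistic model might look like:

```sas
proc sql;
   connect to teradata (user=myuser password=mypass server=tdserv);
   /* The scoring expression runs entirely in Teradata; no customer
      rows are moved across the network. */
   execute (
      create table scores.churn_scores as (
         select customer_id,
                1.0 / (1.0 + exp(-( -2.3
                                    + 0.8 * dropped_calls
                                    - 0.5 * tenure_months ))) as churn_prob
         from   analytics.customer_features
      ) with data
   ) by teradata;
   disconnect from teradata;
quit;
```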
Some detailed considerations
Two primary options exist for getting data to SAS from Teradata: indirect and direct. With the indirect option, SAS reads a text file exported
from Teradata. This method can be cumbersome to manage and is not generally recommended. The preferred option is to use one of the direct
methods described below.
SAS Access to Teradata, the most efficient protocol for data movement, allows SQL to be passed from the SAS program to
Teradata and can invoke the appropriate Teradata load utilities.
Within direct connectivity between the SAS system and the Teradata Database, there are two protocol options: SAS Access to ODBC and SAS Access
to Teradata. (For simplicity, these two options will be referred to as ODBC and SAS Access, respectively.) SAS Access is the recommended
choice in almost all cases, unless the data involved is very small. It is faster, more flexible and most fully leverages the strengths of
Teradata and SAS. (See figure 3, right.) SAS Access sits on the servers and workstations that are clients to Teradata; it can be
thought of as a pipe between the two technologies.
When extracting data from Teradata into SAS, SAS Access, by default, uses the Teradata call-level interface (CLI) since multi-session CLI is a
more efficient protocol than ODBC. Additionally, when a large amount of data needs to be moved, users can select options in SAS Access to
leverage the FastExport utility or, if moving data in the other direction, the FastLoad and MultiLoad utilities.
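As a minimal sketch (connection values and table names are hypothetical, and exact option names vary by SAS release), a large extract might enable the FastExport utility like this:

```sas
/* Hypothetical credentials and table; FASTEXPORT= requires the
   Teradata FastExport utility to be installed alongside SAS. */
libname td teradata user=myuser password=mypass server=tdserv
        fastexport=yes;

/* The rows stream to SAS through FastExport rather than CLI. */
data work.big_extract;
   set td.sales_detail;
run;
```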
A SAS program passes SQL to Teradata via either implicit or explicit SQL pass-through. Implicit pass-through is performed when traditional SAS
code is translated to Teradata SQL without user intervention. When the translation happens as expected, this automatic method can be
advantageous. However, since user control is relinquished, it also runs the risk of being problematic. For example, if the requested SAS functions
cannot be translated into Teradata SQL, then SAS will extract the data, move it across the network and perform the requested commands on the
server. This could cause an unexpected data-flow bottleneck if the data in question is large, rendering this option of limited value.
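For illustration (the libref, credentials and table names are hypothetical), implicit pass-through looks like ordinary SAS code written against a Teradata libref:

```sas
libname td teradata user=myuser password=mypass server=tdserv;

/* SAS attempts to translate this whole query into Teradata SQL, so the
   WHERE and GROUP BY run in the database. If an untranslatable SAS
   function were used here, the full table would be extracted instead. */
proc sql;
   create table work.east_counts as
   select region, count(*) as n
   from   td.sales_detail
   where  region = 'EAST'
   group by region;
quit;
```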
Alternatively, explicit SQL pass-through allows the user to "pass" Teradata SQL from the SAS program to Teradata. This ability is key to the
efficiency of these pieces of software working together. With this full control (and responsibility) for the executed SQL, the user can obtain
the precise, required data from Teradata.
To move data from SAS into Teradata, selecting the proper option in SAS Access will generate the FastLoad or MultiLoad code and will load data
into Teradata with the chosen utility. If the amount of data is small, then the options do not need to be specified and row insert operations
will be executed.
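A minimal sketch of the load direction (names are hypothetical, and option availability depends on the SAS release and installed Teradata utilities):

```sas
/* MULTILOAD=YES invokes the MultiLoad utility; BULKLOAD=YES invokes
   FastLoad (appropriate for empty target tables). Omitting both
   options falls back to ordinary row inserts. */
libname td teradata user=myuser password=mypass server=tdserv;

data td.cust_scores (multiload=yes);
   set work.scored_customers;
run;
```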
Warner Home Video improves analytics
Optimizing analytics with Teradata has provided dramatic improvements to SAS analytics. Just ask Thomas Tileston,
vice president of global forecasting at Warner Home Video. As highlighted in the ITWorldCanada.com article
"Taking the sting out of forecasting," by Joaquim P. Menezes (Nov. 1, 2006), Tileston must accurately forecast
the demand for new movie titles so the right movies, in the right quantity, are at the right stores at the right
time. This is no easy task.
Tileston and his team combined the strengths of two key data mining and data warehousing technologies. After much
experimentation, they split the required tasks between SAS and Teradata. Previously, when done in a SAS-only
environment, new release forecasts for 300 titles took 36 hours. "By splitting tasks between Teradata (90 percent)
and SAS (10 percent), the same release forecast now takes an hour and 15 minutes. That's a huge win," Tileston says.
Mike Rote is the director of Teradata's Advanced Analytics Center of Excellence (COE) for the Americas and is responsible for overall center
operations.
Teradata Magazine-June 2007