
Reduce movement, gain speed

New SAS and Teradata integrated solution will move analytical data inside the data warehouse for faster performance.

by David Shamlin and Mike Rote

Businesses today want answers fast, with automated processes and complete accuracy. Organizations that use a data warehouse are getting these results, but they can soon be getting them faster and with even less effort.

In October 2007, SAS™ and Teradata announced a partnership designed to create breakthrough value by strengthening the integration between the SAS System and the Teradata enterprise data warehouse (EDW). This strengthened integration will move the analytic processing power of SAS closer to the data managed by Teradata. The end result will be an environment that looks and behaves like SAS from the analyst's perspective and like Teradata from the system administrator's perspective.

Figure 1 displays the current and future process architectures. The future environment is based on SAS functionality currently under development. To perform an analysis today, large volumes of data are often transferred from the Teradata EDW to SAS. In the future environment, key SAS functions are packaged to run inside the Teradata EDW, significantly reducing data movement and duplication. This new method will minimize the effort and shorten the time it takes to implement mission-critical applications, will reduce the time needed to answer business questions, and will simplify overall system administration. The net effect will be organizations with more agile business processes and more resources freed to focus on tasks with the greatest business value.

Figure 1: SAS integration with Teradata
Currently, data analytics are done outside of the Teradata enterprise data warehouse (EDW), as shown on the left in the figure. With the new SAS and Teradata solution (right), SAS functions are contained within the Teradata EDW, reducing data movement and enabling faster analysis.

Proven results
The new method was tested to ensure the functionality of the SAS and Teradata solution. A series of proof-of-concept (POC) projects was executed to test a prototype integration of the two companies' software products. Four key SAS components were selected for the tests because they embody basic SAS constructs that recur in many SAS applications. The following are descriptions of how these SAS components worked within the Teradata system:
SAS formats. Enabled repackaging proprietary SAS software as user-defined functions (UDFs), which can be executed inside the Teradata system
Analytic model scoring algorithms. Deployed SAS-generated custom models as UDFs inside the Teradata database
SAS FREQ. Showed that SAS components can directly generate and "push down" SQL to Teradata, significantly reducing data movement between the two systems
SAS Stored Processes. Proved SAS programs can be accessed by Teradata as if they are Teradata stored procedures

The FREQ procedure
The SAS System comprises hundreds of modules of prepackaged functionality referred to as procedures. One commonly used procedure is FREQ (also known as PROC FREQ). The following simplified example shows how the frequencies and percentages of customer credit scores by state are calculated:

proc freq data=customer.credit_data;
     table state * credit_score;
run;

The DATA= option tells PROC FREQ to use the data found in customer.credit_data for its calculation. Customer is the libref that identifies where the data resides (in this case, a Teradata database), and credit_data is the table in that database. The TABLE statement instructs PROC FREQ to gather statistics on the credit_score column and group the results by the unique values of the state column.

To produce the report shown in figure 2, the credit_score values must be counted by state. Traditionally, these counts are done by SAS, as illustrated in figure 3: SAS executes the PROC step and connects to Teradata through a SAS access engine; the PROC directs the engine to fetch the rows of the input table, and the engine translates that request into a basic SQL SELECT statement. The result is a transfer of every row of the Teradata credit_data table to SAS.

Counting column values by group is also a standard SQL operation. In this part of the POC, PROC FREQ was modified to pass a more substantial SQL query to the SAS engine that could be pushed to Teradata. This was accomplished by enabling PROC FREQ to dynamically generate a SAS SQL view. Below is the SQL query generated by the PROC FREQ example:

select count(*), state, min(state),
       credit_score, min(credit_score)
from customer.credit_data
group by state, credit_score

The view is handed to the access engine, where it is re-expressed as Teradata SQL. Regardless of the number of rows in credit_data, only 51 rows (one for each state plus Washington, D.C.) are passed from Teradata to SAS.
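The effect of this pushdown can be sketched with a toy example. Here an in-memory SQLite database stands in for the Teradata EDW and Python stands in for SAS — both are illustrative assumptions, as are the table contents:

```python
import sqlite3
from collections import Counter

# In-memory SQLite stands in for the Teradata EDW (an assumption for
# illustration; the article's POC used a real Teradata system).
con = sqlite3.connect(":memory:")
con.execute("create table credit_data (state text, credit_score int)")
rows = ([("NC", 700)] * 3 + [("NC", 650)] * 2 +
        [("CA", 700)] * 4 + [("CA", 650)] * 1)   # hypothetical data
con.executemany("insert into credit_data values (?, ?)", rows)

# Traditional pattern: fetch every row, then count on the client side.
fetched = con.execute("select state, credit_score from credit_data").fetchall()
client_counts = Counter(fetched)          # 10 rows crossed the wire

# Pushdown pattern: the GROUP BY runs inside the database; only one row
# per (state, credit_score) group comes back.
pushed = con.execute(
    "select state, credit_score, count(*) from credit_data "
    "group by state, credit_score"
).fetchall()                              # 4 rows crossed the wire

print(len(fetched), len(pushed))          # rows transferred: 10 vs. 4
```

With 51 states' worth of real credit data, the same contrast is between millions of rows and 51 summary rows.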

Running SAS scoring code inside Teradata
Many businesses use analytic modeling to provide insight into questions such as: "Is a customer a good candidate for a marketing campaign?" and "Does a credit card transaction appear fraudulent?" However, the data volumes required to answer these questions continue to grow, in some cases doubling in size, and companies are challenged to create and deploy scoring models in a timely manner. These businesses are turning to vendors to accelerate the analytic cycle and keep pace with their growing data.

Figure 2: FREQ procedure output
This report is an example of what is produced when the frequency procedure, also known as PROC FREQ, computes basic descriptive statistics and produces n-way tabular reports.

SAS software has a rich set of data mining tools for developing models and scoring functions. The SAS DATA step, typically used to implement scoring functions, is inherently table-driven and lends itself neatly to many scoring operations. The input table contains customer transaction data to be scored; the output table stores the results.

Because DATA step scoring functions process each row independently, they can be executed in Teradata's shared-nothing environment. To test the performance of the scoring functions, the POC used:
An analytic model that scores a hypothetical bank customer's loan risk
A UDF approach to develop a prototype for deploying a scoring function created with SAS data mining tools
A six-node Teradata 5500 Server
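The row-independence property that makes shared-nothing execution possible can be sketched as follows. The scoring function and the six-way hash partitioning below are illustrative assumptions, not the POC's actual model:

```python
# A hypothetical loan-risk score (illustrative only; the POC deployed a
# SAS-generated model). It reads one row at a time, which is what allows
# shared-nothing, per-node execution.
def score(row):
    income, debt = row
    return round(debt / max(income, 1), 3)   # naive debt-to-income ratio

customers = [(50_000, 10_000), (80_000, 60_000), (30_000, 5_000),
             (120_000, 20_000), (45_000, 40_000), (70_000, 7_000)]

# Serial execution (a single "node").
serial = [score(r) for r in customers]

# Simulated shared-nothing execution: hash-partition the rows across six
# "nodes", score each partition independently, then recombine.
nodes = 6
partitions = [[] for _ in range(nodes)]
for i, r in enumerate(customers):
    partitions[hash(r) % nodes].append((i, r))

parallel = [None] * len(customers)
for part in partitions:
    for i, r in part:        # no node ever needs another node's rows
        parallel[i] = score(r)

assert serial == parallel    # partitioning does not change the results
```

Because no row depends on any other, the partitioning scheme is free to balance work across nodes, which is what produced the uniform processor activity observed in the POC.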

Because the DATA step employs a just-in-time compiler with dependencies on other SAS subsystems, embedding it directly posed challenges. In its place, a new language processor called the Table Server Programming Language (TSPL) was implemented. TSPL supports a syntax similar to that of the DATA step but is built on a more encapsulated framework.

Next, a tool was created to translate the subset of DATA step syntax generated by SAS data mining tools into TSPL statements. These statements were then deployed and executed as a Teradata UDF.

When the system was tested using the DATA step (external to Teradata), about 16,400 rows per second were scored, and execution bottlenecks were observed on the I/O channel between SAS and Teradata; processor activity on the Teradata nodes was intermittent and appeared randomly distributed. By comparison, executing TSPL as a UDF inside Teradata scored up to 750,000 rows per second when all six nodes were utilized (125,000 rows per second per node), and the processors were more uniformly active as work was distributed across the system.

This prototype demonstrated that a complex SAS program can be executed inside Teradata, offering better performance and additional business value. Because the scoring function moves to the data, any security "envelope" implemented in the Teradata system is honored. This keeps IT governance concerns focused on the Teradata EDW and reduces the costs incurred whenever EDW data is replicated into analytic data marts implemented with SAS data sets.

SAS Stored Process integration
To further test the system, another POC project was executed: the binding of SAS Stored Processes to Teradata stored procedures. A SAS Stored Process is any SAS program packaged for execution with the SAS Stored Process Server, one of the SAS9 platform cornerstones on which SAS solutions are built. Input and output parameters can be defined on a stored process to allow the end user to control the behavior of an individual invocation. In short, the SAS Stored Process Server allows a programmer to publish SAS programs that can subsequently be run on demand.
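The publish-and-invoke pattern described above can be sketched in a few lines. The registry, the publish decorator, and the credit_summary program are hypothetical names for illustration, not the SAS Stored Process API:

```python
# Minimal sketch of publishing parameterized programs for on-demand
# invocation by name, analogous to a stored process server. Names and
# structure are illustrative assumptions, not the actual SAS interfaces.
REGISTRY = {}

def publish(name):
    """Register a program so it can be run on demand by name."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@publish("credit_summary")
def credit_summary(state, min_score=600):
    # Input parameters control the behavior of an individual invocation,
    # as the article describes for stored processes.
    return f"summary for {state}, scores >= {min_score}"

def invoke(name, **params):
    """What a caller (e.g., a stored procedure binding) would execute."""
    return REGISTRY[name](**params)

print(invoke("credit_summary", state="NC", min_score=650))
```

Binding such an invocation to a Teradata stored procedure is what lets a SQL statement trigger an arbitrary published program.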

Figure 3: Traditional FREQ procedure data access pattern
In the traditional technique, every row of the input table is transferred out of the Teradata enterprise data warehouse (EDW) for processing. SQL pushdown is a much more efficient and faster method that processes the data directly in the EDW.

SAS solutions use stored processes to implement parts of their functionality. The SAS Stored Process Server can be deployed inside Teradata, giving organizations the option to implement their Teradata EDW and SAS solutions in a single hardware environment, thereby lowering their IT infrastructure complexity and cost.

To test this option, a new set of SAS Stored Process Server interfaces was created inside Teradata, allowing SAS Stored Processes to be invoked as Teradata stored procedures. A SAS Stored Process Server was also deployed on a node without database software. With these interfaces, a Teradata SQL statement could reference an arbitrary SAS program, making the full breadth of SAS functionality available for embedding in a business process driven by a Teradata query. It also kept data movement within the Teradata environment.

Integrated technology
The POC tests demonstrated the ability to deploy proprietary and user-defined SAS processes inside a Teradata EDW, thus serving as the basis for a tightly integrated Teradata/SAS architecture. The potential performance gains of this approach are significant.

At the time of this printing, the POC code is being reviewed, and plans are in place to deliver the initial functionality in mid-2008, offering organizations immediate benefits from SAS In-Database processing. Also under investigation are ways to apply the dynamic SQL generation strategy, with SAS statistical algorithms repackaged as Teradata UDFs, to the data mining procedures.

The partnership between SAS and Teradata is positioned to deliver truly scalable analytic solutions, improve time-to-value and reduce technology infrastructure costs. Businesses will benefit from the integrated technology platform with improved performance, enhanced productivity and a lower total cost of ownership.

David Shamlin, R&D storage director at SAS, has been with the company for 20 years and, as a member of the joint engineering team, helped develop the technology integration POC.

Mike Rote, director of the Teradata/SAS Center of Excellence, has been with Teradata for more than 15 years, most recently as director of the advanced analytics team.

Teradata Magazine-March 2008

Copyright © 2008 Teradata Corporation. All rights reserved.