Blaze ahead for better results

Front-running companies make use of active load architectures.

by Dan Graham

Most Teradata clients who want to activate their enterprise data warehouse (EDW) start with active load. Fortunately, active load is getting easier as Teradata TPump matures and more best-practices guides become available. Some sites start out simple; some leap straight into continuous data loading. Respondents to a recent survey by The Data Warehousing Institute cited numerous techniques as being important to their operational business intelligence (BI) strategy, including increasing the frequency of batch loads (38%) and trickle-feeding data into the EDW as events occur (24%).

But how do these front-runners who have already blazed the path to active load do it? What architectures are they using? Here we examine four of the many ways data gets from the front line into the EDW and back to the operational front-line users.

Simple extract, load and transform (ELT)
Active loading can be implemented using very simple ELT methods. One telecommunications company wanted to give its 4,000-plus call center customer service representatives (CSRs) more accurate and timely information to solve customer problems faster. When the company realized the EDW had 80% of the data needed, the goal became to load the other 20% into the EDW.

Every day as CSRs handle calls, rows are written into the call center database. Similarly, self-service interactive voice response records are captured showing customer menu choices. Since only 1,000 of these records accumulate every five minutes, the company decided to load them into staging tables with BTEQ mini-batches. iWay software is used for change-data capture to extract data from the source database. A BTEQ job then inserts these rows into a Teradata staging table every five minutes to begin the transformation process (a simplified sketch follows the list below), which includes:
Account and phone number matching
Call event categorization
Accommodating data from new customer channels
Moving the data into the final target base tables
Loading several hundred thousand records daily
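
A minimal sketch of one such BTEQ mini-batch, assuming hypothetical staging and target table names (and placeholder logon values) rather than the company's actual schema, might look like this:

  .LOGON tdpid/etl_user,password;

  /* Move the five-minute mini-batch from the staging table into the base table,
     applying the matching and categorization rules (simplified away here). */
  INSERT INTO contact_history
  SELECT stg.account_id,
         stg.phone_nbr,
         stg.call_ts,
         stg.call_category
  FROM   call_center_stage stg;

  /* Empty the staging table so the next mini-batch starts clean. */
  DELETE FROM call_center_stage;

  .LOGOFF;

In practice, the account matching, call categorization and new-channel handling appear as additional INSERT/SELECT or UPDATE statements between the load and the cleanup.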

Using the fresh data, a BEA WebLogic portal delivers recent contact history to the CSRs via Web services calls to the iWay integration server, which issues SQL requests to the Teradata Database. The application has a two-second response time service level agreement (SLA) for the active accesses. Averaging 50,000 inquiries a day with peak loads of 10,000 tactical queries an hour, the Teradata system easily supports concurrent call center BTEQ loading, MultiLoad jobs and hundreds of concurrent tactical queries.
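
The contact-history lookups are the tactical side of this workload. A representative request, again with made-up table, column and account values, is a short primary-index access such as:

  SELECT call_ts,
         contact_channel,
         call_category
  FROM   contact_history
  WHERE  account_id = 8005551234   /* primary index value taken from the CSR's screen */
  ORDER BY call_ts DESC;

Keeping these requests on the primary index path is a common way to hold a tight response-time SLA while loads run concurrently.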

TPump straight-through loading
In a second example, a major retailer wanted up-to-the-minute reports for supply chain managers and store managers. Operational managers use the reports to detect stores selling items faster or slower than expected. They wanted to apply the tactics of the stores that sell beyond expectations to the stores that are lagging behind. So, the retailer's goal was to produce near real-time reports and scorecards in fewer than five minutes—from printing the consumer's register receipt to operational reporting. With point-of-sale data already loading into the mainframe, they decided to pass the purchase receipt data to TPump via IBM WebSphere MQ. (See figure 1.)

Since a transaction receipt has many complex line items—stock keeping units (SKUs), payment type, price, coupon discounts, taxes, etc.—the programmers use an access module for transformations. The access module is a program written in C that is linked to TPump. The programmers need TPump to insert records into as many as 40 tables from one purchase transaction. For each transaction, the access module calls one of 40 subroutines to properly reformat and transform the incoming record types.

Because of the company's database design, all data is loaded via inserts, switching to an upsert only if a "record exists" error occurs. This approach doubles the loading throughput versus using only upserts. The access module routine also includes the "Notify Exit," which captures errors and sends them back to the access module to be processed in real time. The application then fixes the most common errors, thus increasing data accuracy while reducing downtime.
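
Stripped down to plain SQL, the pattern looks roughly like the sketch below; the table and column names are invented, the :field references stand in for values from the incoming record, and in production the logic is carried by the TPump script and access module rather than by standalone statements:

  /* Fast path: most line items are new, so try a plain insert first. */
  INSERT INTO sale_line_item (receipt_id, line_nbr, sku, qty, sale_amt)
  VALUES (:receipt_id, :line_nbr, :sku, :qty, :sale_amt);

  /* Only when the insert fails with a "record exists" (duplicate row) error
     does the job fall back to updating the existing row. */
  UPDATE sale_line_item
  SET    qty = :qty,
         sale_amt = :sale_amt
  WHERE  receipt_id = :receipt_id
  AND    line_nbr = :line_nbr;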

During the busy seasons, the system loads up to 4 million transactions per day with an average of 10 to 12 records per transaction (40 million to 50 million records per day). The CPU overhead for the entire process on the Teradata server is only 1% to 2%. Unfortunately, on peak sales days, the overloaded mainframe becomes a performance bottleneck. But the designers were smart and built the TPump processes and scripts to run on either UNIX or mainframe servers.

After the design was put into production in 2005, two obvious benefits developed:
The data is loaded continuously and only once into the Teradata system.
All of the maintenance is contained in a single access module, so new record types can be supported by adding subroutines.

TPump into staging tables
Because of the unpleasant reality that a competitor is only two clicks away, dot-com companies tend to think of everything in real time. Consequently, one travel company had so many real-time load requests that it had to prioritize projects by return on investment (ROI). To cope with the flood of data, the company designed a TPump-based architecture that could be replicated for different data streams with simple scripting changes. The variety of source data—click-streams, supplier prices, Web page graphics, offers and deals, sales bookings—and the service level goals demanded a robust design.

The design uses two TPump jobs, each servicing one WebSphere MQ end point. One of the two TPumps is in sleep mode while the other actively loads staging tables. The mini-batch "end-of-file" condition is forced every two minutes or after a specified number of rows have been loaded. Regardless of which happens first, the TPump job stops taking in new rows and starts transformations on its mini-batch via stored procedures. Meanwhile, the sleeping TPump job wakes up and starts accepting rows from WebSphere MQ into a second staging table. After the first TPump job finishes emptying the staging table into the base tables, it goes to sleep and waits for its next turn. (See figure 2.)
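
The transformation step that a TPump job triggers when its mini-batch closes can be pictured as a stored procedure along these lines; the table names are invented for the sketch, and the real procedures carry the company's business rules:

  REPLACE PROCEDURE drain_stage_a ()
  BEGIN
    /* Transform and move the closed mini-batch into the base table. */
    INSERT INTO bookings
    SELECT stg.booking_id,
           stg.property_id,
           stg.price_amt,
           stg.booking_ts
    FROM   bookings_stage_a stg;

    /* Empty the staging table so this TPump job can reuse it on its next turn. */
    DELETE FROM bookings_stage_a;
  END;

In this sketch, a twin procedure would drain the second staging table, so one TPump job can keep accepting rows from WebSphere MQ while the other job's batch is being emptied into the base tables.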

In 2005, the company started running eight TPumps, loading more than 1 million rows a day in near real time. The TPump jobs were also designed to catch up after a planned outage: the job reads large flat files of saved updates and blasts them in once the Teradata system is operational again. Today, the number of TPump load streams has expanded to cover the data whose latency must be kept short. When latency does not have to be measured in minutes, the programmers use MultiLoad and FastLoad mini-batches intra-day, followed by extract, transform and load (ETL) late at night.

This is the quintessential mixed workload application for a data warehouse. Consequently, Teradata Active System Management plays a vital role by managing service levels for the widely varied user communities and tasks. Using priority management techniques, the consumer Web site typically achieves one- to two-second performance, reports run normally and the load jobs run in spurts as the data arrives.

Ultimately, the company was able to feed its Web site with fresh data on price changes, offers, advertising graphics and consumer campaigns in real time using its Teradata system. While loading up to a million records a day with TPump, the company also concurrently serves 400,000 decision support queries daily and more than 1 million Web site requests to its Teradata system.

Replication and ELT
Another dot-com company—this one an e-tailer—also needed multiple load strategies running at the same time. One of the biggest concerns—after order bookings, of course—is marketing campaigns to online users. Like an air traffic control tower, this company has dozens of marketing people cross-selling to close added business, adjust ongoing deals and reach out to customers every day. Campaigns as well as deals could start one day and end the next. Marketing staff must invent campaigns, assess profitability, recruit customers to visit the Web site and sell products—all within 48 hours. Teradata Relationship Manager and Teradata Value Analyzer help relieve that pressure. (See figure 3.)

Because of SLAs on data latencies, more than 70% of the e-tailer's data has to be loaded intra-day, leaving mostly ERP data for nightly batch ETL. The company turned to three products for fast deployment of its load strategies: GoldenGate TDM (Transactional Data Management) for replication, Business Objects (an SAP company) Data Integrator for nightly batch ETL and Oracle's Sunopsis data integration tools for ELT transformations. With GoldenGate TDM, the company captures Oracle database "redo logs" from the Web site and loads the changes into Teradata staging tables at a rate of 5 million to 20 million records a day.

More than 650 Oracle tables are monitored for updates that are replicated to the Teradata system each day. Because GoldenGate TDM supports many data sources, it is also used to capture "redo logs" from Microsoft SQL Server applications. To feed the Business Objects virtual Rapid Marts inside the EDW, the company relies heavily on Sunopsis to run 180,000 mini-batch loads per day. Once the batches of data are in the staging tables, Sunopsis, coupled with a job-scheduling product, executes around 180 transformation jobs per hour throughout the day. A few BTEQ jobs are also run for lower-priority transformations.
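
Each of those transformation jobs is, at bottom, SQL pushed down into the Teradata system. A purely illustrative step, with hypothetical table and column names, might roll staged order rows up into a mart table:

  INSERT INTO order_mart (order_date, product_id, revenue_amt, order_cnt)
  SELECT CAST(stg.order_ts AS DATE),
         stg.product_id,
         SUM(stg.order_amt),
         COUNT(*)
  FROM   orders_stage stg
  GROUP BY 1, 2;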

This company's compelling results provide its management with full Business Objects dashboards that have hourly profit-and-loss summaries. Along with the expected merchandising reports, campaign results and return customer analysis, the company has push-button consumer metrics showing new customers, retention rates, repeat buys, average revenue by customer segment and product line and, of course, profit measures.

Common themes
These companies share at least three outcomes from their use of active load. First, the results dispel the myth that near real-time loading has to be expensive. One company said active loading consumed only 2% of its Teradata server capacity. While each company spent the typical project costs in labor, programming and operational tasks, nearly all said no hardware upgrade was required.

Second, each company relies on Teradata Active System Management to deliver on front-line user service levels while maintaining good load speed.

Last, each of these sites was continuously loading all day, every day—way back in 2005. These front-runners have done the debugging and established best practices for other companies to follow. That's good news for the EDW sites that will be joining the front-runners who already do intra-day, active loading.

Dan Graham has more than 35 years in IT and leads Active Data Warehouse Technical Marketing at Teradata.

Teradata Magazine, June 2008
