Do your data warehousing efforts help or hinder creativity?
by Rob Armstrong, director of data warehouse support at Teradata
During a recent panel discussion among data warehouse practitioners, we contemplated the pros and cons of a centralized versus distributed
architecture. An argument that caught my attention was whether centralization spurs or spurns innovation.
My contention for centralization was that the data warehouse increases a company's ability to be innovative. Another panelist, arguing that
the data warehouse inhibits creativity and exploration, advocated a distributed architecture. We agreed in the end that both viewpoints were
correct.
In data warehousing, the time it takes to move from originating an idea, through proving or disproving it with analytics, to turning it
into an action item determines whether the innovation is a success. If gathering data is cumbersome, innovation stagnates. If
combining new data in new ways is easy, innovation is supported.
Centralized argument
The main role of the enterprise data warehouse (EDW) is to link critical data from multiple subject areas to provide a consistent and
cohesive view of the business. To preserve the data relationships, the data is extracted from source systems and transformed into a
normalized model. As more subject areas are added to the data warehouse, more relationships can be explored.
Innovation progresses when the data is integrated across the enterprise or subject areas. Then, as new ideas emerge, users do not need to
reconcile data inconsistencies during queries and analyses. Rather, the data is available in a neutral data model and can be accessed
directly. This eliminates the need to index, denormalize, build cubes and conduct other time-consuming tasks when new sets of analytics are
run.
In this sense, the data warehouse accelerates "enterprise innovation." This is evident in cross-functional opportunities when data
integration from multiple subject areas is critical to getting consistent and meaningful answers.
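To make the cross-functional point concrete, here is a minimal sketch of a query that spans two subject areas in a normalized model. It uses Python's built-in sqlite3 module purely as a stand-in for a warehouse; the tables (customer, sales_order, support_ticket) and their contents are hypothetical and are not taken from any Teradata schema.

```python
import sqlite3


def build_integrated_model():
    """Create a tiny normalized model with two subject areas keyed on customer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE sales_order (order_id INTEGER PRIMARY KEY,
                                  customer_id INTEGER REFERENCES customer(customer_id),
                                  amount REAL);
        CREATE TABLE support_ticket (ticket_id INTEGER PRIMARY KEY,
                                     customer_id INTEGER REFERENCES customer(customer_id),
                                     severity TEXT);
        INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO sales_order VALUES (10, 1, 500.0), (11, 1, 250.0), (12, 2, 100.0);
        INSERT INTO support_ticket VALUES (20, 1, 'high'), (21, 2, 'low'), (22, 2, 'high');
    """)
    return conn


def revenue_and_tickets(conn):
    """Combine the sales and support subject areas through the shared customer key.

    Correlated subqueries keep the two aggregates independent, so orders are
    not multiplied by the number of tickets.
    """
    return conn.execute("""
        SELECT c.name,
               (SELECT COALESCE(SUM(o.amount), 0)
                  FROM sales_order o
                 WHERE o.customer_id = c.customer_id) AS revenue,
               (SELECT COUNT(*)
                  FROM support_ticket t
                 WHERE t.customer_id = c.customer_id) AS tickets
          FROM customer c
         ORDER BY c.name
    """).fetchall()


if __name__ == "__main__":
    for row in revenue_and_tickets(build_integrated_model()):
        print(row)
```

Because both subject areas already share one customer key, the new question needs no reconciliation step, only a query.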
Departmental argument
By using a data mart, the business community can leverage data without worrying about its quality or prioritization since the data will be
loaded into a separate database more quickly and easily than into the EDW where integration is required. Because the data can be loaded more
quickly and departmental users can customize the data model to their own needs, analytics can run without any delays from enterprise
oversight and without contention from other queries and workloads. From this perspective, the data marts clearly enhance "functional
innovation."
Freedom with structure
This quandary of needing departmentalized data sets but still needing enterprise consistency and integration of that data led Teradata to
introduce the Teradata Data Warehouse Appliance 2550 and improve priority scheduling with Teradata Active System Management.
These solutions give the departmental users two options:
1. Leverage the smaller platform as an adjunct of the EDW to maintain enterprise consistency and governance but at a lower cost
and with less rigor.
2. Use Teradata Active System Management to create logical sandboxes within the EDW, thereby preserving both enterprise and
functional innovation. In this logical data model, sandboxes are dedicated data areas that fall outside the IT rigor but are
incorporated into the total EDW architecture.
Both solutions leverage the same SQL, load tools, connectivity and database management. These consistencies minimize IT's efforts to fully
integrate the new data or applications into the EDW when it makes sense. Hence, time is saved and all types of innovation are supported.
Play in the sand
To be innovative, companies must be agile at the departmental level while providing integration at the enterprise level. Sandboxes can help.
Because they can be housed either on the EDW or in other platforms that are tightly linked to the EDW, sandboxes enable individual companies
to quickly add new data elements and exploratory data sets to their total data warehouse environment without creating a series of unrelated
data fiefdoms. Let's explore the two housing options:
Building sandboxes within the EDW
One option is to build a sandbox inside an existing EDW since much of the data necessary for any new analysis is most likely already in the
warehouse. Giving the users a portion of disk space allows them to load new data, create extracted exploratory data sets or add data of
personal interest to increase their ability to segment, correlate or perform new analytics. This sandbox data is not of production quality
and is administered by the users.
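As a rough illustration of a physical sandbox that sits beside production data, the sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse. The attached database named sandbox plays the role of the user-owned disk space, and every table name and value is hypothetical. It is not Teradata syntax, only a way to show the pattern of loading new data and creating an exploratory data set next to production tables.

```python
import sqlite3


def build_physical_sandbox():
    """Return a connection with a 'production' schema and a user-owned sandbox."""
    conn = sqlite3.connect(":memory:")
    # The default 'main' schema stands in for production; the attached
    # in-memory database stands in for space handed over to the users.
    conn.execute("ATTACH DATABASE ':memory:' AS sandbox")
    conn.executescript("""
        CREATE TABLE main.sales (customer_id INTEGER, amount REAL);
        INSERT INTO main.sales VALUES (1, 500.0), (2, 100.0), (1, 250.0);

        -- New data loaded by the users themselves (not production quality).
        CREATE TABLE sandbox.campaign_response (customer_id INTEGER, responded INTEGER);
        INSERT INTO sandbox.campaign_response VALUES (1, 1), (2, 0);

        -- Exploratory data set extracted from production plus the new data.
        CREATE TABLE sandbox.responder_sales AS
            SELECT s.customer_id, SUM(s.amount) AS total
              FROM main.sales s
              JOIN sandbox.campaign_response r ON r.customer_id = s.customer_id
             WHERE r.responded = 1
             GROUP BY s.customer_id;
    """)
    return conn


def responder_sales(conn):
    """Read the exploratory data set back out of the sandbox."""
    return conn.execute(
        "SELECT customer_id, total FROM sandbox.responder_sales ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    print(responder_sales(build_physical_sandbox()))
```

The production table is only read, never modified, and everything the users create lives under the sandbox name.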
This option employs the parallelism of the production database. Data movement is minimized, and the same tools, utilities, modeling and SQL
are available that are in use at the production level. This commonality will ease the eventual migration of the data from the sandbox to the
production environment. It will also help IT understand the complexity and integration issues that must be addressed when the data is moved
into the EDW so they can perform a more accurate cost-benefit analysis when deciding on upcoming projects.
Another benefit of this approach is that sandboxes need not be implemented only physically, as described above. They can also be
implemented virtually using views, or through a combination of the two. Any data already on the system can be accessed and manipulated via
views to further cut down the amount of data duplication and movement.
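A view-based sandbox can be sketched in a few lines. The example below again uses Python's built-in sqlite3 module as a stand-in, with a hypothetical sales table and a hypothetical sbx_ naming prefix for sandbox objects; none of these names come from Teradata. It shows that a view copies no rows and reflects new production data as soon as it arrives.

```python
import sqlite3


def build_virtual_sandbox():
    """Create a 'production' table and a sandbox view defined on top of it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
        INSERT INTO sales VALUES
            ('East', 'widget', 120.0),
            ('East', 'gadget', 80.0),
            ('West', 'widget', 200.0);

        -- Virtual sandbox object: only the definition is stored, no data is duplicated.
        CREATE VIEW sbx_widget_sales_by_region AS
            SELECT region, SUM(amount) AS total
              FROM sales
             WHERE product = 'widget'
             GROUP BY region;
    """)
    return conn


def widget_totals(conn):
    """Query the sandbox view; results are computed from production rows on demand."""
    return conn.execute(
        "SELECT region, total FROM sbx_widget_sales_by_region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    conn = build_virtual_sandbox()
    print(widget_totals(conn))
    # A new production row is visible through the view without any reload.
    conn.execute("INSERT INTO sales VALUES ('East', 'widget', 30.0)")
    print(widget_totals(conn))
```

The trade-off is that every query against the view does its work on the production tables, which is the workload contention the article turns to next.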
This centralized-system approach, however, places the new sandbox in competition with the production workloads, even with Teradata Active
System Management. The upside is that, because the data and corresponding applications are headed toward integration from the sandbox into
the production system, IT can better understand the overall performance, impact and concurrency levels that must be addressed in the
workload management profiles.
Building sandboxes outside the EDW
A common approach to building sandboxes is to use a secondary platform. A platform outside of the data warehouse eases the burden on the IT
community while responding to the immediate challenges faced by the users. It also allows the users to take charge of their own environment.
But a secondary platform must be undertaken carefully lest anarchy take hold and the overall data warehouse effort suffer. Care must
be taken that the exploratory sandbox environment does not become a "shadow production system" and usurp the role of the data warehouse.
Accomplishing this balance between user freedom and enterprise consistency requires governance and cooperation among the user, IT and
executive communities. The business community must agree to reuse data from the production system whenever possible. IT must agree to assist
business in ensuring that data models, load processes and tool sets are aligned with corporate strategy and the centralized data warehouse.
And the executive community must ensure that the sandbox is funded, that its use is contained to exploration and innovation and, finally,
that the resulting data and corresponding applications will later be integrated into the EDW.
By taking this secondary-platform approach, users can quickly and easily add specialized data to current data sets, create analytical data
sets from production pulls and play "what if" games. They can also combine various data sets to test new hypotheses without having to compete
with prioritized work as they would with a sandbox built within the EDW.
A secondary-platform sandbox can provide users with insight and spur innovation, and also help them calculate business benefits. The users
can try new ideas and gauge their effectiveness before lobbying the steering committee to include the data and applications in the EDW. This
allows them to be innovative by quickly testing new ideas and incorporating only the best ones into the enterprise processes.
A sandbox is not concrete
It is important that neither sandbox approach become a replacement for a company's enterprise goals. The sandbox is a good playground for
users, but that is how it should remain. Data in the sandbox should be managed by the users, not by IT. The sandbox is not covered by
disaster recovery strategies, no archives are taken, and no long-term storage is created.
One common situation to watch for is that the users like the sandbox results so much that they demand that the analytics and reporting that were
developed as a test immediately become part of production and managed by IT. This is where the EDW vision starts to fall apart.
Certainly, the users can have their sandbox, but governance requires that any application or analytics derived from the sandbox go through a
proper cutover methodology so that any "production applications" are run against the main EDW.
Limiting the sandbox to a short-term environment encourages users to effectively leverage it for individual innovation. Data can and should
be loaded and played with, but once proven valuable, the data must be prioritized for inclusion in the EDW so as to enable further enterprise
innovation.
Teradata Magazine-December 2008