Selecting an ETL Product for a Clickstream Data Warehouse

 

By Mark Madsen

In the Chapter 8 of the book, Clickstream Data Warehousing, by Mark Sweiger, Mark Madsen, Jimmy Langston, and Howard Lombard, (Wiley, 2002), we discuss issues that make clickstream data warehouse ETL different from ETL for a data warehouse with no web-based data sources, and aspects of the “build versus buy” decision. In this short article the goal is to provide some basic evaluation criteria assuming that the decision is to purchase an ETL product rather than custom build this part of the data warehouse.

The process for evaluating and purchasing an ETL tool for a clickstream data warehouse is about the same as for any other data warehouse. The only thing that changes is the addition of some criteria that the product should meet to ensure that the purchase is suitable for clickstream data. Otherwise, you should plan for a standard product evaluation process.

The following section provides a set of six general criteria for ETL tool evaluation that apply to both clickstream and non-clickstream data. The seventh section contains a list of clickstream-specific ETL criteria that should be included in any tool evaluation.

1.                 Product Architecture

There are a number of different architectures for ETL products, from code generators to SQL rule-execution products to ETL engines. The ETL tool market has been tending toward engine-based products in the past few years. This generally means that the processing steps associated with transformation and cleansing are executed by a an ETL engine rather than via separately compiled code. The main things to look for with regard to how the product is implemented are:

                                                                                                                           

2.                 Data Extraction and Loading

Data extraction is closely related to platform support. The basic support one should look for is the required platforms, formats and data types in the environment. Beyond this, there are a number of important factors.

 

·         Ability to extract, directly or via third party modules, from packaged applications like Oracle Applications or SAP.

·         The ability to extract from multiple sources simultaneously and independently.

·         Support for both rule-based extracts (like “all new or updated orders since yesterday”) and changed-data capture that allows extraction of data when there is no rule that can be applied to the source data.

·         Both event-based and bulk data extraction. In some environments it’s important to take data in near real-time, so event-based models that send data may be required.

·         Multiple methods for data movement. For example, the product should support standard middleware as with ODBC, as well as messaging products like Tibco or MQ Series and file-based methods like file copying and ftp.

3.                 Data Transformation

The bulk of the work during ETL is in this area. This area is also fairly well understood by most programmers, so the criteria can be fairly obvious. The primary factors affecting evaluation include:

 

·         Rule transformations are the most obvious requirement. You want to be able to specify that transformation are executed only when certain conditions are met, and you want to be able to merge these rules together in many different ways. It pays to have samples from your environment and check to see how to implement them in the tool.

·         Pre-built rules and transformations. There are many frequently-encountered tasks that vendors can provide pre-built functionality to support. This can be simple things changing letter case to more complicated tasks such as name and address explosion from a single flat record.

·         Support for all basic mathematical and statistical functions.

·         Basic data cleansing functionality should be available.

·         Code set conversions and record parsing are a key component in environments with many legacy systems.

·         Support for third party data cleansing applications. Often, you can’t rely on the ETL vendor’s product for sophisticated cleanup and you don’t want to create this yourself. For example, many companies use a third party product if they need to handle customer name and address cleanup.

4.                 Metadata Management

Any reasonable ETL tool should have a built-in metadata repository that contains, at a minimum, the metadata concerning all of the data sources, targets and transformations. One of the goals of using an ETL tool is to allow for quicker and easier development, and management of what has been developed later. Without robust metadata management you may as well build your own ETL subsystem from scratch. The important criteria in this regard are:

 

·         Extensibility. The metadata repository should be easily extensible so that you can add items that the product did not take into consideration. The vendor typically tries for the least common denominator with metadata, so the ability to customize is often a requirement.

·         Open storage format. The metadata should be stored in an open, documented and easily accessible format. The obvious choice is in a relational database, preferably of your choice and not the vendor’s. There is no reason to store metadata in a proprietary format or non-ODBC compliant data store. If the ETL product does this then customization or integration with other products will be difficult if not impossible.

·         Modeling tool integration. Since most companies make use of a schema design or data modeling tool, the ETL product should be able to import and export data models to modeling tools. Lack of integration implies that the ETL tool better be able to do schema auto-discovery and save you the time of entering every table and column definition into the system.

·         Metadata sharing. Every ETL tool should have the ability to share metadata with third party applications. Typically, you should look for support of common metadata interchange standards. This sharing includes integration with business intelligence products. There’s little that’s more annoying to the data warehouse administrator than having to re-enter all of the target metadata from the ETL tool because it won’t exchange metadata with the client tools.

·         Reporting. Any ETL tool with a metadata repository should be able to report on the contents. It’s important to see things like what systems data is coming from, where it’s being used, what data is ignored, etc. To some extent, lack of reporting can be worked around if the metadata is stored in an open format.

5.                 Development Environment

During the project, the development environment is the place where the ETL programmers will spend all their time. Since different people have different preferences, it’s hard to put concrete criteria in place to cover all the kinds of things the environment should support. However, within that environment there are a few items that are key to an effective ETL tool:

 

·         Integrated toolset so the developers do not need to switch from one interface to another, as might happen when developing a transformation and then having to make changes to metadata.

·         Support for a graphical environment, particularly when dealing with scheduling and job dependencies.

·         Command line support for invoking jobs or entire sessions is particularly useful. This allows for the creation of much more flexible ETL processes. If you have to go back to the GUI to schedule and execute every job, it can be painful, especially when a job fails late at night.

·         Support for both batch and transaction style processing. Since there are cases where some data is fed as it comes in, the ability to take a single set of data through the entire process from beginning to end is helpful. Otherwise you are constrained to executing these trickle feeds in batch mode, which may not be the most efficient.

·         Session nesting is another useful capability. When developing an ETL session for a particular part of the system, you may later want to nest that within another ETL session. Some tools do not allow this, meaning duplicate development.

·         These days it’s obvious, but some tools still do not support sequential processing, which generally means support for cursors in relational databases. Without cursor support in a SQL environment, you can sometimes be forced into inefficient SQL processing or worse, a user exit to custom code.

·         Equally obvious, the ETL product should have robust SQL support. Often, it is much faster to execute steps in a process as generated SQL statements that as generated code that reads and writes to the database.

·         A capability that is sometimes overlooked is debugging functionality.

·         Another capability that is overlooked, and which can make an ETL developer’s life far easier, is the ability to do mass changes. For example, is a source system changes and several extract and transformation jobs change, you don’t want to locate and change every single occurrence. A one-time mass change is critical.

·         Related to mass changes is the ability of the product to do dependency tracking for the components that make up an ETL session. If a change is required in one place, it’s vital to be able to trace what other components might be affected.

·         Any ETL product should support version management. If there’s an error in a new ETL job, you want to be able to quickly roll back to what worked before rather than manually making code changes. Since a number of these tools do not provide access to any of the code that runs the jobs, your own version management may not work with the ETL product.

6.                 ETL Administration

After ETL development is complete, ease of administration becomes the topmost concern. The best ETL development environment in the world is useless if the process can’t be easily administered. More than one company has been burned by lack of operational support when trying to deploy a data warehouse. The key elements to look for are:

 

·         Centralized administration. It’s critical that there be a single place to view the entire ETL process, even if jobs are running across both the source and target platforms under different operating systems. Updating job schedules and managing jobs should be location-independent.

·         Error processing. When something goes wrong within an ETL session it is vital that you be able to pinpoint what went wrong with what job or task. Further, the product should support different mechanisms for notification, such as console messages and email alerts. Ideally, the error management is independent of the ETL session so that if the ETL engine dies, it’s still possible to send an alert.

·         Audit. When an ETL session completes, the data warehouse administrator should be able to view how much data was processed by each job, where the data came from, where it went to, and how long it took. Over time, this basic audit information will help resolve a lot of ETL problems.

·         Scheduling. The scheduling of tasks within an ETL session is often not a trivial task. There are many dependencies between jobs, making it hard to determine what tasks can continue to run when an earlier job fails. Since many operating platforms have minimal support for task scheduling, built-in support is helpful.

 

7.                 Clickstream ETL Support

To quickly recap what we said in the book about processing clickstream data, the general structure of the clickstream ETL process is shown in the figure below. Refer to Chapter 8 of the book, Clickstream Data Warehousing, for the details behind each of these steps and what additional criteria you might want to include in your own tool evaluation.

If you are purchasing an ETL explicitly for a clickstream data warehouse, then the following criteria should be used to evaluate that product’s clickstream data support.

·         Support for all standard web server log file formats

·         Support for advanced server logging formats and third party products like streaming media servers, proxy servers and cache servers.

·         Extensibility of the log format parsing code so that any customization done to the web application logging can be read by the ETL product.

·         Multi-byte character support, and Unicode in particular (important for multi-language web sites).

·         Pre-built transforms for web-specific tasks such as cookie parsing or session generation.

·         Pre-built functionality to deal with user and session tracking nuances, for example looking in both cookies and URL parameters for session or user IDs.

·         Support for inter-record and multi-record logic, important when dealing with page identification and associating page elements to pages, as well as many fact calculations.

·         Support for web application servers. This is important as many web application servers store vital site content and user information in some form of internal data store.

Conclusion

This list of evaluation criteria is based on our experiences evaluating and working with products from various vendors over the years. While it may not be as complete as many checklists, we feel that it covers the key aspects of an ETL product evaluation that are often missing from feature-based comparisons. The additional clickstream criteria are something that will evolve over time as ETL vendors become more sophisticated in their handling of web-based data sources, and as the web analytics vendors begin to merge with more mainstream business intelligence and ETL companies.