|
In the Chapter 8 of the book we discussed issues that make clickstream data warehouse ETL different from ETL for a data warehouse with no web-based data sources, and aspects of the build versus buy decision. In this short article the goal is to provide some basic evaluation criteria assuming that the decision is to purchase an ETL product rather than custom build this part of the data warehouse.
The process for evaluating and purchasing an ETL tool for a clickstream data warehouse is about the same as for any other data warehouse. The only thing that changes is the addition of some criteria that the product should meet to ensure that the purchase is suitable for clickstream data. Otherwise, you should plan for a standard product evaluation process.
The following section provides a set of six general criteria for ETL tool evaluation that apply to both clickstream and non-clickstream data. The seventh section contains a list of clickstream-specific ETL criteria that should be included in any tool evaluation.
1.
Product Architecture
There are a number of different architectures for ETL products, from code generators to SQL rule-execution products to ETL engines. The ETL tool market has been tending toward engine-based products in the past few years. This generally means that the processing steps associated with transformation and cleansing are executed by a an ETL engine rather than via separately compiled code. The main things to look for with regard to how the product is implemented are:
- Platform support. The support for source and target operating systems and databases is an obvious make-or-break item for most projects. Often ignored is the products execution platform. Some products support many sources and targets but execute only on a low-end platform like Microsoft NT, restricting the ability to scale to increased data volumes without adding more servers and the management and licensing overhead that entails.
- Modularized design. A modular design allows the product to be extended for additional sources, targets, or types of processing by purchasing add-on modules. This generally allows you to purchase only what you need and avoid being overburdened with features you will likely never use.
- Recovery logic. If there is a problem that results in a system crash during the ETL processing you want to be able to recover to the point it last completed successfully. Likewise, the ETL product itself should not require manual work to correct problems after a crash due to its handling of metadata, schedules or job states.
- Restart logic. Similar to recovery logic is the ability to restart ETL processing if a processing step fails to execute properly. You want the ability to restart processing at the step where it failed as well as the ability to restart the entire ETL session.
- Intermediate storage support. The product should allow for the use of intermediate storage between processing steps. Typically, this is in a relational database, but the product should allow staging and interim processing in both databases and files. A key item here is the support for simulataneous activities regardles of location. For example, you might have data coming from a relational source, to be combined with data from a flat file. You should not be forced to bring all data into a single format prior to working with it.
- Parallel support. This means different things to different vendors. One component of parallel execution is that the product should be capable of executing multiple different jobs or modules concurrently. The second component is the ability of the product to support multi-threaded operation and parallel code execution internally, so that a given job can take advantage of the inherent parallelism of the platform it is running on. The third component is load balancing, if applicable. The product should be able to distribute load across multiple servers if it operates under a centralized control mode.
- ETL Reporting. Every product should provide reports on ETL sessions and jobs so you can see how the system is functioning.
- Online help. Help should be provided for both developer and administrator tools, and should be available in an online format that allows access from within the ETL environment as well as seprately.
2. Data Extraction and Loading
Data extraction is closely related to platform support. The basic support one should look for is the required platforms, formats and data types in the environment. Beyond this, there are a number of important factors.
- Native support for the primary RDBMS vendors, as well as ODBC support for a broader set of databases.
- The capability to use both push and pull methods for all supported platforms.
- Support for non-relational data formats like VSAM and IMS, and data formats like XML.
- Ability to convert between character formats like ASCII and EBCDIC.
- Support for star schemas, including surrogate key generation, dimension support, and dimensional aggregates.
- Ability to extract, directly or via third party modules, from packaged applications like Oracle Applications or SAP.
- The ability to extract from multiple sources simultaneously and independently.
- Support for both rule-based extracts (like all new or updated orders since yesterday) and changed-data capture that allows extraction of data when there is no rule that can be applied to the source data.
- Both event-based and bulk data extraction. In some environments its important to take data in near real-time, so event-based models that send data may be required.
- Multiple methods for data movement. For example, the product should support standard middleware as with ODBC, as well as messaging products like Tibco or MQ Series and file-based methods like file copying and ftp.
- Support for native database features, including SQL feature support and especially integration with database bulk loaders for fast loading of data stored in files.
3. Data Transformation
The bulk of the work during ETL is in this area. This area is also fairly well understood by most programmers, so the criteria can be fairly obvious. The primary factors affecting evaluation include:
- Rule transformations are the most obvious requirement. You want to be able to specify that transformation are executed only when certain conditions are met, and you want to be able to merge these rules together in many different ways. It pays to have samples from your environment and check to see how to implement them in the tool.
- Pre-built rules and transformations. There are many frequently-encountered tasks that vendors can provide pre-built functionality to support. This can be simple things changing letter case to more complicated tasks such as name and address explosion from a single flat record.
- Support for all basic mathematical and statistical functions.
- Basic data cleansing functionality should be available.
- Code set conversions and record parsing are a key component in environments with many legacy systems.
- Support for third party data cleansing applications. Often, you cant rely on the ETL vendors product for sophisticated cleanup and you dont want to create this yourself. For example, many companies use a third party product if they need to handle customer name and address cleanup.
4. Metadata Management
Any reasonable ETL tool should have a built-in metadata repository that contains, at a minimum, the metadata concerning all of the data sources, targets and transformations. One of the goals of using an ETL tool is to allow for quicker and easier development, and management of what has been developed later. Without robust metadata management you may as well build your own ETL subsystem from scratch. The important criteria in this regard are:
- Extensibility. The metadata repository should be easily extensible so that you can add items that the product did not take into consideration. The vendor typically tries for the least common denominator with metadata, so the ability to customize is often a requirement.
- Open storage format. The metadata should be stored in an open, documented and easily accessible format. The obvious choice is in a relational database, preferably of your choice and not the vendors. There is no reason to store metadata in a proprietary format or non-ODBC compliant data store. If the ETL product does this then customization or integration with other products will be difficult if not impossible.
- Modeling tool integration. Since most companies make use of a schema design or data modeling tool, the ETL product should be able to import and export data models to modeling tools. Lack of integration implies that the ETL tool better be able to do schema auto-discovery and save you the time of entering every table and column definition into the system.
- Metadata sharing. Every ETL tool should have the ability to share metadata with third party applications. Typically, you should look for support of common metadata interchange standards. This sharing includes integration with business intelligence products. Theres little thats more annoying to the data warehouse administrator than having to re-enter all of the target metadata from the ETL tool because it wont exchange metadata with the client tools.
- Reporting. Any ETL tool with a metadata repository should be able to report on the contents. Its important to see things like what systems data is coming from, where its being used, what data is ignored, etc. To some extent, lack of reporting can be worked around if the metadata is stored in an open format.
5. Development Environment
During the project, the development environment is the place where the ETL programmers will spend all their time. Since different people have different preferences, its hard to put concrete criteria in place to cover all the kinds of things the environment should support. However, within that environment there are a few items that are key to an effective ETL tool:
- Integrated toolset so the developers do not need to switch from one interface to another, as might happen when developing a transformation and then having to make changes to metadata.
- Support for a graphical environment, particularly when dealing with scheduling and job dependencies.
- Command line support for invoking jobs or entire sessions is particularly useful. This allows for the creation of much more flexible ETL processes. If you have to go back to the GUI to schedule and execute every job, it can be painful, especially when a job fails late at night.
- Support for both batch and transaction style processing. Since there are cases where some data is fed as it comes in, the ability to take a single set of data through the entire process from beginning to end is helpful. Otherwise you are constrained to executing these trickle feeds in batch mode, which may not be the most efficient.
- Session nesting is another useful capability. When developing an ETL session for a particular part of the system, you may later want to nest that within another ETL session. Some tools do not allow this, meaning duplicate development.
- These days its obvious, but some tools still do not support sequential processing, which generally means support for cursors in relational databases. Without cursor support in a SQL environment, you can sometimes be forced into inefficient SQL processing or worse, a user exit to custom code.
- Equally obvious, the ETL product should have robust SQL support. Often, it is much faster to execute steps in a process as generated SQL statements that as generated code that reads and writes to the database.
- A capability that is sometimes overlooked is debugging functionality.
- Another capability that is overlooked, and which can make an ETL developers life far easier, is the ability to do mass changes. For example, is a source system changes and several extract and transformation jobs change, you dont want to locate and change every single occurrence. A one-time mass change is critical.
- Related to mass changes is the ability of the product to do dependency tracking for the components that make up an ETL session. If a change is required in one place, its vital to be able to trace what other components might be affected.
- Any ETL product should support version management. If theres an error in a new ETL job, you want to be able to quickly roll back to what worked before rather than manually making code changes. Since a number of these tools do not provide access to any of the code that runs the jobs, your own version management may not work with the ETL product.
6. ETL Administration
After ETL development is complete, ease of administration becomes the topmost concern. The best ETL development environment in the world is useless if the process cant be easily administered. More than one company has been burned by lack of operational support when trying to deploy a data warehouse. The key elements to look for are:
- Centralized administration. Its critical that there be a single place to view the entire ETL process, even if jobs are running across both the source and target platforms under different operating systems. Updating job schedules and managing jobs should be location-independent.
- Error processing. When something goes wrong within an ETL session it is vital that you be able to pinpoint what went wrong with what job or task. Further, the product should support different mechanisms for notification, such as console messages and email alerts. Ideally, the error management is independent of the ETL session so that if the ETL engine dies, its still possible to send an alert.
- Audit. When an ETL session completes, the data warehouse administrator should be able to view how much data was processed by each job, where the data came from, where it went to, and how long it took. Over time, this basic audit information will help resolve a lot of ETL problems.
- Scheduling. The scheduling of tasks within an ETL session is often not a trivial task. There are many dependencies between jobs, making it hard to determine what tasks can continue to run when an earlier job fails. Since many operating platforms have minimal support for task scheduling, built-in support is helpful.
7. Clickstream ETL Support
To quickly recap what we said in the book about processing clickstream data, the general structure of the clickstream ETL process is shown in the figure below. Refer to Chapter 8 for the details behind each of these steps and what additional criteria you might want to include in your own tool evaluation.
If you are purchasing an ETL explicitly for a clickstream data warehouse, then the following criteria should be used to evaluate that products clickstream data support.
-
Support for all standard web server log file formats
- Support for advanced server logging formats and third party products like streaming media servers, proxy servers and cache servers.
- Extensibility of the log format parsing code so that any customization done to the web application logging can be read by the ETL product.
- Multi-byte character support, and Unicode in particular (important for multi-language web sites).
- Pre-built transforms for web-specific tasks such as cookie parsing or session generation.
- Pre-built functionality to deal with user and session tracking nuances, for example looking in both cookies and URL parameters for session or user IDs.
- Support for inter-record and multi-record logic, important when dealing with page identification and associating page elements to pages, as well as many fact calculations.
- Support for web application servers. This is important as many web application servers store vital site content and user information in some form of internal data store.
7. Conclusion
This list of evaluation criteria is based on our experiences evaluating and working with products from various vendors over the years. While it may not be as complete as many checklists, we feel that it covers the key aspects of an ETL product evaluation that are often missing from feature-based comparisons. The additional clickstream criteria are something that will evolve over time as ETL vendors become more sophisticated in their handling of web-based data sources, and as the web analytics vendors begin to merge with more mainstream business intelligence and ETL companies. |