Clickstream Data Warehousing
      Companion website for the book
      by Mark Sweiger, Mark Madsen, Jimmy Langston, and Howard Lombard

             Published by John Wiley & Sons, January 2002


Book Table of Contents
Clickstream Data Warehousing is written in two parts, comprising nine chapters in total and prefaced by an Introduction.

The Introduction describes the book's purpose, intended audience, and structure.

Part 1, called "Clickstream Data Warehouse Architectural Foundations", explains, in careful detail, web technology and infrastructure as they relate to a clickstream data warehouse. Part 1 has four chapters:
  • Chapter 1, "A Typical e-Business Architecture", describes the components of an e-business information system architecture and shows how they relate to clickstream data warehousing. The environment includes client user systems, ISPs, web servers, application servers, cached content servers, advertising engines, search engines, business transaction servers, the public Internet, common carriers, private intranets, and, of course, clickstream data warehouses. This canonical architecture is used throughout the rest of the book.
  • Chapter 2, "The Web Application Environment", describes the unusual stateless web application environment, and introduces basic concepts like HyperText Transfer Protocol (HTTP), HTTP header fields, query strings, Common Gateway Interface (CGI), cookies, web server log records, scripting languages, web servers, and application servers. The rest of the book assumes that the reader is familiar with these concepts, making this a critical architectural foundation chapter.
  • Chapter 3, "Clickstream Data Sources and Web Server Log Files", is an in-depth analysis of the web-based data sources for a clickstream data warehouse. It covers the basics of web server log files and standard log file formats and provides a detailed analysis of the many issues one encounters with web-based data before moving on to other data sources like cache servers, web application servers and media servers. Log file data is the major clickstream data source, and we consider this chapter to be the benchmark source for information on the formats and use of log file data.
  • Chapter 4, "Using Cookies and Other Mechanisms to Track User Identity", delves into the critical issue of establishing and tracking user identity on the web and in the data. This chapter describes the mechanisms used to manage user sessions, identify users and track them across visits, as well as some of the data management issues involved. It also discusses the business and ethical issues surrounding user identity and user privacy, which have to be carefully thought through in any clickstream data warehouse implementation.
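As a rough illustration of the web server log records that Chapters 2 and 3 refer to, the sketch below parses one record in the NCSA Common Log Format, one of the standard log file formats. It is not taken from the book: the function name `parse_clf_line`, the field names, and the sample record are assumptions made for this example.

```python
import re
from datetime import datetime

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
# The group names below are illustrative choices, not names from the book.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Parse one Common Log Format record into a dict, or return None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 10/Oct/2001:13:55:36 -0700
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means no bytes were returned
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # The request field is normally "METHOD URL PROTOCOL"
    parts = record["request"].split()
    if len(parts) >= 2:
        record["method"], record["url"] = parts[0], parts[1]
    else:
        record["method"], record["url"] = None, None
    return record


# Hypothetical sample record for illustration only
sample = ('192.0.2.1 - - [10/Oct/2001:13:55:36 -0700] '
          '"GET /index.html?ref=ad42 HTTP/1.0" 200 2326')
record = parse_clf_line(sample)
print(record["host"], record["method"], record["url"], record["status"], record["size"])
```

The URL field keeps its query string, which is where clickstream attributes such as referral or promotion codes typically travel.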
Part 2, "Building a Clickstream Data Warehouse, Step-by-Step", is a handbook on how to design and implement a clickstream data warehouse. This portion of the book covers all the implementation issues, from project staffing and management, to schema design, to extract, transformation and load, to end-user analysis. Part 2 has five chapters:
  • Chapter 5, "Planning, Managing and Staffing a Clickstream Data Warehouse Project", contains a nutshell description of all the phases of a clickstream data warehouse project. In a series of "Lessons Learned the Hard Way" sections, we provide insight into how to avoid the problems and pitfalls of a typical project. The chapter ends with a discussion of project roles, staffing needs and how to organize the project team, including potential organization charts.
  • Chapter 6, "The Clickstream Data Warehouse Meta-Schema", is one of the most important chapters in the book. This chapter describes the clickstream data warehouse meta-schema, a universal template used to guide the logical design of any clickstream data warehouse schema. The meta-schema includes the User Activity Fact table and 10 possible dimensions: User, Content, Activity, User Time, Fiscal Time, Physical Geography, Web Geography, Site Geography, Internal Promotion, and External Promotion. This template serves as a vehicle to bridge the communication gap between business users and schema designers, and it ensures that all the important facts, dimensions, and attributes of a clickstream data warehouse are considered in the course of the logical schema design.
  • Chapter 7, "Implementing the Appropriate Clickstream Data Warehouse Technology Infrastructure", contains information you will find nowhere else on the physical database design and technology infrastructure of a clickstream data warehouse. We find that most books on data warehousing gloss over physical design and technology infrastructure issues, leaving this very complex subject as an exercise for the reader. This chapter discusses:
  1. Efficient bulk and batch load techniques
  2. Table partitioning, including range, hash and composite partitioning
  3. Indexing, including B-tree, bitmapped, function, and partitioned indexes
  4. Joins, including star joins, the Oracle star transformation, and partition-wise joins
  5. Dimensional aggregate management, including aggregate creation candidates, database optimizer aggregate awareness, aggregate navigation, and materialized views
  6. Database parallelism, including tips on how to parallelize database operations using block, key, and hash partitioning, as well as parallel database processes
  7. New extensions to SQL to support clickstream data warehousing, including Top-N ranking inside SQL statements, the ROLLUP operator for aggregate creation, and the CUBE operator for cross-tabulation
  8. Disk drive and logical volume management, including concatenation, striping, mirroring, RAID plexes, etc., and how database objects like tables, indexes, tablespaces, log volumes, and temporary tablespaces map onto volumes and disk drives
  9. A section on choosing products from various vendors, including database software vendors, logical volume management software vendors, and disk subsystem vendors
  • Chapter 8, "Building the Clickstream Extract, Transformation and Load Mechanism", describes the fundamentals of clickstream data warehouse extract, transformation and load. The chapter includes a detailed discussion of the 8 steps needed to build a clickstream ETL mechanism. It also has an extensive example that shows exactly how clickstream data is processed into the warehouse.
  • Chapter 9, "Analyzing the Data in the Clickstream Data Warehouse", offers practical solutions to the problem of querying very large clickstream data warehouses, including a discussion of relational OLAP, multidimensional OLAP, and hybrid OLAP query environments. The chapter also shows how the chosen OLAP query environment can utilize efficient query techniques like partition elimination, materialized views, and server-side GROUP BY calculations to meet end-user performance expectations.


© Copyright 2001, 2002 Clickstream Consulting, All Rights Reserved