Clickstream Data Warehousing
      Companion website for the book
      by Mark Sweiger, Mark Madsen, Jimmy Langston, and Howard Lombard

             Published by John Wiley & Sons, January 2002


Search this site or  the web
powered by FreeFind




    Site search
    Web search

Web Server Log Files: The Ultimate Clickstream Data Source

  By Mark Sweiger, President and Principal, Clickstream Consulting

In the last article, I shed some light on the inner workings of the mysterious cookie file, the principal user identity mechanism for a clickstream data warehouse. Armed with that knowledge, we now turn our attention to the principal data source for a clickstream data warehouse, Web server log files. While clickstream data warehouses typically have multiple data sources just like any other data warehouse, one type of data source dominates all the others, namely the records contained in Web server log files.

In many ways, Web server log file records are a dream data source. They are automatically emitted by every popular type of web server, including Apache, Microsoft IIS, IBM Websphere, BEA WebLogic, etc. There is one record for each HTTP transaction, providing a very fine granularity of user activity detail. And unless the site chooses to customize its logging, log records have a universal standard format, the NCSA Combined Log File Format, making them exceptionally easy to parse and manipulate.

Let’s take a look at the Combined Log File Format and see what’s inside. The Combined Log File Format contains the following fields:

  1. h = The remote requestor ID, in IP format (example: 165.32.124.3) if Directory Name Service (DNS) is not enabled
  2. I = The remote username via identd, which is blank (indicated by a ‘-’) unless the client is Unix or Linux.
  3. u = The HTTP authenticated username, which is likely blank unless your site requires user registration
  4. r = The entire text of the HTTP request
  5. s = The returned HTTP status code
  6. b = The number of bytes in the returned page
  7. {referer}i = most recent referring URL, taken from the ‘referer’ field of the HTTP request header
  8. {user-agent}i = the requesting software type, usually a browser like Microsoft IE or Netscape, but it can also be things like search engine robots
Here is an example of three Combined Log File Format log records from my company’s web site, ClickstreamConsulting.com:

204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"

204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"

204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"

Let’s examine the first log record. The first field, ‘204.243.130.5’, is the IP address of system issuing the HTTP requests. It is quite likely that this IP address is an Internet proxy for the client, like a server at his Internet Service Provider (ISP). As a matter of course, I run a DNS mapper against our site’s log file to decode IP addresses into real host names (like earthlink.net, or ibm.com), but in this case the IP address did not correspond to a registered domain name and so it was not decoded. This happens more often than you might guess, somewhere between 20% and 50% of the time in my experience. The next two fields in the log record are “blanks”, which are indicated by the two hyphen placeholders. These blank fields indicate that the client was not authenticated by identd or by HTTP authentication, which makes sense because currently my site does not require any authentication to access it. The next field, [26/Feb/2001:15:34:52 -0600], is the GMT time when the user’s request was logged. The next field, "GET / HTTP/1.0", is the actual text of the HTTP request, which in this case is a GET of my site’s home page using the shorthand “/” for the root page of the site. The next field, ‘200’, is the returned HTTP status code, and 200 means that the operation was successful. The next field following the status code, ‘8437’, is the actual number of bytes transferred to the client as the result of the HTTP transaction. If the number of bytes transferred is not equal to the number of bytes in the actual home page, then we know that the user abandoned the GET by hitting STOP, BACK, or by clicking through to another link before the page was fully loaded. The next field, "http://metacrawler.com/crawler?general=dimensional+modeling", is the referring URL, and in this case the referrer is the search engine metacrawler.com, which found my site as the result of a search for sites that mention the term “dimensional modeling”, which is specified by the query string parameters after the ‘?’ in the URL. The last field in the log record, "Mozilla/4.5 [en] (Win98; I)" indicates which type of browser and OS the user employed to get this page, and in this case the browser was Microsoft IE version 4.5 running on a Windows 98 system.

Moving on to the second log record, we see that it records an HTTP GET of the image “logo1.gif”, which is the file containing the logo of Clickstream Consulting. This additional GET is embedded inside the HTML that generates the home page of ClickstreamConsulting.com, which resulted in the creation of the first log record above. The existence of this second log record for the home page illustrates a key concept: Most page views generate multiple HTTP transactions, resulting in multiple log records per page view.

The final log record in the example indicates that the user decided to visit our site’s “Articles and Information” link, whose content is in the file “articles.html”. This page contains a variety information on subject of clickstream data warehousing. By subtracting the logged GMT time in this record from the GMT time in previous log record, we can see that the user viewed our home page for 34 seconds before clicking on the link for the “Articles and Information” page. Since no more log records for this IP address appear after this one, the exited the site probably by clicking on one of the many external links on “Articles and Information” page, making it the exit page for this user’s visit.

While the fields contained in the standard Combined Log File Format are obviously useful as the clickstream data source, additional fields can be added to the log record by adding additional objects to the log record template used by your Web server’s logging module. Virtually all Web server software supports extensions to the format of the log file.

For example, if your site creates client-side cookies, you should always append the resultant cookie text to your Web server log file records. In an Apache or Netscape web server environment, adding the string “{cookie}i” to the log record format causes the client-side cookie to get logged as the string “cookie-variable-name=value” for each HTTP transaction. A logged cookie is a much more accurate way of establishing user identity than a host IP address. Many ISPs, including AOL and Microsoft Network (MSN), change the user IP address with every transaction, making it impossible to group log records into a visit by a single user using the IP address, as we did in the above example. Unless the log file format is extended to log a user’s cookie, it would be impossible to do visit analysis on users that come from those ISPs since one cannot establish which set of records pertain to a particular user because of the ever-changing IP addresses. And because AOL and MSN have such a large combined user community, currently numbering in the tens of millions, failure to include the cookie in the log record could mean that the majority of a site’s clickstream data could not be analyzed at a visit level of granularity.

In the next article we will dive into the clickstream data warehouse extract, transformation and load (ET&L), that is, the process of taking the flat file data contained in the raw log file records and loading it into a dimensional clickstream data warehouse.

Book Endorsements     Book Table of Contents    Book Authors    Referenced Material  
  Related Articles    Related Links    Download a Project Plan    Discussion Forum     Links    

© Copyright 2001, 2002 Clickstream Consulting, All Rights Reserved