|
|
By Mark Madsen, Principal, Clickstream Consulting
Portions of this article originally appeared in Database Programming and Design Magazine, now Intelligent Enterprise Magazine.
Introduction
The purpose of this article is to provide some practical guidance on how to implement a sensible aggregation strategy for a data warehouse. The goal is to help answer the questions "How do I choose which aggregates to create," "How do I create and store aggregates," and "How do I monitor and maintain aggregates in a database?" The information in this article has been gathered from several years of consulting in the relational decision support market. This article assumes some familiarity with dimensional or "star" schema design, as this forms the base from which data is aggregated.
Approaches to Aggregation
Before trying to answer the questions mentioned above, there are some basic tradeoffs to keep in mind. Creating an aggregate is really summarizing and storing data which is available in the fact table in order to improve the performance of end-user queries. There are direct costs associated with this approach: the cost of storage on the system, the cost of processing to create the aggregates, and the cost of monitoring aggregate usage. We are trading these costs against the need for query performance.
There are three approaches to aggregation: no aggregation, selective aggregation, or exhaustive aggregation. In some cases, the volume of data in the fact table will be small enough that performance is acceptable without aggregates. In a typical database the data volumes will be large enough that this will not be the case.
The opposite extreme is exhaustive aggregation. This approach will produce optimal query results because a query can read the minimum number of rows required to return an answer. However, this approach is not normally practical due to the processing required to produce all possible aggregates and the storage required to store them.
In a simple sales example where the dimensions are product, sales geography, customer, and time, the number of possible aggregates is the number of levels in each hierarchy of each dimension multiplied together. Figure 1 depicts sample hierarchies in each dimension and the total number of aggregates possible.

Figure 1: Number of possible aggregates. Each dimension has several levels of summarization possible. To determine the number of aggregates, simply multiple the number of levels in each of the dimension hierarchies. This will show the total number of aggregates possible.
Creating a large number of aggregates will take a lot of processing time, even on a large system. Aggregates are created after new fact data has been verified and loaded. Given the loading time and the time to perform database backups, there is a small window left in the batch cycle to create aggregates. This time window is a restriction to how many aggregates may be created.
Given the above constraints and the huge number of rows to store for every possible aggregate, it is apparent that an exhaustive approach is not generally feasible. This leaves selective aggregation as the middle ground. The difficult question becomes "Which aggregates should I create?"
Choosing Aggregates to Create
Usage and Analysis Patterns
There are two basic pieces of information which are required to select the appropriate aggregates. Probably the most important item is the expected usage patterns of the data. This information is usually known after the requirements phase of a decision support project. One of the areas which this requirements analysis normally focuses on is the decision making processes of individual users. Based on this information it is possible to determine that they often look for anomalies in their data by focusing at a certain level, and then looking for information at lower or higher levels based on what they find. The most frequently examined levels will be good candidates for aggregation.
As an example, someone looking at the profitability of car insurance policies may start by examining the profitability of all policies broken out by geographic regions. From there they may note that a certain region has a higher profitability and start looking for the contributing factors by drilling down to a district level, or looking at the policies by policy or coverage types. If this pattern of analysis is common, then aggregates by region and policy type will be most useful.
Base Table Row Reduction
The second piece of information to consider is the data volumes and distributions in the fact table. This information is often not available until the initial loading of data into the database is complete and it will likely change over time. After loading the data, it is a good idea to run some queries to get an idea of the number of rows at various levels in the dimension hierarchies. This will tell you where there are significant decreases in the volume of data along a given hierarchy. Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.
The decrease of rows in a dimension hierarchy is not a hard rule due to the distribution of data along multiple dimensions. When you combine the fact rows to create an aggregate at a higher level, the reduction in the number of rows may not be as much as was expected. This is due to the sparsity of the fact data: as you look at the data values for a given dimension you will notice that certain values do not exist at the detail level, but the combination of all the dimensions will have a row at a higher level.
A simple example of this is in high volume retail sales. A single store may carry 70,000 unique products, but in a given day a typical store will sell only ten percent of those products. In a single week the store may sell 15,000 unique products. If we calculate the number of rows in the fact table for a chain with 100 stores where every store sells 7000 products a day, 365 days a year, we will have 255,500,000 rows.
If we create an aggregate of product sales by store by week we would intuitively expect that the number of rows in the aggregate table would be reduced by seven since we have summarized seven daily records into a single weekly record for a given product. This will not be the case due to the sparsity of data. Since each store will move 15,000 products in a week, the number of rows in the aggregate will not be 36,500,000; the number of rows will be 78,000,000, or double what we were expecting!
Since we are trying to reduce the number of rows a query must process, one of the key procedures is finding aggregates where the intersection of dimensions has a significant decrease in the number of rows. Figure 2 shows the row counts for all possible aggregates of product by store by day using one year of data for a 200 store retail grocer.

Figure 2: Row counts for possible aggregates. The base level of detail is product by store by day shown in the upper left corner of the chart. The highest level summary is total sales corporate-wide in the lower right, containing a single row for each day.
Looking at this chart, it is apparent that creating aggregates at some of the highest levels will provide minimal performance improvement. Depending on the frequency of usage, there are several likely candidates. Any of the subcategory level aggregates provide a significant reduction in volume and would be good starting points for exploration. The brand by district aggregate provides a very significant drop over the detail data, and will probably be small enough that all higher level product and geography queries may be satisfied by this aggregate.
One thing to keep in mind is that it is appealing to decide based on what you can see in the chart, but there are still tens of millions of rows in some of the lower level aggregates. Knowing how fast your database and hardware combination can move data for a query is still important and will help you determine where it is practical to stop aggregating.
Aggregate Storage
Once you have made an initial decision about which aggregates to create you have to answer the next question: how to create and store those aggregates. There are two parts to this question. The first is how to store the aggregated data. The second is how to create and update the aggregated data.
Storing aggregates can be complicated by the columns available in the base fact table. Some data may be invalid at a higher aggregate level, or it may not be possible to summarize the data in a column. For example, it is not possible to aggregate automobile insurance claims information by vehicle type and preserve information from the claim such as the gender of the policyholder. This will be true of most semi-additive fact data and all non-additive fact data.
Since semi-additive and non-additive data is only valid at the detail level, we will very likely have fewer columns in an aggregate table than we have in a fact table. This is a common issue for businesses such as insurance, catalog sales, subscription services, and health care. There are some ways to preserve a portion of the information. In the example above, two "count" columns could be added with the number of male and female claimants stored in each.
Another item which appears regularly in aggregate table design is the required precision of columns storing counts or monetary values. For data at a low level of detail, the values stored in a column may never exceed five digits. If the data is summarized for an entire week the column may require seven or more digits. This must be taken into account when creating the physical table to store an aggregate.
Storing Aggregate Rows
There are three basic options for storing the aggregated data which are diagrammed in figure 3. You can create a combined fact and aggregate table which will contain both the base level fact rows and the aggregate rows. You can create a single aggregate table which holds all aggregate data for a single fact table. Lastly, you can create a separate table for each aggregate created.
Figure 3: Three possibilities for storing aggregates. Aggregates may be stored in the same table with the base level fact rows, they may be stored in a separate aggregate-only table, or they may be stored in individual aggregate tables.
I normally recommend creating a separate table for each aggregate. The combined fact and aggregate table approach is appealing, but it usually results in a very large and unmanageable table. The single aggregate table is almost as unmanageable. Both approaches suffer from contention problems during query and update, issues with data storage for columns which are not valid at higher levels of aggregation, and the possibility of incorrectly summarizing data in a query.
The contention problem with a single table for detail and aggregates is straightforward: the same table is read from and written to in order to create or update the aggregate rows. Given the large batch nature of aggregate creation and update, contention during the batch cycle may be considerable. Query contention due to all end-user queries hitting the same table will be an issue, as will indexing in such a way that aggregate rows will be efficiently retrieved. The same drawbacks apply to a single separate table for all aggregate rows.
Using a separate table for each aggregate avoids these problems and has the advantages of allowing independent creation and removal of aggregates, simplified keying of the aggregates, and easier management of performance issues (for example, spreading I/O load by rearranging tables on disks, or allowing multiple aggregates to be updated concurrently).
The most difficult issue to resolve is the complication of end-user query access. The complication results from the introduction of a number of possible tables from which the data may be queried.
This design approach introduces a new factor into the selection of end user query tools, particularly ad-hoc query tools: they should be "aggregate aware". Products which are not "aggregate aware" will present users with all the fact and aggregate tables and it will be up to the user to select the appropriate table for their query. This is not practical with more than a few aggregates.
The problem is worse for the custom application designer because queries are embedded in the program. If an aggregate is added or removed, the program must be manually changed to query from the appropriate table. With programmatic interfaces, these issues can be managed by designing the applications to dynamically generate queries against the appropriate table. For packaged query tools the issue is somewhat more complex.
There are products available which will act as intelligent middleware. This provides a single logical view of the schema and hides the aggregates from the user or developer. They operate by examining the query and re-writing it so that it uses the appropriate aggregate table rather than the base fact table. Examples of companies providing this type of software are MicroStrategies, Information Advantage, and Stanford Technologies (now owned by Informix). The logical place for this type of query optimization is in the database itself, but no commercial RDBMS vendor has provided extensions to their products to handle this issue.
In spite of this limitation, I prefer this design for aggregate storage due to the advantages over the other methods. For the remainder of this article I will assume that each aggregate is stored in an individual table.
Storing Aggregate Dimension Rows
A big issue encountered when storing aggregates is how the dimensions will be managed. Normally the dimensions contain one row for each discrete item. For example, a product dimension has a single row for each product manufactured by the company. The question arises, "how do you store information about hierarchies so the fact and aggregate tables are appropriately keyed and queried?"
No matter how the dimensions and aggregates are handled, the aggregate rows will require generated keys. This is because the levels in a dimension hierarchy are not actually elements of the dimension. They are constructs above the detail level within the dimension. This is easily seen if we look at the company geography dimension described in the example for figure 2.
The granularity of the fact table is product by store by day. This means the base level in the geography dimension is the store level. All fact rows will have as part of their key the store key from a row in this dimension. The hierarchy in the dimension is store ® district ® region ® all stores. There is no row available in the dimension table describing a district or region. We must create these rows and derive keys for them. The keys can't duplicate any of the base level keys already present in the dimension table.
This can be done in several ways. The preferred method is to store all of the aggregate dimension records together in a single table. This makes it simple to view dimension records when looking for particular information prior to querying from the fact or aggregate tables.
There is one issue with the column values if all the rows are stored in a single table like this. When adding an aggregate dimension record there will be some columns for which no values apply. For example, a district level row in the geography dimension will not have a store number. This is shown in figure 4.

Figure 4: Storing aggregate dimension rows. Each level above the base store level has keys in a distinct range to avoid conflicts, and all column values are empty for those columns which do not apply at the given level.
When you wish to create a pick list of values for a level in the hierarchy you can issue a SELECT DISTINCT on the column for that level. An alternative to this method is to include a level column which contains a single value for each level in the hierarchy. Then queries for a set of values for a particular level need only select where the level column is the level required.
Other methods for storing the aggregate dimension rows include using a separate table for each level in the dimension, normalizing the dimension, or using one table for the base dimension rows and a separate table for the hierarchy information. The disadvantage of all of these methods is that the dimension is stored in multiple tables, which further complicates the query process.
The first method is conceptually clean because each fact table has a set of dimension tables which is associated only with that table, so all data is available at the same grain. The problem comes when the user is viewing dimension data at one level, and then wants to drill up or down along a hierarchy. Browsing through values in the dimension is extremely complicated. In addition, there are now many more tables and table relationships to maintain. This runs counter to the goal of the dimensional model, which is to simplify access to the data.
Normalizing the dimension is another way to store the hierarchy information. Rather than store values in dimension columns for the different levels of a hierarchy and issue a SELECT DISTINCT on the appropriate column, a key to another table is stored. This table contains just the values for that column. This is not much different from storing the values in the dimension table, and it complicates queries by adding more tables. Again, this runs counter to the goal of simplifying access for both the user and the query optimizer in the database.
Using a single table for the base level dimension rows and a separate table for all aggregate dimension rows has the disadvantage of adding another table. It has an advantage which may make this approach better than using a single dimension table for all rows. If creating non-duplicate key values for the base level dimension rows and the aggregate rows is difficult, storing the aggregate rows in a separate table will make this problem simpler to resolve. The aggregate dimension rows can use a simpler key structure since they are no longer under the column constraints imposed by the base level dimension.
Another topic worth mentioning in the storing of aggregate dimensions is multiple hierarchies in a single dimension. This shows up frequently when initially designing the dimensions, and it has an impact on the aggregates. When a dimension has multiple hierarchies, this implies that the number of possible aggregates will be multiplied by the number of levels in the extra hierarchies. When looking at the number of aggregates you must remember to take into account each hierarchy which exists in a dimension. Multiple hierarchies may create further problems at higher summary levels because values at a low level may be double counted at a higher level.
Places where you will frequently find multiple hierarchies are in customer dimensions, product dimensions, and the time dimension. Products may have several hierarchies depending on whether you are viewing them from a manufacturing, warehousing, or sales perspective. Customer dimensions will sometimes have hierarchies for physical geography, demographic geography, and organizational geography. You might see two hierarchies in the time dimension: one for the calendar year and one for the fiscal year.
Once the method for storing aggregates and their dimension values is chosen, the next step is to create the aggregates. The optimal approach depends on the volume of data, number of aggregates, and parallel capabilities of your database and hardware. Since there is no single best method, I will offer some guidelines on approaches.
Aggregate Creation
There are a number of factors which will help to define the approach. The first is the size of the fact table. It is not uncommon for a fact table to contain hundreds of millions to more than a billion rows of detail data, and to exceed 75 gigabytes in size. This volume of data will limit approaches which require frequent recalculation of the aggregates, or which require multiple scans through the fact table.
The number of aggregates which must be created is a constraint. A typical fact table may have more than fifty aggregate tables in a production system. The number will depend on the fact table size, number of dimensions, and query performance. As the number of aggregates grows, the processing window required for the batch update cycle will increase, possibly spilling over into the online usage period.
Another constraint is the parallel capabilities of the database and computing platform. With the typical volumes of data it is unlikely that a simplistic approach using a single threaded program and no database parallelism will complete in a reasonable amount of time. For very large databases and high numbers of concurrent users, high end symmetric multiprocessing (SMP) or massively parallel (MPP) platforms will be required.
Parallel database performance improvements are impressive, but the usefulness of the technology may be limited. Depending on the database, only certain SQL operations may be parallelized. Most commercial databases have limitations of this type. This can be a serious issue when building an aggregate table.
If you are creating a very large table and the database can't parallelize the INSERT statement then you might be faced with a bottleneck that prevents you from using a simple SQL statement to create the table. In addition, there may be constraints on the query portion of the statement such that only certain types of queries will execute in parallel. This can effective cripple the statement by turning it into a single-threaded access to millions of rows of data.
If the critical path of nightly batch processing can't fully utilize the hardware, parallelism may help alleviate single-stream bottlenecks by allowing certain processes to use more resources and complete sooner. If the aggregate processing has already been parallelized by partitioning the work into multiple application processes then there may be less benefit.
Due the brute-force nature of many parallel implementations, databases have the ability to use all available resources on a server for a single SQL statement. This resource utilization often constrains use of parallel operations to a limited scope. If you try to run more than a handful of operations without constraining them in some way, they will introduce serious contention issues.
Recreating Versus Updating Aggregates
One of the major design choices for the aggregation programs is whether to drop and recreate the aggregate tables during the batch cycle, or update the tables in place. The time to completely regenerate an entire aggregate table is a prime consideration. Some aggregates may be too large or require summarizing of too much data for the regeneration approach to work effectively. Alternatively, regeneration may be more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table.
Time period-to-date aggregates create their own special set of problems when making the recreate versus update decision. When updating the aggregate, new data which is not yet present in the aggregate table will likely require insertion. This implies a two pass approach in the aggregate program design, where the first pass scans aggregate data to see what should be inserted and what is already present. The second pass updates the existing data, but not the newly inserted data from the first pass.
Updates to the rows can cause database update and query performance issues if the tables are not tuned properly. Updating column values may create problems with internal space allocation if numeric values are stored with a variable length encoding scheme. If a dollar value increases from one to eight digits during several updates to rows in a month-to-date sales table, the database must reallocate for those rows (Oracle refers to this as row-chaining). This will lead to slower update and query performance over time, eventually requiring a table reorganization.
Given the data volumes in a typical decision support database, it will probably be most efficient for aggregation programs to update the aggregate tables with the newly loaded data, rather than dropping and recreating all aggregates. The tradeoff with this approach is the programming complexity. Creating an aggregate table may be as simple a single SELECT statement. Updating the same table may require several passes through the new data and the existing aggregate table.
Single Threaded Versus Multi-threaded Creation
The program to create aggregates can be written to build or update the tables in single threaded fashion (one at a time). If the number of aggregates to generate is limited, data volumes are not very large, there is little concurrent activity on the system, or the processing window is large then generating aggregates in single stream fashion may be practical.
This approach requires less development effort because there is no coordination among processes and there will be few dependencies due to the serial nature of processing. Also, if the queries to create the aggregate can take advantage of database parallelism then all resources on the server may be dedicated to a single process. This will allow the operation to complete in a fraction of the time.
One limitation with this approach is the queries which do not take advantage of parallelism.. If the processing window is not sufficiently large, these processes will become bottlenecks due to the serial nature of the design. Another limitation is the inflexibility of a single batch stream. If dependencies are created or new long-running processes are added, the design may require changes.
If there are many aggregates, multiple period-to-date aggregates, or large volumes of data then a design which allows multiple processes to execute simultaneously will probably be required.
A multi-threaded design must take into account the impact of simultaneous access to data stored in the same table, and the impact of writing aggregated data into the aggregate tables. This will require more detailed knowledge of the data to determine when to schedule the programs. Running too many concurrent programs could create a bottleneck in CPU, memory, or I/O resources on the platform, or cause contention issues in the logging mechanism of the database.
This approach requires more effort to design, and will require more development effort. This is due to the addition of dependencies and constraints on processes and the added monitoring and scheduling required. There will also be fewer opportunities to use the brute-force approach that database parallelism allows with single large operations.
Using a "Cascading" Model to Create Aggregates
A final note on aggregate creation is for aggregates which are one or more levels removed from the base level data. It may be possible to use aggregates stored at lower levels to generate the higher level aggregates, as shown in Figure 5. By using the lower level tables, the program will perform less work to generate successively higher level aggregates. This approach may be taken with either the single-stream or multi-threaded designs.
Figure 5: Tables in a cascading design. The base table is at the grain of product by salesperson by customer by day. The aggregation program creates the first aggregate, in this case summarizing the data to the level of product by salesperson by day. The next aggregate is product by district by month, and is built from the previous aggregate table.
When designing a cascading model like this there are several issues which should be taken into consideration. The addition of dependencies on intermediate processing may result in missing high-level aggregates due to a failure during creation of a lower level table. If there are numerous high-level aggregates then a lower level failure will result in many missing aggregates. If the problem can't be fixed before the users log on to the system, they will suffer from seriously degraded performance for high-level queries.
Error propagation can be an issue with cascading creation. If an error is encountered during aggregation at a lower level, the error will be propagated throughout all higher level aggregates. Correction of the error will require the recalculation of all data at all levels above the level where the error first occurred.
When choosing whether to use a cascading approach, a consideration should be what the system management and software maintenance impact will be versus the available processing window. If there is sufficient time available to process aggregates from the base data, the cascading approach should not be taken because it will be more complex. Strong change management practices and coordinated development are required to avoid spending excessive time solving operational problems when something changes or goes wrong.
Aggregate Maintenance
Once the system is in production a new set of problems will introduce themselves. After the initial rollout of the system some users will experience very long running queries or reports. This may result in the need for some new aggregates to resolve the performance issue. It will also be the case that certain aggregates are rarely used by any query. These are good candidates for removal since they don't do anything other than take up space in the database.
An interesting pattern that I have observed with decision support databases is that over a period of time the usage of the data will change. Users will become more educated about the available information and the questions they ask will evolve. Previously useful aggregates will be used less, and new aggregates will be required to meet the current performance expectations.
This implies continued maintenance of the aggregation programs. If an aggregate is no longer useful and is removed from the system, all associated programs must be updated. If you have specific metadata about aggregates, this must be updated. If you are using a cascading model for creating aggregates, the addition of an aggregate may provide a more efficient base for existing higher level aggregates and suggest changes in how they are generated.
Aggregate maintenance is a mostly manual process and requires monitoring the usage of aggregates by the users. There are some end user access products on the market which include a query monitor component which will collect statistics on the usage of aggregates and on the queries which are executed. In many cases, you must build a monitoring component as part of your decision support system.
This type of data collection is very useful in determining when to add or remove aggregates. Some users will notify the DBA when there is a performance issue with a query. Many will assume there is a problem with their PC, the database, or that this is normal. Without statistics or their feedback there is no way to know if the system is performing adequately.
The most basic information which should be collected in order to monitor aggregate usage is the number of queries against the fact table and the number of queries against each of the aggregate tables. These two data points will indicate which tables are or are not being used. If there are frequent queries against low level aggregates or the fact table, this is an indication that another aggregate may be in order.
Beyond this basic table level information, the following items are very useful if they can be captured:
- column value histograms on the constraint columns in the fact and aggregate tables; this is useful in determining the selectivity of various possible indexes and will influence your indexing strategy.
- histograms for the combination of values at each level in the fact table; this information will help estimate the row counts in any aggregate you might wish to create.
- query parse counts; this is useful only if a fixed query front-end application is used to access the data. This will tell exactly how many times a given set of information was queried. The total count gives an idea of how much the system is being used.
- query duration; this is a simple measure to determine whether there are any excessively long queries which should be examined in further detail to see if they are the result of excessive I/O or poorly written SQL.
- query resource utilization; queries may be highly complex, and therefore slow. There is nothing to do with this type of query except try to rewrite it. If a query has a very large amount of logical I/O but few rows returned then it is reading a large amount of data, and might benefit from an aggregate.
- number of rows retrieved; this is very important because some queries simply return lots of data, and this is the reason for the slow response. If a query returns 10,000 rows, an aggregate is not going to reduce the number of rows returned.
- level of data requested; knowing the levels of data requested along the dimension hierarchies will tell if queries are accessing the correct aggregate level. It may be that they are querying a lower level aggregate due to the absence of an aggregate at the level required.
Conclusion
Some of the techniques mentioned in this article, mainly in the monitoring and analysis space, have been adopted by database and query tool vendors. One rapidly growing product area is the "aggregate aware" query tool arena, which includes vendors like MicroStrategy, Business Objects, and others. Oracle has also added aggregate awareness to its database engine. Aggregate aware tools have the ability to process queries issued against a base-level dimensional schema and select the appropriate aggregate to satisfy the query. Some of the products include a component which can monitor queries and indicate potential candidates for aggregation. One drawback to many of these products is that they are not "open" - many require that if you use their aggregate middleware, you use their aggregation tool.
For the most part, when deciding how to create and maintain aggregates in the database, it will be up to the implementor to determine the optimal approach for their set of constraints. This article only touches the surface of the issues around performance in a large decision support database. I advise people starting complex projects like these to seek professional consulting help from companies with experience in the end-to-end implementation of similar projects, and with a proven record of successful references.
Mark Madsen is a specialist in the data warehouse business with a focus on the analysis and design of large-scale systems. You can reach Mark at mmadsen@clickstreamconsulting.com.
|
Book Endorsements
Book Table of Contents Book Authors Referenced Material
Related Articles Related Links Download a Project Plan Discussion Forum
Links
© Copyright 2001, 2002 Clickstream Consulting, All Rights Reserved
|
|