Administration Guide

Automatic Summary Tables

Summary tables are a powerful way to improve query response time. In many environments where some of the basic query structures can be anticipated summary tables can be used to:

Aggregate data over one or more dimensions
Join and aggregate data over a group of tables
Identify a commonly accessed subset of data (that is, a "hot" horizontal or vertical partition)
Repartition a table, or part of a table, in a partitioned database environment

Knowledge of summary tables is integrated into the SQL Compiler. Within the SQL Compiler, Query Rewrite (see Rewrite Query by the SQL Compiler) and the Optimizer (see Data Access Concepts and Optimization) are involved in matching queries with summary tables in and determining whether to substitute a summary table for a query over base tables. Whenever summary tables are used to answer queries the EXPLAIN facilities (see Chapter 26, SQL Explain Facility) can be used to determine which summary table was selected. Since summary tables behave like regular tables in many ways, the same considerations for optimizing data access using tablespace definitions, creating indexes, and issuing RUNSTATS apply to summary tables.

To help you understand the power of summary tables we provide the following example of a multidimensional analysis query and show how it takes advantage of summary tables.

In this example, we assume a scenario where a warehouse contains a set of customers and a set of credit card accounts. The warehouse records the set of transactions that are made with the credit cards. Each transaction contains a set of items that are purchased together. We can categorize this environment as a multi-star because two tables, the one containing transaction items and the other identifying the purchase transactions, are large and together are the hub of the star.

There are three hierarchical dimensions that describe a transaction: product, location, and time. The product hierarchy is recorded in two normalized tables representing the product group and the product line. The location hierarchy contains city, state, and country information and is represented in a single de-normalized table. The time hierarchy contains day, month, and year information and is encoded in a single date field. The date dimensions are extracted from the date field of the transaction using built-in functions. There are also other tables in this scenario that represent account information for customers and customer information.

A summary table is created with the sum and count of sales for each level of:

Product hierarchy
Location hierarchy
Time hierarchy, composed of year, month, day.

A wide range of queries can pick up their answers from this stored aggregate data. The following example computes sum and count of sales along the product group and line dimensions; along the city, state, and country dimension; and along the time dimension. It also includes several other columns in its GROUP BY clause.

  CREATE TABLE dba.PG_SALESSUM
    AS (
        SELECT l.id AS prodline, pg.id AS pgroup,
               loc.country, loc.state, loc.city, 
               l.name AS linename, pg.name AS pgname,
               YEAR(pdate) AS year, MONTH(pdate) AS month,
               t.status,
               SUM(ti.amount) AS amount,
               COUNT(*) AS count
        FROM   cube.transitem AS ti, cube.trans AS t,
               cube.loc AS loc, cube.pgroup AS pg,
               cube.prodline AS l
        WHERE  ti.transid = t.id
           AND ti.pgid = pg.id
           AND pg.lineid = l.id
           AND t.locid = loc.id
           AND YEAR(pdate) > 1990
        GROUP BY l.id, pg.id, loc.country, loc.state, loc.city,
                 year(pdate), month(pdate), t.status, l.name, pg.name
       )
  DATA INITIALLY DEFERRED REFRESH DEFERRED;
 
  REFRESH TABLE dba.SALESCUBE;

The summary table is typically much smaller than the base fact tables. You can control when the summary table is refreshed by specifying the DEFERRED option (as shown in our example).

Queries that can take advantage of such pre-computed sums would include:

Sales by month and product group
Total sales for years after 1990
Sales for 1995 or 1996
Sum of sales for a product group or product line
Sum of sales for a specific product group or product line AND for 1995, 1996
Sum of sales for a specific country.

While the precise answer is not included in the summary table for any of these queries, the cost of computing the answer using the summary table could be significantly less than using a large base table, because a portion of the answer is already computed. Expensive joins, sorts, and aggregation of base data is avoided or reduced through summary tables.

The following are sample queries that would obtain significant performance improvements because they are able to use the results in the summary table that are already computed. The first example returns the total sales for 1995 and 1996:

  SET CURRENT REFRESH AGE=ANY
 
  SELECT YEAR(pdate) AS year, SUM(ti.amount) AS amount
  FROM   cube.transitem AS ti, cube.trans AS t,
         cube.loc AS loc, cube.pgroup AS pg,
         cube.prodline AS l
  WHERE  ti.transid = t.id
     AND ti.pgid = pg.id
     AND pg.lineid = l.id
     AND t.locid = loc.id
     AND YEAR(pdate) IN (1995, 1996)
  GROUP BY year(pdate);

The second example returns the total sales by product group for 1995 and 1996:

  SET CURRENT REFRESH AGE=ANY
 
  SELECT pg.id AS "PRODUCT GROUP",
         SUM(ti.amount) AS amount
  FROM   cube.transitem AS ti, cube.trans AS t,
         cube.loc AS loc, cube.pgroup AS pg,
         cube.prodline AS l
  WHERE  ti.transid = t.id
     AND ti.pgid = pg.id
     AND pg.lineid = l.id
     AND t.locid = loc.id
     AND YEAR(pdate) IN (1995, 1996)
  GROUP BY pg.id;

Larger improvements in response time for such queries can be achieved with larger databases. This happens because the summary table grows slower than the growth of the base table. One advantage of summary tables is that DB2 Universal Database uses them to effectively eliminate overlapping work among queries by doing the computation once when building the summary tables and reusing their content for a very large number of queries.

[ Top of Page | Previous Page | Next Page ]