Administration Guide

Accounting for Column Correlation

You may have applications which contain queries constructed with joins that have more than one join predicate joining two tables. While this may sound complicated, such a situation is not unusual where you are attempting to determine relationships between similar, related columns between tables.

For example, a manufacturer makes products from raw material of various colors, elasticities and qualities. The finished product has the same color and elasticity as the raw material from which it is made. The manufacturer issues the query:

     SELECT PRODUCT.NAME, RAWMATERIAL.QUALITY FROM PRODUCT, RAWMATERIAL
        WHERE PRODUCT.COLOR       =  RAWMATERIAL.COLOR
          AND PRODUCT.ELASTICITY  =  RAWMATERIAL.ELASTICITY

This query returns the names and raw material quality of all products. There are two join predicates:

   PRODUCT.COLOR       =  RAWMATERIAL.COLOR
   PRODUCT.ELASTICITY  =  RAWMATERIAL.ELASTICITY

When the DB2 UDB optimizer chooses a plan for executing this query, it calculates how selective each of the two predicates are, and assumes that they are independent, that is, that all variations of elasticity occur for each color, and that conversely for each level of elasticity there is raw material of every color. It then uses statistics on how many levels of elasticity and how many different colors there are in each table to calculate the overall selectivity of the pair of predicates. Based on this it may choose, for example, a Nested Loop Join in preference to a Merge Join, or vice versa.

However, it may be that these two predicates are not independent. For example, it may be that the highly elastic materials are available in only a few colors, and the very inelastic materials are only available in a few other colors (different from the elastic ones). Then the combined selectivity of the predicates is less (eliminates fewer rows) so the query will return more rows. To see this, imagine the extreme case where there is just one level of elasticity for each color and vice versa. Now either one of the predicates logically could be omitted entirely since it is implied by the other. The optimizer's choice of plan may no longer be the best, for example it may be that the Nested Loop join plan is selected but the Merge Join would be faster.

With other database products, database administrators have tried to solve this performance problem by updating statistics in the catalog to try to make one of the predicates appear to be less selective, but this approach can cause unwanted side-effects on other queries.

DB2 UDB's optimizer attempts to detect and compensate for correlation of join predicates if you:

  1. Define unique indexes on the correlated columns, that is, on the columns of a table which appear in the correlated predicates.
  2. Do not set the registry variable DB2_CORRELATED_PREDICATES to "NO".

In the above example, you could define a unique index covering either:

   PRODUCT.COLOR, PRODUCT.ELASTICITY

or

   RAWMATERIAL.COLOR, RAWMATERIAL.ELASTICITY

or both.

In order for correlation to be detected, the non-include columns of this index must be correlated columns, and no other columns. The index may optionally contain include columns.

In general there may be more than 2 correlated columns in join predicates so you should ensure that you define the unique index to cover all of them.

In many cases the correlated columns in one table form its primary key. A primary key is always unique so if there's a primary key on the correlated columns, there's no need to define another unique index.

After doing this, ensure that statistics on tables are up to date and that they have not been altered away from the true values for any reason, for example to attempt to influence the optimizer.

The optimizer will use the FIRSTnKEYCARD and FULLKEYCARD information of the unique index statistics to detect cases of correlation, and dynamically adjust combined selectivities of the correlated predicates, thus obtaining a more accurate estimate of the join size and cost.

In addition to JOIN predicate correlation, the optimizer also accounts for correlation with simple equal predicates of the type COL = "constant". For example, consider a table of different types of cars, each having a MAKE (that is, a manufacturer), MODEL, STYLE (that is, sedan, station wagon, sports utility vehicle), YEAR, and COLOR. Predicates on COLOR are likely to be independent of those on MAKE, MODEL, STYLE, or YEAR, since almost every manufacturer makes the same standard colors available on each of their models and styles, year after year. However, the predicates MAKE and MODEL certainly are not independent since only a single car maker would make a model with a particular name. Identical model names used by two or more car makers is very unlikely and certainly not wanted by the car makers. If an index exists on the two columns MAKE and MODEL, the optimizer will use the statistics from the index to determine the combined number of distinct values and adjust the selectivity or cardinality estimation for correlation between the two columns. For such predicates which are not join predicates, one need not have a unique index for the optimizer to make the adjustment.


[ Top of Page | Previous Page | Next Page ]