DB2 Server for VSE & VM: Performance Tuning Handbook

Tuning Queries with Several Tables

The process of combining rows of one table with rows of another is called a join. It is often possible to write a query against two or more tables either as a join or as one or more nested SELECT clauses. The first method is usually more efficient, as this gives the optimizer more choices during access path selection.

The following query retrieves data about all designers in departments that are responsible for projects that are part of a major project MA2100. It will be used here to illustrate different access methods in detail.

   FROM   EMPLOYEE E, PROJECT P
   WHERE  E.JOB = 'DESIGNER'
     AND  E.WORKDEPT = P.DEPTNO
     AND  P.MAJPROJ = 'MA2100'

In this example, the following assumptions are made:

Each table is in its own dbspace
Table EMPLOYEE
- Has 10 000 rows on 500 pages
- Has an index on EMPNO with 25 pages
- Has an index on WORKDEPT with 10 pages
Table PROJECT
- Has 3000 rows on 60 pages
- Has an index on PROJNO with 8 pages
- Has an index on RESPEMP with 8 pages
COLCOUNT for MAJPROJ is 100.
COLCOUNT for WORKDEPT is 1000; for JOB, 50

Methods of Joining Two or More Tables

To join two tables in a single query, the database manager chooses the less costly of a nested loop join and a merge scan join. The two methods are described below.

Nested Loop Join (Type 1)

In nested loop joins, the rows of one table (the "outer" table) are retrieved one by one. Sargable and residual predicates are applied to eliminate unqualified rows. For each qualified row of the outer table, the database manager opens a cursor on the second table (the "inner" table), and retrieves all rows that satisfy both the join predicate connecting the two tables and any local predicates on the inner table.

Either table can be scanned by a dbspace or index scan. The outer table is scanned once, while the inner table is scanned as many times as the number of qualifying rows in the outer table. Hence, the nested loop join is most efficient when the inner table has an efficient access path, and when only a few rows of the outer table remain after applying predicates to it. For a nested loop join:

  Join cost = cost of outer table scan
            + ((Estimated number of qualifying records in outer table)  x
               (cost of inner table scan))

If the inner table is small enough to fit into its share of the buffer pool, the database manager anticipates that the entire inner table will remain in buffers throughout the operation. On this assumption, the I/O cost in the second term of the join cost is estimated as no more than the cost of scanning the inner table once. The nested loop join is illustrated in Figure Figure 15.

Figure 15. Nested Loop Join

Merge Scan Join (Type 2)

For this method, there must be one or more predicates of the form TABLE1.COL1 = TABLE2.COL2, where the two columns have the same data type and length attribute. One of the predicates is chosen as the merge join predicate. The approach is to scan both tables in the order of the merge join columns, and to merge the result together whenever matching rows are found.

If the outer table has no efficient index on the join columns, an intermediate table is built by sorting the outer table on the join columns, applying any local predicates, and eliminating unused columns. The inner table is handled similarly. The database manager then reads the first row of both ordered tables (applying any predicates that remain). If the merge join predicate matches, the database manager returns the combined result. It then reads the next row of the inner table, which might match the same row of the outer table, and continues to read rows from the inner table and return the results until the merge join predicate fails to match. When there is no longer a match, the database manager reads the next outer table row. If that row has the same join predicate value, the database manager goes back and reads the matching group of records from the inner table again. If the outer row has a new join predicate value, the database manager searches ahead in the inner table until it finds either:

A matching inner row, in which case the matching process is repeated.
An inner row with a higher value than the join predicate. Then the database manager discards the unmatched row of the outer table and searches through the outer table until either a matching row or a row with a higher join predicate value is found. If a matching row is found, the matching process is again repeated. If a higher join predicate is found, the search moves back into the inner table.

Hence, for a merge scan join,

   Join cost = cost of outer table scan
             + cost of sorting and reading outer table (if needed)
             + cost of inner table scan
             + cost of sorting and reading inner table (if needed)

If an efficient index does exist on the join column, and this index has been used to retrieve the rows of either table, the rows of that table are already in sequence. In this case, no sort of the table is required.

Merge scan join is illustrated in Figure 16.

Figure 16. Merge Scan Join

Choosing an Access Method

When choosing an access method for the query shown on page Figure 15, which entails a simple join of only two tables, four access methods must be considered: the nested loop join and the merge scan join, each with both possible choices of the outer and inner table. The choice will be based on a comparison of their costs.

Nested Loop, with EMPLOYEE as Outer Table

With no index on JOB, a scan occurs on all 10000 (500 pages) rows of the employee table. For each row, it first evaluates the predicate JOB = 'DESIGNER'; an estimated 200 (1/50 x 10000) rows remain. For each of those 200 rows, a scan takes place on all 3000 rows (60 pages) of the project table to find rows with WORKDEPT = DEPTNO and MAJPROJ = 'MA2100'. The major costs are:

   Join cost = 10 000 row scan (500 page I/Os)
             + (200 x 60 page I/Os)               = 12 500 page I/Os

Nested Loop, with PROJECT as Outer Table

With no index on MAJPROJ, the database manager must scan all 3000 rows of the project table. For each row, it first evaluates the predicate MAJPROJ='MA2100'; an estimated 30 (1/100 x 3000) rows remain. For each of those 30 rows, the database manager must find all the rows in the EMPLOYEE table with WORKDEPT = DEPTNO and JOB = 'DESIGNER'. But instead of scanning the entire employee table, it can use the index on WORKDEPT each time. Each department has an average of 10 employees (10 000 rows / 1000 distinct departments), so the I/O cost is 1 index leaf page and 10 data pages for each value of WORKDEPT. The major costs are:

   Join cost = 3000 row scan (60 page I/Os)
             + (30 x 11 page I/Os)                 = 390 page I/Os

Merge Scan, with EMPLOYEE as Outer Table

Here the database manager reads the project table and applies the local predicate MAJPROJ = 'MA2100'. The remaining 30 rows are sorted and placed into a temporary table (a write and read of 1 page). Then, the database manager reads the employee table in department number order using the index on WORKDEPT (10 index pages + 500 data pages). For each row of the employee table, the database manager first evaluates the predicate JOB = 'DESIGNER'; if the row qualifies, it reads rows in the inner (temporary) table that match on department number. The costs are:

   Join cost = cost of outer table index scan        510 page I/Os
             + cost of accessing inner table          60 page I/Os
             + cost of sorting and reading inner       2 page I/Os
 
                                                   = 572 page I/Os

Merge Scan, with PROJECT as Outer Table

If the outer table is the project table, it cannot be accessed in department number order, and thus must be sorted. Before sorting, the database manager eliminates the rows that do not satisfy MAJPROJ = 'MA2100', leaving 30 rows. The estimated costs are:

   Join cost = cost of outer table scan               60 page I/Os
             + cost of sorting and reading outer       2 page I/Os
             + cost of accessing inner table         510 page I/Os
 
                                                   = 572 page I/Os

The nested loop scan with the project table as the outer table would be the best choice here. Only this access path makes use of the employee table index on WORKDEPT to avoid a scan of the entire table while not requiring a sort of the project table.

The two merge scan costs are shown as identical here because CPU costs and the presence of duplicates have been ignored. In reality, both would affect the actual cost calculated by the database manager.

Multiple Joins

Multiple joins are performed by logically joining two tables at a time, using either the nested loop or merge scan join method. This does not mean that each join necessarily produces an actual intermediate table; in a number of cases, no intermediate tables are required.

For example, a query joining tables T1, T2, and T3 may use a nested loop join to join T1 to T2, and a merge scan join to join the logical result (T1-T2) to T3. In this case, an intermediate or composite table might be created, and would appear in PLAN_TABLE if you examined the query with EXPLAIN. In addition, the composite table may require sorting to participate in the merge scanjoin. T3 may also require sorting (unless an index exists over the join columns). If the query has no ORDER BY clause, requiring sorting of the final result, then the T1-T2 composite table can be joined (using merge scan) with T3, returning each qualifying result row as it is found.

In short, whether composite tables are formed or not depends on a number of factors, including the SQL statement in question, the availability and type of indexes, and the catalog statistics. Any number of access plans can produce the correct answer, but the optimizer will choose the lowest estimated total cost solution.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]