Administration Guide

Federated Database Query Compiler Phases

This section describes additional query processing phases in a federated database system and provides recommendations for improving federated database query performance. The major topics are pushdown analysis and remote SQL generation and global optimization.

Pushdown Analysis

Pushdown analysis tells the DB2 optimizer whether an operation can be performed at a remote data source. An operation can be a function, such as a relational operator or a system or user function, or an SQL operator (GROUP BY, ORDER BY, and so on).

Functions that cannot be pushed down can significantly impact query performance. Consider the effect of forcing a selective predicate to be evaluated locally instead of at the data source. This approach could require DB2 to retrieve the entire table from the remote data source and then filter it locally against the predicate. If your network is constrained and the table is large, query performance could suffer.

Operators that are not pushed down can also significantly impact query performance. For example, having a GROUP BY operator aggregate remote data locally could, once again, require DB2 to retrieve the entire table from the remote data source.

As an example, assume that nickname N1 references the data source table EMPLOYEE in a DB2 for OS/390 data source. Further, assume that the table has 10,000 rows, one of the columns contains the last names of employees, and one of the columns contains salaries. Given the statement:

   SELECT LASTNAME, COUNT(*)
      FROM N1
      WHERE LASTNAME > 'B' AND SALARY > 50000
      GROUP BY LASTNAME;

several possibilities are considered: DB2 can send both the predicates and the GROUP BY operation to the data source and retrieve only the grouped results; it can send only the predicates and perform the grouping locally; or, in the worst case, it can retrieve all 10,000 rows and evaluate both the predicates and the grouping locally.

In general, the goal is to ensure that functions and operators can be considered for evaluation on data sources by the optimizer. Many factors can affect whether a function or an SQL operator is evaluated at a remote data source. The key factors are discussed in three groups: server characteristics, nickname characteristics, and query characteristics.

Server Characteristics Affecting Pushdown Opportunities

The following sections describe data source-specific factors that can affect pushdown opportunities. In general, these factors exist because DB2 lets you use a rich SQL dialect to submit queries. This dialect may offer more functionality than the SQL dialect supported by a server accessed during a DB2 query. DB2 can compensate for the lack of functionality at a data server, but doing so may require that the operation take place at DB2.

SQL Capabilities

Each data source supports a variation of the SQL dialect and different levels of functionality. For example, consider the GROUP BY list. Most data sources support the GROUP BY operator, but some have restrictions on the number of items in the GROUP BY list or on whether an expression is allowed in the GROUP BY list. If there is a restriction at the remote data source, DB2 might have to perform the GROUP BY operation locally.

SQL Restrictions

Each data source can have different SQL restrictions. For example, some data sources require parameter markers to bind in values to remote SQL statements. Therefore, parameter marker restrictions must be checked to ensure that each data source can support such a bind mechanism. If DB2 cannot determine a good method to bind in a value for a function, this function must be evaluated locally.

SQL Limits

DB2 might allow the use of larger integers than its remote data sources; however, limit-exceeding values cannot be embedded in statements sent to data sources. Therefore, the function or operator that operates on this constant must be evaluated locally.

Server Specifics

Several factors fall into this category. One example is how NULL values are sorted (highest, lowest, or dependent on the data source's ordering rules). For example, if NULL values are sorted at a data source differently than at DB2, ORDER BY operations on a nullable expression cannot be remotely evaluated.

Collating Sequence

Configuring a federated database to use the same collating sequence that a data source uses, and then setting the collating_sequence server option to 'Y', allows the optimizer to consider pushing down character range comparison predicates.
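For example, assuming the data source has already been registered under a server definition named ORASERV (a hypothetical name), the option can be set with an ALTER SERVER statement:

   ALTER SERVER ORASERV
      OPTIONS (SET COLLATING_SEQUENCE 'Y');

Use ADD instead of SET if the option has not been specified for this server before.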

When a query from a federated server requires sorting, the place where the sorting is processed depends on several factors. If the federated database's collating sequence is the same as that of the data source where the queried data is stored, the sort may take place at the data source. If collating sequences are the same, the optimizer can decide if a local sort or a sort at the data source is the most efficient way to complete the query. Likewise, if a query requires a comparison of character data, this comparison can also be performed at the data source.

Numeric comparisons, in general, can be done at either location even if the collating sequence is different. You may get unusual results, however, if the weighting of null characters differs between the federated database and the data source. Likewise, for comparison statements, be careful if you are submitting statements to a case-insensitive data source. The weights assigned to the characters "I" and "i" in a case-insensitive data source are the same. DB2, by default, is case sensitive and assigns different weights to these characters.

If the collating sequences of the federated database and the data source differ, DB2 retrieves the data to the federated database, so that it can do the sorting and comparison locally. The reason is that users expect to see the query results ordered according to the collating sequence defined for the federated server; by ordering the data locally, the federated server ensures that this expectation is fulfilled.

Retrieving data for local sorts and comparisons usually decreases performance. Therefore, consider configuring the federated database to use the same collating sequences that your data sources use. That way, performance might increase, because the federated server can allow sorts and comparisons to take place at data sources. For example, in DB2 UDB for OS/390, sorts defined by ORDER BY clauses are implemented by a collating sequence based on an EBCDIC code page. If you want to use the federated server to retrieve DB2 for OS/390 data sorted in accordance with ORDER BY clauses, it is advisable to configure the federated database so that it uses a predefined collating sequence based on the EBCDIC code page.

If the collating sequences at the federated database and the data source differ, and you need to see the data ordered in the data source's sequence, you can submit your query in pass-through mode, or define the query in a data source view.

See the Administration Guide: Planning for more information about collating sequences and how to set them; see Table 43 for more information about the collating_sequence server option.

Server Options

Several server options can affect pushdown opportunities. In particular, review your settings for collating_sequence, varchar_no_trailing_blanks, and pushdown. See "Server Options Affecting Federated Database Queries" for information on setting these options.

DB2 Type Mapping and Function Mapping Factors

The default local data type mappings provided by DB2 (see the Application Development Guide for data type tables) are designed so that sufficient buffer space is given to each data source data type (to avoid loss of data). A user can choose to customize the type mapping for a specific data source to suit specific applications. For example, if you are accessing an Oracle data source column with a DATE data type (which by default is mapped to the DB2 TIMESTAMP data type), you could change the local data type to the DB2 DATE data type.
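As a sketch of this customization, assume a nickname ORA_EMP has been created for the Oracle table and that its HIREDATE column received the default local type TIMESTAMP (the nickname and column names here are hypothetical). The local type could be changed as follows:

   ALTER NICKNAME ORA_EMP
      ALTER COLUMN HIREDATE
      LOCAL TYPE DATE;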

DB2 can compensate for functions not supported by a data source. There are three cases where function compensation will occur:

Nickname Characteristics Affecting Pushdown Opportunities

The following sections contain nickname-specific factors that can affect pushdown opportunities.

Local Data Type of a Nickname Column

Ensure that the local data type of a column does not prevent a predicate from being evaluated at the data source. As mentioned earlier, the default data type mappings are provided to avoid any possible overflow. However, a joining predicate between two columns of different lengths might not be considered at the data source whose joining column is shorter, depending on how DB2 binds in the longer column. This situation can affect the number of possibilities in a joining sequence evaluated by the DB2 optimizer. For example, Oracle data source columns created using the INTEGER or INT data type are given the type NUMBER(38). A nickname column for this Oracle data type is given the local data type FLOAT, because the range of a DB2 integer is from -2**31 to (2**31)-1, which is roughly equal to NUMBER(9). In this case, joins between a DB2 integer column and an Oracle integer column cannot take place at the DB2 data source (the shorter joining column); however, if the domain of this Oracle integer column can be accommodated by the DB2 INTEGER data type, change its local data type with the ALTER NICKNAME statement so that the join can take place at the DB2 data source.
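The change described above can be sketched as follows, assuming that a nickname ORA_EMP references the Oracle table and that the values in its EMPNO column all fit within the DB2 INTEGER range (the names are hypothetical):

   ALTER NICKNAME ORA_EMP
      ALTER COLUMN EMPNO
      LOCAL TYPE INTEGER;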

Column Options

The ALTER NICKNAME SQL statement can be used to add or change column options for nicknames.

One of these options is "varchar_no_trailing_blanks". It can be used to identify a column that contains no trailing blanks. The compiler pushdown analysis step will then take this information into account when checking all operations performed on columns so indicated. Based on this indication, DB2 may generate a different but equivalent form of a predicate to be used in the remote SQL statement sent to a data source. A user might see a different predicate being evaluated against the data source, but the net result should be equivalent.

Another column option is numeric_string. Use this option to indicate whether the values in that column are always numbers without trailing blanks.

See Table 50 for column option values and defaults.

Table 50. Column Options and Their Settings

numeric_string
   Indicates whether the values in this column are always numbers without trailing blanks:

   'Y'
      Yes, this column contains only strings of numeric data. IMPORTANT: If this column contains numeric strings followed by trailing blanks, it is inadvisable to specify 'Y'.
   'N'
      No, this column is not limited to strings of numeric data.

   The default setting is 'N'.

   By setting numeric_string to 'Y' for a column, you inform the optimizer that this column contains no blanks that could interfere with sorting of the column's data. This option is helpful when the collating sequence of a data source differs from that of DB2. Columns marked with this option are not excluded from local (data source) evaluation because of a different collating sequence.

varchar_no_trailing_blanks
   Indicates whether trailing blanks are absent from a specific VARCHAR column:

   'Y'
      Yes, trailing blanks are absent from this VARCHAR column.
   'N'
      No, trailing blanks are not absent from this VARCHAR column.

   The default setting is 'N'.

   The optimizer's strategy for accessing a data source VARCHAR column depends in part on whether the column contains trailing blanks. By default, the optimizer assumes that it does. On this assumption, it develops an access strategy that involves modifying queries so that the values returned from the column are the ones that the user expects. If, however, a VARCHAR column has no trailing blanks, and you let the optimizer know this, it can develop a more efficient access strategy. To tell the optimizer that a specific column has no trailing blanks, specify that column in the ALTER NICKNAME statement (for syntax, see the SQL Reference).
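For example, the varchar_no_trailing_blanks option could be added for the LASTNAME column of the nickname N1 used earlier in this section; SET would be used instead of ADD if the option had already been specified:

   ALTER NICKNAME N1
      ALTER COLUMN LASTNAME
      OPTIONS (ADD VARCHAR_NO_TRAILING_BLANKS 'Y');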

Query Characteristics Affecting Pushdown Opportunities

A query can reference an SQL operator that involves nicknames from multiple data sources. When DB2 must combine the results from two referenced data sources with one operator, such as a set operator (for example, UNION), the operation must take place at DB2; the operator cannot be evaluated at a remote data source directly.
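For example, assuming nicknames ORA_SALES and SYB_SALES reference tables at two different data sources (the names are hypothetical), each branch of the following query can still be evaluated at its own data source, but the UNION itself is performed at DB2:

   SELECT CUSTNO FROM ORA_SALES
   UNION
   SELECT CUSTNO FROM SYB_SALES;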

Analyzing and Understanding Pushdown Analysis Decisions

Rewriting SQL statements can provide additional pushdown opportunities for DB2 query processing. This section introduces tools for determining where a query is evaluated, lists common questions (and suggested areas to investigate) associated with query analysis, and closes with a brief section about data source upgrades.

Analyzing Where a Query is Evaluated

There are two utilities provided with DB2 that show where queries are evaluated:

Understanding Why a Query is Evaluated at a Data Source or at DB2

This section lists typical plan analysis questions and areas to investigate to increase pushdown opportunities. Key questions include:

Data Source Upgrades and Customization

Although the DB2 SQL compiler has much information about data source SQL support, this data may need adjustment over time because data sources can be upgraded or customized. In such cases, make the enhancements known to DB2 by changing the local catalog information. Use DB2 DDL statements (such as CREATE FUNCTION MAPPING and ALTER SERVER) to update the catalog. See the SQL Reference for more information.
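As a sketch, assume a data source registered under the server definition ORASERV has been upgraded with a function equivalent to a local user-defined function NOVA.STATS (all of these names are hypothetical). A function mapping would let the compiler consider evaluating the function remotely:

   CREATE FUNCTION MAPPING STATS_MAP
      FOR NOVA.STATS (DOUBLE, DOUBLE)
      SERVER ORASERV;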

Remote SQL Generation and Global Optimization

This phase helps produce a globally optimal access strategy to evaluate a query. For a federated database query, the access strategy may involve breaking down the original query into a set of remote query units and then combining the results.

Using the output of pushdown analysis as a recommendation, the optimizer decides whether each operation will be evaluated locally at DB2 or remotely at a data source. The decision is based on the output of its cost model, which includes not only the cost to evaluate the operation but also the cost to transmit the data or messages between DB2 and data sources.

The goal is to produce an optimized query; however, many factors can affect the output from global optimization and thus affect query performance. The key factors are discussed in two groups: server characteristics and nickname characteristics.

Server Characteristics/Options Affecting Global Optimization

Data source server factors that can affect global optimization include the:

Nickname Characteristics Affecting Global Optimization

The following sections contain nickname-specific factors that can affect global optimization.

Index Considerations

DB2 can use information about indexes at data sources to optimize queries. For this reason, it is important that the index information available to DB2 is current. The index information for nicknames is initially acquired at create nickname time. Index information is not gathered for view nicknames.

Creating Index Specifications on Nicknames

You can create an index specification for a nickname. Index specifications build an index definition (not an actual index) in the catalog for use by the DB2 optimizer. Use the CREATE INDEX SPECIFICATION ONLY statement to create index specifications. The syntax for creating an index specification on a nickname is similar to the syntax for creating an index on a local table. See the Administration Guide: Planning for more information.
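For example, assuming that the remote table referenced by nickname N1 (from the earlier example) has an index on its LASTNAME column, a matching index specification could be created locally:

   CREATE INDEX IX_LASTNAME ON N1 (LASTNAME)
      SPECIFICATION ONLY;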

Consider creating index specifications when:

Consider your needs before issuing CREATE INDEX statements against a nickname for a view. If the view is a simple SELECT on a table with an index, creating index specifications on the nickname (locally) that match the indexes on the table at the data source can significantly improve query performance. However, if index specifications are created locally over views that are not simple SELECT statements (for example, a view created by joining two tables), query performance may suffer. For example, if an index specification is created over a view that is a join of two tables, the optimizer may choose that view as the inner element in a nested-loop join. The query will perform poorly because the join will be evaluated several times. An alternative is to create nicknames for each of the tables referenced in the data source view and then create a local view at DB2 that references both nicknames.

Catalog Statistics Considerations

Catalog statistics describe the overall size of nicknames and the range of values in associated columns. They are used by the optimizer when calculating the least cost path for processing queries containing nicknames. Nickname statistics are stored in the same catalog views as table statistics. See Chapter 24, System Catalog Statistics and Rules for Updating Table and Nickname Statistics for more information about statistic types and how to update them locally.

While DB2 can retrieve the statistical data held at a data source, it cannot automatically detect updates to existing statistical data at data sources. Furthermore, DB2 has no mechanism for handling object definition or structural changes (such as adding a column) to objects at data sources. If the statistical data or structural data for an object has changed, you have two choices:

Analyzing and Understanding Global Optimization Decisions

This section introduces tools for analyzing query optimization and presents common questions (and suggested areas to investigate) associated with query optimization.

Analyzing Query Optimization

There are two utilities provided with DB2 that show global access plans:

Understanding DB2 Optimization Decisions

This section lists optimization questions and key areas to investigate to improve performance. Key questions include:

