IBM Books

Data Movement Utilities Guide and Reference


Optimizing Load Performance

The performance of the load utility depends on the nature and the quantity of the data, the number of indexes, and the load options specified.

Unique indexes reduce load performance if duplicates are encountered. In most cases, it is still more efficient to create indexes during the load operation than to invoke the CREATE INDEX statement for each index after the load operation completes (see Figure 5).

Figure 5. Increasing Load Performance through Concurrent Indexing and Statistics Collection.. Tables are normally built in three steps: data loading, index building, and statistics collection. This causes multiple data I/O during the load operation, during index creation (there can be several indexes for each table), and during statistics collection (which causes I/O on the table data and on all of the indexes). A much faster alternative is to let the load utility complete all of these tasks in one pass through the data.

IMAGE9


When tuning index creation performance, the amount of memory dedicated to the sorting of index keys during a load operation is controlled by the sortheap database configuration parameter. For example, to direct the load utility to use 4000 pages of main memory for index key sorting, set the sortheap database configuration parameter to be 4000 pages, disconnect all applications from the database, and then issue the LOAD command.

Load performance can be improved by installing high performance sorting libraries from third party vendors to create indexes during the load operation. An example of a third party sort product is SyncSort. Use the DB2SORT environment variable (registry value) to specify the location of the sorting library that is to be loaded at run time. For more information about environment variables, see the Administration Guide.

Use of the SET INTEGRITY statement may lengthen the total time needed to load a table and make it usable again. If all the load operations are performed in INSERT mode, the SET INTEGRITY statement will check the table for constraints violations incrementally (by checking only the appended portion of the table). If a table cannot be checked for constraints violations incrementally, the entire table is checked, and it may be some time before the table is usable again.

The load utility performs equally well in INSERT mode and in REPLACE mode.

The utility attempts to deliver the best performance possible by determining optimal values for DISK_PARALLELISM, CPU_PARALLELISM, and DATA_BUFFER, if these parameters have not be specified by the user. Optimization is done based on the size and the free space available in the utility heap. Consider allowing the load utility to choose values for these parameters before attempting to tune them for your particular needs.

Following is information about the performance implications of various options available through the load utility:

ANYORDER
Specify this file type modifier to suspend the preservation of order in the data being loaded, and improve performance. If the data to be loaded is presorted, anyorder may corrupt the presorted order, and the benefits of presorting will be lost for subsequent queries.

BINARY NUMERICS and PACKED DECIMAL
Use these file type modifiers to improve performance when loading positional numeric ASC data into fixed-length records.

COPY YES or NO
Use this parameter to specify whether a copy of the input data is to be made during a load operation. COPY YES reduces load performance, because all of the loading data is copied during the load operation (forward recovery must be enabled); the increased I/O activity may increase the load time on an I/O-bound system. Specifying multiple devices or directories (on different disks) can offset some of the performance penalty resulting from this operation. COPY NO may reduce overall performance, because if forward recovery is enabled, the table is placed in backup pending state, and the database, or selected table spaces, must be backed up before the table can be accessed.

CPU_PARALLELISM
Use this parameter to exploit intra-partition parallelism (if this is part of your machine's capability), and significantly improve load performance. The parameter specifies the number of processes or threads used by the load utility to parse, convert, and format data records. The maximum number allowed is 30. This parameter is particularly useful when loading presorted data, because record order in the source data is preserved (see Figure 6). If there is insufficient memory to support the specified value, the utility adjusts the value. If this parameter is not specified, the load utility selects a default value that is based on the number of CPUs on the system.

If tables include either LOB or LONG VARCHAR data, CPU_PARALLELISM is set to one. Parallelism is not supported in this case.

Although use of this parameter is not restricted to symmetric multiprocessor (SMP) hardware, you may not obtain any discernible performance benefit from using it in non-SMP environments.

Figure 6. Record Order in the Source Data is Preserved When Intra-partition Parallelism is Exploited During a Load Operation

IMAGE3


DATA BUFFER
The DATA BUFFER parameter specifies the total amount of memory allocated to the load utility as a buffer. It is recommended that this buffer be several extents in size. An extent is the unit of movement for data within DB2, and the extent size can be one or more 4KB pages. The DATA BUFFER parameter is useful when working with large objects (LOBs); it reduces I/O waiting time. The data buffer is allocated from the utility heap. Depending on the amount of storage available on your system, you should consider allocating more memory for use by the DB2 utilities. The database configuration parameter util_heap_sz can be modified accordingly. For information about the UPDATE DATABASE CONFIGURATION command, see the Command Reference. The default value for the Utility Heap Size configuration parameter is 5 000 4KB pages. Because load is only one of several utilities that use memory from the utility heap, it is recommended that no more than fifty percent of the pages defined by this parameter be available for the load utility, and that the utility heap be defined large enough. For more information about util_heap_sz, see the Administration Guide.

DISK_PARALLELISM
The DISK_PARALLELISM parameter specifies the number of processes or threads used by the load utility to write data records to disk. Use this parameter to exploit available containers when loading data, and significantly improve load performance. The maximum number allowed is the greater of four times the CPU_PARALLELISM value (actually used by the load utility), or 50. By default, DISK_PARALLELISM is equal to the sum of the table space containers on all table spaces containing objects for the table being loaded, except where this value exceeds the maximum number allowed.

FASTPARSE
Use the fastparse file type modifier to reduce the data checking that is performed on user-supplied column values, and enhance performance. This option should only be used when the data being loaded is known to be valid. It can improve performance by about 10 or 20 percent.

NONRECOVERABLE
Use this parameter if you do not need to be able to recover load transactions against a table. Load performance is enhanced, because no additional activity beyond the movement of data into the table is required, and the load operation completes without leaving the table spaces in backup pending state.
Note:When these load transactions are encountered during subsequent restore and roll-forward recovery, the table is not updated, and is marked "invalid". Further actions against this table are ignored. After the roll-forward operation is complete, the table can only be dropped.

NOROWWARNINGS
Use the norowwarnings file type modifier to suppress the recording of warnings about rejected rows, and enhance performance, if you anticipate a large number of warnings.

SAVECOUNT
Use this parameter to set an interval for the establishment of consistency points during a load operation. The synchronization of activities performed to establish a consistency point takes time. If done too frequently, there will be a noticeable reduction in load performance. If a very large number of rows is to be loaded, it is recommended that a large SAVECOUNT value be specified (for example, a value of ten million in the case of a load operation involving 100 million records).

A LOAD RESTART operation will automatically continue from the last consistency point.

STATISTICS YES
Use this parameter to collect data distribution and index statistics more efficiently than through invocation of the runstats utility following completion of the load operation, even though performance of the load operation itself will decrease (particularly when DETAILED INDEXES ALL is specified).

For optimal performance, applications require the best data distribution and index statistics possible. Once the statistics are updated, applications can use new access paths to the table data based on the latest statistics. New access paths to a table can be created by rebinding the application packages using the DB2 BIND command (see the Command Reference).

When loading data into large tables, it is recommended that a larger value for the stat_heap_sz (Statistics Heap Size) database configuration parameter be specified. For information about the UPDATE DATABASE CONFIGURATION command, see the Command Reference. For more information about stat_heap_sz, see the Administration Guide.

WARNINGCOUNT
Use this parameter to specify the number of warnings that can be returned by the utility before a load operation is forced to terminate. If you are expecting only a few warnings or no warnings, set the WARNINGCOUNT parameter to approximately the number you are expecting, or to twenty if you are expecting no warnings. The load operation will stop after the WARNINGCOUNT number is reached. This gives you the opportunity to correct data (or to drop and then recreate the table being loaded) before attempting to complete the load operation. Although not having a direct effect on the performance of the load operation, the establishment of a WARNINGCOUNT threshold prevents you from having to wait until the entire load operation completes before determining that there is a problem.


[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]