Data Movement Utilities Guide and Reference
The performance of the load utility depends on the nature and the quantity
of the data, the number of indexes, and the load options specified.
Unique indexes reduce load performance if duplicates are
encountered. In most cases, it is still more efficient to create
indexes during the load operation than to invoke the CREATE INDEX statement
for each index after the load operation completes (see Figure 5).
Figure 5. Increasing Load Performance through Concurrent Indexing and Statistics Collection.. Tables are normally built in three steps: data loading, index building, and statistics collection. This causes multiple data I/O during the load operation, during index creation (there can be several indexes for each table), and during statistics collection (which causes I/O on the table data and on all of the indexes). A much faster alternative is to let the load utility complete all of these tasks in one pass through the data.
When tuning index creation performance, the amount of memory dedicated to
the sorting of index keys during a load operation is controlled by the
sortheap database configuration parameter. For example, to
direct the load utility to use 4000 pages of main memory for index key
sorting, set the sortheap database configuration parameter to be 4000
pages, disconnect all applications from the database, and then issue the LOAD
command.
Load performance can be improved by installing high performance sorting
libraries from third party vendors to create indexes during the load
operation. An example of a third party sort product is SyncSort.
Use the DB2SORT environment variable (registry value) to specify
the location of the sorting library that is to be loaded at run time.
For more information about environment variables, see the Administration Guide.
Use of the SET INTEGRITY statement may lengthen the total time needed to
load a table and make it usable again. If all the load operations are
performed in INSERT mode, the SET INTEGRITY statement will check the table for
constraints violations incrementally (by checking only the appended portion of
the table). If a table cannot be checked for constraints violations
incrementally, the entire table is checked, and it may be some time before the
table is usable again.
The load utility performs equally well in INSERT mode and in REPLACE
mode.
The utility attempts to deliver the best performance possible by
determining optimal values for DISK_PARALLELISM, CPU_PARALLELISM, and
DATA_BUFFER, if these parameters have not be specified by the user.
Optimization is done based on the size and the free space available in the
utility heap. Consider allowing the load utility to choose values for
these parameters before attempting to tune them for your particular
needs.
Following is information about the performance implications of various
options available through the load utility:
- ANYORDER
- Specify this file type modifier to suspend the preservation of order in
the data being loaded, and improve performance. If the data to be
loaded is presorted, anyorder may corrupt the presorted order, and
the benefits of presorting will be lost for subsequent queries.
- BINARY NUMERICS and PACKED DECIMAL
- Use these file type modifiers to improve performance when loading
positional numeric ASC data into fixed-length records.
- COPY YES or NO
- Use this parameter to specify whether a copy of the input data is to be
made during a load operation. COPY YES reduces load performance,
because all of the loading data is copied during the load operation (forward
recovery must be enabled); the increased I/O activity may increase the load
time on an I/O-bound system. Specifying multiple devices or directories
(on different disks) can offset some of the performance penalty resulting from
this operation. COPY NO may reduce overall performance, because if
forward recovery is enabled, the table is placed in backup pending state, and
the database, or selected table spaces, must be backed up before the table can
be accessed.
- CPU_PARALLELISM
- Use this parameter to exploit intra-partition parallelism (if this is part
of your machine's capability), and significantly improve load
performance. The parameter specifies the number of processes or threads
used by the load utility to parse, convert, and format data records.
The maximum number allowed is 30. This parameter is particularly useful
when loading presorted data, because record order in the source data is
preserved (see Figure 6). If there is insufficient memory to support the
specified value, the utility adjusts the value. If this parameter is
not specified, the load utility selects a default value that is based on the
number of CPUs on the system.
If tables include either LOB or LONG VARCHAR data, CPU_PARALLELISM is
set to one. Parallelism is not supported in this case.
Although use of this parameter is not restricted to symmetric
multiprocessor (SMP) hardware, you may not obtain any discernible performance
benefit from using it in non-SMP environments.
Figure 6. Record Order in the Source Data is Preserved When Intra-partition Parallelism is Exploited During a Load Operation
- DATA BUFFER
- The DATA BUFFER parameter specifies the total amount of memory allocated
to the load utility as a buffer. It is recommended that this buffer be
several extents in size. An extent is the unit of movement
for data within DB2, and the extent size can be one or more 4KB pages.
The DATA BUFFER parameter is useful when working with large objects (LOBs); it
reduces I/O waiting time. The data buffer is allocated from the utility
heap. Depending on the amount of storage available on your system, you
should consider allocating more memory for use by the DB2 utilities.
The database configuration parameter util_heap_sz can be
modified accordingly. For information about the UPDATE DATABASE
CONFIGURATION command, see the Command Reference. The default value for the Utility Heap Size configuration parameter
is 5 000 4KB pages. Because load is only one of several
utilities that use memory from the utility heap, it is recommended that no
more than fifty percent of the pages defined by this parameter be available
for the load utility, and that the utility heap be defined large
enough. For more information about util_heap_sz, see the
Administration Guide.
- DISK_PARALLELISM
- The DISK_PARALLELISM parameter specifies the number of processes or
threads used by the load utility to write data records to disk. Use
this parameter to exploit available containers when loading data, and
significantly improve load performance. The maximum number allowed is
the greater of four times the CPU_PARALLELISM value (actually used by the
load utility), or 50. By default, DISK_PARALLELISM is equal to the
sum of the table space containers on all table spaces containing objects for
the table being loaded, except where this value exceeds the maximum number
allowed.
- FASTPARSE
- Use the fastparse file type modifier to reduce the data
checking that is performed on user-supplied column values, and enhance
performance. This option should only be used when the data being loaded
is known to be valid. It can improve performance by about 10 or 20
percent.
- NONRECOVERABLE
- Use this parameter if you do not need to be able to recover load
transactions against a table. Load performance is enhanced, because no
additional activity beyond the movement of data into the table is required,
and the load operation completes without leaving the table spaces in backup
pending state.
Note: | When these load transactions are encountered during subsequent restore and
roll-forward recovery, the table is not updated, and is marked
"invalid". Further actions against this table are ignored. After
the roll-forward operation is complete, the table can only be dropped.
|
- NOROWWARNINGS
- Use the norowwarnings file type modifier to suppress the
recording of warnings about rejected rows, and enhance performance, if you
anticipate a large number of warnings.
- SAVECOUNT
- Use this parameter to set an interval for the establishment of consistency
points during a load operation. The synchronization of activities
performed to establish a consistency point takes time. If done too
frequently, there will be a noticeable reduction in load performance.
If a very large number of rows is to be loaded, it is recommended that a large
SAVECOUNT value be specified (for example, a value of ten million in the case
of a load operation involving 100 million records).
A LOAD RESTART operation will automatically continue from the last
consistency point.
- STATISTICS YES
- Use this parameter to collect data distribution and index statistics more
efficiently than through invocation of the runstats utility following
completion of the load operation, even though performance of the load
operation itself will decrease (particularly when DETAILED INDEXES ALL is
specified).
For optimal performance, applications require the best data distribution
and index statistics possible. Once the statistics are updated,
applications can use new access paths to the table data based on the latest
statistics. New access paths to a table can be created by rebinding the
application packages using the DB2 BIND command (see the Command Reference).
When loading data into large tables, it is recommended that a larger value
for the stat_heap_sz (Statistics Heap Size) database
configuration parameter be specified. For information about the UPDATE
DATABASE CONFIGURATION command, see the Command
Reference. For more information about stat_heap_sz, see
the Administration Guide.
- WARNINGCOUNT
- Use this parameter to specify the number of warnings that can be returned
by the utility before a load operation is forced to terminate. If you
are expecting only a few warnings or no warnings, set the WARNINGCOUNT
parameter to approximately the number you are expecting, or to twenty if you
are expecting no warnings. The load operation will stop after the
WARNINGCOUNT number is reached. This gives you the opportunity to
correct data (or to drop and then recreate the table being loaded) before
attempting to complete the load operation. Although not having a direct
effect on the performance of the load operation, the establishment of a
WARNINGCOUNT threshold prevents you from having to wait until the entire load
operation completes before determining that there is a problem.
[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]
[ DB2 List of Books |
Search the DB2 Books ]