Data Movement Utilities Guide and Reference

Exporting Data in Parallel

Exporting data in parallel reduces data transfer, and distributes the writing of the result set, as well as the generation of the formatted output, across nodes in a more effective manner than would otherwise be the case. When data is exported in parallel (by invoking multiple export operations, one for each partition of a table), it is extracted, converted on the local nodes, and then written to the local file system. In contrast, when exporting data serially (exporting through a single export operation), it is extracted in parallel and then shipped to the client, where a single process performs conversion and writes the result set to a local file system.

The db2batch command is used to monitor the performance characteristics and execution duration of SQL statements. This utility also has a parallel export function in partitioned database environments that:

Runs queries to define the data to be exported
On each partition, creates a file containing the exported data that resides on that partition.

There are two types of parallel export, depending on what tables are queried, and where those tables exist in the partitioned database environment:

Parallel export from a single partition table, or a join or subquery on multiple tables that are collocated (specify -p s on the db2batch command).
There are two ways in which tables can be considered collocated:
- The tables are in a single partition nodegroup defined on the same partition.
- The tables are in the same nodegroup, and have partitioning keys with the same number and type of columns. The corresponding columns of the partitioning key are partition compatible, and the tables are equi-joined on the entire partition key, or a superset of the partition key.
In each case, the query can be run on each partition to generate that partition's export data file using the NODENUMBER function as described below. (Note that if a table exists in a single partition only, export parallelism is negated, because the data is retrieved from only one partition. To enable export parallelism in this case, refer to the next bullet.)
A SELECT from multiple non-collocated tables (specify -p t tablename or -p d on the db2batch command; the former allows you to specify the name of an existing table to use as a staging table, while the latter causes the export utility to create a staging table).
The export utility uses a staging table that is populated through the export query. This staging table is used to locate the rows of the "export" result set by running an INSERT of the fullselect query into the staging table. Once the staging table is created, the export utility generates an export data file at each partition by running:
```
   "select * WHERE NODENUMBER(colname) = CURRENT NODE"
```
on the staging table.
A staging table can also be used to export a single partition table in parallel. In most cases, transferring the data from a single partition into a multi-partition staging table, and then exporting the staging table in parallel on all partitions, is faster than exporting the single partition table serially.

The export utility runs a query in parallel on each partition to retrieve the data on that partition. In the case of db2batch -p s, the original SELECT query is run in parallel. In the case of db2batch -p t and db2batch -p d, a staging table is loaded with the export data, using the specified query, and a SELECT * query is run on the staging table in parallel on each partition to export the data. To export only the data that resides on a given partition, db2batch adds the predicate NODENUMBER(colname) = CURRENT NODE to the WHERE clause of the query that is run on that partition. The colname parameter must be set to the qualified or the unqualified name of a table column. The export utility uses the first column name in the original query to set this parameter.

Following are the limitations on queries used by the export utility:

When specifying db2batch -p s, the query must not contain only column functions, because a column name is needed for the NODENUMBER colname predicate.
When specifying db2batch -p s, aggregates (such as min, max, and avg) must be based on a grouping that includes the partitioning key.
When specifying db2batch -p t or db2batch -p d, the query cannot contain ORDER BY, because ORDER BY on a fullselect within an INSERT statement is not supported by DB2 UDB.

If, when specifying -p s on the db2batch command, you also use the -r option to create result output files, the files on each partition will be in sorted order if you have an ORDER BY clause. If a single sorted file is your objective, merge the sorted file on each partition into one sorted file. For example, on UNIX based systems, use the command sort -m to merge the files into a single sorted file. If you are sending your output to an NFS mounted file system, the output will not be sorted, even if you specify the ORDER BY clause.

[ Top of Page | Previous Page | Next Page ]