3 Advanced Features : Using DataDirect Bulk Load

Using DataDirect Bulk Load
DataDirect Bulk Load is available in the following DataDirect Connect Series for ODBC drivers:
 
DataDirect Bulk Load is a feature that allows your application to send large numbers of rows of data to the database in a continuous stream instead of in numerous smaller database protocol packets. Although similar to parameter array batch operations, performance improves because far fewer network round trips are required. Bulk load bypasses the data parsing usually done by the database, providing an additional performance gain over batch operations.
IMPORTANT: Because a bulk load operation bypasses data integrity checks, your application must ensure that the data it is transferring does not violate integrity constraints in the database. For example, suppose you are bulk loading data into a database table and some of that data duplicates data stored as a primary key, which must be unique. The driver does not return an error; your application must provide its own data integrity checks.
DataDirect Bulk Load provides a simple method to do bulk load operations across databases without altering your application; otherwise, you would have to deal with the different proprietary tools of each database vendor. DataDirect Bulk Load works in a consistent way for all DataDirect Connect products that support bulk load functionality.
NOTE: DataDirect Bulk Load requires a licensed installation of the drivers for full bulk load functionality. If the drivers are installed with an evaluation license, the bulk load feature is available for prototyping with your applications, but with limited scope. Contact your sales representative or DataDirect SupportLink for further information.
Bulk load operations are accomplished by exporting the results of a query from a database into a comma-separated value (CSV) file, a bulk load data file. The driver then loads the data from bulk load data file into a different database. The file can be used by any DataDirect Connect Series for ODBC drivers. In addition, the bulk load data file is supported by other DataDirect Connect product lines that feature bulk loading, for example, a DataDirect Connect for ADO.NET data provider that supports bulk load.
Suppose that you had customer data on a Sybase server and needed to export it to a DB2 server. The driver would perform the following steps:
The figure shows how the driver bulk exports and loads the results of a query.
1
2
3
4
Bulk Export and Load Methods
You can take advantage of DataDirect Bulk Load either through the Driver setup dialog or programmatically.
Applications that are already coded to use parameter array batch functionality can leverage DataDirect Bulk Load features through the Enable Bulk Load connection option on the Bulk tab of the Driver setup dialog. Enabling this option automatically converts the parameter array batch operation to use the database bulk load protocol without any code changes to your application.
If you are not using parameter array batch functionality, the bulk operation buttons Export Table and Load Table on the Bulk tab of the driver Setup dialog also allow you to use bulk load functionality without any code changes. See the individual driver chapters for a description of the Bulk tab.
If you want to integrate bulk load functionality seamlessly into your application, you can include code to use the bulk load functions exposed by the driver.
NOTE: For your applications to use DataDirect Bulk Load functionality, they must obtain driver connection handles and function pointers, as follows:
1
Use SQLGetInfo with the parameter SQL_DRIVER_HDBC to obtain the driver’s connection handle from the Driver Manager.
2
Use SQLGetInfo with the parameter SQL_DRIVER_HLIB to obtain the driver’s shared library or DLL handle from the Driver Manager.
3
Obtain function pointers to the bulk load functions using the function name resolution method specific to your operating system. The bulk.c example program shipped with the drivers contains the function resolveName that illustrates how to obtain function pointers to the bulk load functions.
This is detailed in the code samples that follow and in “DataDirect Bulk Load Functions” in Chapter 9 of the DataDirect Connect Series for ODBC Reference
Exporting Data from a Database
You can export data from a database in one of three ways:
n
n
n
From the DataDirect driver Setup dialog, select the Bulk tab and click Export Table. See the individual driver chapters for a description of this procedure.
Your application can export a table using the DataDirect functions ExportTableToFile (ANSI application) or ExportTableToFileW (Unicode application). The application must first obtain driver connection handles and function pointers, as shown in the following example:
HDBC hdbc;
HENV henv;
void *driverHandle;
HMODULE hmod;
PExportTableToFile exportTableToFile;
 
char tableName[128];
char fileName[512];
char logFile[512];
int errorTolerance;
int warningTolerance;
int codePage;
 
/* Get the driver's connection handle from the DM.
This handle must be used when calling directly into the driver. */
 
rc = SQLGetInfo (hdbc, SQL_DRIVER_HDBC, &driverHandle, 0, NULL);
if (rc != SQL_SUCCESS) {
ODBC_error (henv, hdbc, SQL_NULL_HSTMT);
EnvClose (henv, hdbc);
exit (255);
}
 
/* Get the DM's shared library or DLL handle to the driver. */
 
rc = SQLGetInfo (hdbc, SQL_DRIVER_HLIB, &hmod, 0, NULL);
if (rc != SQL_SUCCESS) {
ODBC_error (henv, hdbc, SQL_NULL_HSTMT);
EnvClose (henv, hdbc);
exit (255);
}
 
exportTableToFile = (PExportTableToFile)
resolveName (hmod, "ExportTableToFile");
if (! exportTableToFile) {
printf ("Cannot find ExportTableToFile!\n");
exit (255);
}
 
rc = (*exportTableToFile) (
driverHandle,
(const SQLCHAR *) tableName,
(const SQLCHAR *) fileName,
codePage,
errorTolerance, warningTolerance,
(const SQLCHAR *) logFile);
if (rc == SQL_SUCCESS) {
printf ("Export succeeded.\n");
}
else {
driverError (driverHandle, hmod);
}
Refer to “DataDirect Bulk Load Functions” in Chapter 9 of the DataDirect Connect Series for ODBC Reference for a full description of these functions.
Your application can export a result set using the DataDirect statement attributes SQL_BULK_EXPORT and SQL_BULK_EXPORT_PARAMS. Refer to “DataDirect Bulk Load Statement Attributes” in Chapter 9 of the DataDirect Connect Series for ODBC Reference for a full description of these attributes.
The export operation creates a bulk load data file with a .csv extension in which the exported data is stored. For example, assume that an Oracle source table named GBMAXTABLE contains four columns. The resulting bulk load data file GBMAXTABLE.csv containing the results of a query would be similar to the following:
1,0x6263,"bc","bc"
2,0x636465,"cde","cde"
3,0x64656667,"defg","defg"
4,0x6566676869,"efghi","efghi"
5,0x666768696a6b,"fghijk","fghijk"
6,0x6768696a6b6c6d,"ghijklm","ghijklm"
7,0x68696a6b6c6d6e6f,"hijklmno","hijklmno"
8,0x696a6b6c6d6e6f7071,"ijklmnopq","ijklmnopq"
9,0x6a6b6c6d6e6f70717273,"jklmnopqrs","jklmnopqrs"
10,0x6b,"k","k"
A bulk load configuration file with and .xml extension is also created when either a table or a result set is exported to a bulk load data file. See “The Bulk Load Configuration File” for an example of a bulk load configuration file.
In addition, a log file of events as well as external overflow files can be created during a bulk export operation. The log file is configured through either the driver Setup dialog Bulk tab, the ExportTableToFile function, or the SQL_BULK_EXPORT statement attribute. The external overflow files are configured through connection options; see “External Overflow Files” for details.
Bulk Loading to a Database
The Enable Bulk Load connection option specifies the method by which bulk data is loaded to a database. When the option is enabled, the driver uses database bulk load protocols. When not enabled, the driver uses standard parameter arrays.
You can load data from the bulk load data file into the target database through the DataDirect driver Setup dialog by selecting the Bulk tab and clicking Load Table. See the individual driver chapters of the drivers that support bulk load for a description of this procedure.
Your application can also load data from the bulk load data file into the target database using the using the DataDirect functions LoadTableFromFile (ANSI application) or LoadTableFromFileW (Unicode application). The application must first obtain driver connection handles and function pointers, as shown in the following example:
HDBC hdbc;
HENV henv;
void *driverHandle;
HMODULE hmod;
PLoadTableFromFile loadTableFromFile;
 
char tableName[128];
char fileName[512];
char configFile[512];
char logFile[512];
char discardFile[512];
int errorTolerance;
int warningTolerance;
int loadStart;
int loadCount;
int readBufferSize;
 
/* Get the driver's connection handle from the DM.
This handle must be used when calling directly into the driver. */
 
rc = SQLGetInfo (hdbc, SQL_DRIVER_HDBC, &driverHandle, 0, NULL);
if (rc != SQL_SUCCESS) {
ODBC_error (henv, hdbc, SQL_NULL_HSTMT);
EnvClose (henv, hdbc);
exit (255);
}
 
/* Get the DM's shared library or DLL handle to the driver. */
 
rc = SQLGetInfo (hdbc, SQL_DRIVER_HLIB, &hmod, 0, NULL);
if (rc != SQL_SUCCESS) {
ODBC_error (henv, hdbc, SQL_NULL_HSTMT);
EnvClose (henv, hdbc);
exit (255);
}
 
loadTableFromFile = (PLoadTableFromFile)
resolveName (hmod, "LoadTableFromFile");
if (! loadTableFromFile) {
printf ("Cannot find LoadTableFromFile!\n");
exit (255);
}
 
rc = (*loadTableFromFile) (
driverHandle,
(const SQLCHAR *) tableName,
(const SQLCHAR *) fileName,
errorTolerance, warningTolerance,
(const SQLCHAR *) configFile,
(const SQLCHAR *) logFile,
(const SQLCHAR *) discardFile,
loadStart, loadCount,
readBufferSize);
if (rc == SQL_SUCCESS) {
printf ("Load succeeded.\n");
}
else {
driverError (driverHandle, hmod);
}
Refer to “DataDirect Bulk Load Functions” in Chapter 9 of the DataDirect Connect Series for ODBC Reference for a full description of these functions.
NOTE FOR SYBASE USERS: Additional database configuration is required for destination tables that do not have an index. See “Using DataDirect Bulk Load on Sybase” for more information.
A log file of events as well as a discard file that contains rows rejected during the load can be created during a bulk load operation. These files are configured through either the driver Setup dialog Bulk tab or the LoadTableFromFile function.
The discard file is in the same format as the bulk load data file. After fixing reported issues in the discard file, the bulk load can be reissued using the discard file as the bulk load data file.
The Bulk Load Configuration File
A bulk load configuration file is created when either a table or a result set is exported to a bulk load data file. This file has the same name as the bulk load data file, but with an .xml extension.
The bulk load configuration file defines in its metadata the names and data types of the columns in the bulk load data file. The file defines these names and data types based on the table or result set created by the query that exported the data.
It also defines other data properties, such as length for character and binary data types, the character encoding code page for character types, precision and scale for numeric types, and nullablity for all types.
When a bulk load data file cannot read its configuration file, the following defaults are assumed:
n
n
The default character set is defined, on Windows, by the current Windows code page. On UNIX/Linux, it is the IANAAppCodePage value, which defaults to 4.
For example, the format of the bulk load data file GBMAXTABLE.csv (discussed in “Exporting Data from a Database”) is defined by the bulk load configuration file, GBMAXTABLE.xml, as follows:
<?xml version="1.0" encoding="utf-8"?>
<table codepage="UTF-16LE" xsi:noNamespaceSchemaLocation=
"http://www.datadirect.com/ns/bulk/BulkData.xsd" xmlns:xsi=
"http://www.w3.org/2001/XMLSchema-instance">
<row>
<column datatype="DECIMAL" precision="38" scale="0" nullable=
"false">INTEGERCOL</column>
<column datatype="VARBINARY" length="10" nullable=
"true">VARBINCOL</column>
<column datatype="VARCHAR" length="10" sourcecodepage="Windows-1252"
externalfilecodepage="Windows-1252" nullable="true">VCHARCOL</column>
<column datatype="VARCHAR" length="10" sourcecodepage="Windows-1252"
externalfilecodepage="Windows-1252" nullable="true">UNIVCHARCOL</column>
</row>
</table>
Bulk Load Configuration File Schema
The bulk load configuration file is supported by an underlying XML Schema defined at:
http://www.datadirect.com/ns/bulk/BulkData.xsd
The bulk load configuration file must conform to the bulk load configuration XML schema. Each bulk export operation generates a bulk load configuration file in UTF-8 format. If the bulk load data file cannot be created or does not comply with the XML Schema described in the bulk load configuration file, an error is generated.
Verification of the Bulk Load Configuration File
You can verify the metadata in the configuration file against the data structure of the target database table. This insures that the data in the bulk load data file is compatible with the target database table structure.
The verification does not check the actual data in the bulk load data file, so it is possible that the load can fail even though the verification succeeds. For example, if you were to update the bulk load data file manually such that it has values that exceed the maximum column length of a character column in the target table, the load would fail.
Not all of the error messages or warnings that are generated by verification necessarily mean that the load will fail. Many of the messages simply notify you about possible incompatibilities between the source and target tables. For example, if the bulk load data file has a column that is defined as an integer and the column in the target table is defined as smallint, the load may still succeed if the values in the source column are small enough that they fit in a smallint column.
To verify the metadata in the bulk load configuration file through the DataDirect driver Setup dialog, select the Bulk tab and click Verify. See the individual driver chapters of the drivers that support bulk load for a description of this procedure.
Your application can also verify the metadata of the bulk load configuration file using the DataDirect functions ValidateTableFromFile (ANSI application) or ValidateTableFromFileW (Unicode application). The application must first obtain driver connection handles and function pointers, as shown in the following example:
HDBC hdbc;
HENV henv;
void *driverHandle;
HMODULE hmod;
PValidateTableFromFile validateTableFromFile;
 
char tableName[128];
char configFile[512];
char messageList[10240];
SQLLEN numMessages;
 
/* Get the driver's connection handle from the DM.
This handle must be used when calling directly into the driver. */
 
rc = SQLGetInfo (hdbc, SQL_DRIVER_HDBC, &driverHandle, 0, NULL);
if (rc != SQL_SUCCESS) {
ODBC_error (henv, hdbc, SQL_NULL_HSTMT);
EnvClose (henv, hdbc);
exit (255);
}
 
/* Get the DM's shared library or DLL handle to the driver. */
 
rc = SQLGetInfo (hdbc, SQL_DRIVER_HLIB, &hmod, 0, NULL);
if (rc != SQL_SUCCESS) {
ODBC_error (henv, hdbc, SQL_NULL_HSTMT);
EnvClose (henv, hdbc);
exit (255);
}
 
validateTableFromFile = (PValidateTableFromFile)
resolveName (hmod, "ValidateTableFromFile");
if (!validateTableFromFile) {
printf ("Cannot find ValidateTableFromFile!\n");
exit (255);
}
 
messageList[0] = 0;
numMessages = 0;
 
rc = (*validateTableFromFile) (
driverHandle,
(const SQLCHAR *) tableName,
(const SQLCHAR *) configFile,
(SQLCHAR *) messageList,
sizeof (messageList),
&numMessages);
printf ("%d message%s%s\n", numMessages,
(numMessages == 0) ? "s" :
((numMessages == 1) ? " : " : "s : "),
(numMessages > 0) ? messageList : "");
if (rc == SQL_SUCCESS) {
printf ("Validate succeeded.\n");
}
else {
driverError (driverHandle, hmod);
}
Refer to “DataDirect Bulk Load Functions” in Chapter 9 of the DataDirect Connect Series for ODBC Reference for a complete description of these functions.
Sample Application
DataDirect Tehcnologies provides a sample application that demonstrates the bulk export, verification, and bulk load operations. This application is located in the \example\bulk subdirectory of the product installation directory along with a text file named bulk.txt. Please consult bulk.txt for instructions on using the sample bulk load application.
Determining the Bulk Load Protocol
Bulk operations can be performed using a dedicated bulk load protocol, that is, the protocol of the underlying database system, or by using parameter array batch operations. Dedicated protocols are generally more performance-efficient than parameter arrays. In some cases, however, you must use parameter arrays, for example, when the data to be loaded is in a data type not supported by the dedicated bulk protocol.
The Enable Bulk Load connection option determines bulk load behavior. When the option is enabled, the driver uses database bulk load protocols unless it encounters a problem, in which case it returns an error. In this situation, you must disable Enable Bulk Load so that the driver uses standard parameter arrays.
Character Set Conversions
It is most performance-efficient to transfer data between databases that use the same character sets. At times, however, you might need to bulk load data between databases that use different character sets. You can do this by choosing a character set for the bulk load data file that will accommodate all data. If the source table contains character data that uses different character sets, then one of the Unicode character sets, UTF-8, UTF-16BE, or UTF-16LE must be specified for the bulk load data file. A Unicode character set should also be specified in the case of a target table uses a different character set than the source table to minimize conversion errors. If the source and target tables use the same character set, that set should be specified for the bulk load data file.
A character set is defined by a code page. The code page for the bulk load data file is defined in the configuration file and is specified through either the Code Page option of the Export Table driver Setup dialog or through the IANAAppCodePage parameter of the ExportTableToFile function. Any code page listed in Chapter 1 “Code Page Values”of the DataDirect Connect Series for ODBC Reference is supported for the bulk load data file.
Any character conversion errors are handled based on the value of the Report CodePage ConversionErrors connection option. See the individual driver chapters for a description of this option.
The configuration file may optionally define a second code page value for each character column (externalfilecodepage). If character data is stored in an external overflow file (see “External Overflow Files”), this second code page value is used for the external file.
External Overflow Files
In addition to the bulk load data file, DataDirect Bulk Load can store bulk data in external overflow files. These overflow files are located in the same directory as the bulk load data file. Whether or not to use external overflow files is a performance consideration. For example, binary data is stored as hexadecimal-encoded character strings in the main bulk load data file, which increases the size of the file per unit of data stored. External files do not store binary data as hex character strings, and, therefore, require less space. On the other hand, more overhead is required to access external files than to access a single bulk load data file, so each bulk load situation must be considered individually.
The value of the Bulk Binary Threshold connection option determines the threshold, in KB, over which binary data is stored in external files instead of in the bulk load data file. Likewise, the Bulk Character Threshold connection option determines the threshold for character data.
In the case of an external character data file, the character set of the file is governed by the bulk load configuration file. If the bulk load data file is Unicode and the maximum character size of the source data is 1, then the data is stored in its source code page. See “Character Set Conversions”.
The file name of the external file contains the bulk load data file name, a six-digit number, and a ".lob" extension in the following format: CSVfilename_nnnnnn.lob. Increments start at 000001.lob.
Summary of Bulk Load Related Options
Table 3-6 summarizes how DataDirect Bulk Load related connection options work with the drivers. See "Connection Option Descriptions" in each driver chapter for details about configuring the options.
 
An integer value that specifies the number of rows at a time that the driver sends to the database during bulk operations.
An integer value that specifies the maximum size, in KB, of binary data exported to the bulk data file. Any data exceeding this size is exported to an external file.
An integer value that specifies the maximum size, in KB, of character data exported to the bulk data file. Any data exceeding this size is exported to an external file.
When enabled, the driver uses database bulk load protocols. When not enabled, the driver uses standard parameter arrays.