Using DataDirect Bulk Load
DataDirect Bulk Load is a feature that allows your application to send large numbers of rows of data to the database in a continuous stream instead of in numerous smaller database protocol packets. Bulk load is similar to parameter array batch operations, but performs better because far fewer network round trips are required. Bulk load also bypasses the data parsing usually done by the database, providing an additional performance gain over batch operations.
IMPORTANT: Because a bulk load operation bypasses data integrity checks, your application must ensure that the data it is transferring does not violate integrity constraints in the database. For example, suppose you are bulk loading data into a database table and some of that data duplicates data stored as a primary key, which must be unique. The data provider does not return an error; your application must provide its own data integrity checks.
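As one illustration of an application-side integrity check, the following minimal sketch looks for duplicate key values in the rows about to be loaded. The class and method names are hypothetical and are not part of any DataDirect assembly. The check covers only duplicates within the batch itself; detecting collisions with rows already in the target table would still require a query against that table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of any DataDirect assembly.
public static class KeyCheckSketch
{
    // Returns each key value that occurs more than once, in the order
    // in which its first repeat is encountered.
    public static List<T> FindDuplicateKeys<T>(IEnumerable<T> keys)
    {
        var seen = new HashSet<T>();
        var duplicates = new List<T>();
        foreach (T key in keys)
        {
            // HashSet.Add returns false when the value is already present.
            if (!seen.Add(key) && !duplicates.Contains(key))
                duplicates.Add(key);
        }
        return duplicates;
    }

    public static void Main()
    {
        int[] primaryKeys = { 1, 2, 3, 2, 5, 1 };
        List<int> duplicates = FindDuplicateKeys(primaryKeys);
        // Prints: 2, 1
        Console.WriteLine(string.Join(", ", duplicates));
    }
}
```

If the returned list is not empty, the application can reject or repair the batch before starting the bulk load.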
DataDirect Bulk Load provides a simple method to do bulk load operations across databases without altering your application; otherwise, you would have to deal with the different proprietary tools of each database vendor. DataDirect Bulk Load works in a consistent way for all DataDirect Connect products that support bulk load functionality.
Bulk load operations are accomplished by exporting the results of a query from a database into a comma-separated value (CSV) file, a bulk load data file. The data provider then loads the data from the bulk load data file into a different database. The file can be used by any DataDirect Connect for ADO.NET data provider. In addition, the bulk load data file is supported by other DataDirect Connect products that feature bulk loading, for example, a DataDirect Connect Series for ODBC driver that supports bulk load.
Suppose you need to load data into Oracle, DB2, and Sybase. In the past, you probably had to use a proprietary tool from each database vendor for bulk load operations, or write your own tool. Now, because of the interoperability built into DataDirect Bulk Load, your task is much easier. Another advantage is that DataDirect Bulk Load uses 100% managed code and requires no underlying utilities or libraries from other vendors.
The DataDirect Bulk Load implementation for ADO.NET uses the de facto standard defined by the Microsoft SqlBulkCopy classes, and adds powerful built-in features to enhance interoperability as well as the flexibility to make bulk operations more reliable.
The DataDirect Connect for ADO.NET data providers include provider-specific classes to support DataDirect Bulk Load. See “Provider-specific Classes” for more information. If you use the Common Programming Model, you can use the classes in the DataDirect Common Assembly (see “DataDirect Common Assembly”).
DataDirect Common Assembly
The DDTek.Data.Common assembly provides features that apply to all of the DataDirect Connect for ADO.NET data providers. In this release, the Common assembly includes classes that support DataDirect Bulk Load, such as the CsvDataReader and CsvDataWriter classes that provide functionality between bulk data formats.
The Common assembly also extends support for bulk load classes that use the Common Programming Model. This means that the SqlBulkCopy patterns can now be used in a new DbBulkCopy hierarchy.
See “DataDirect Classes” for more information on the classes supported by the DDTek.Data.Common assembly.
Use Scenarios for DataDirect Bulk Load
Two of the ways you can use DataDirect Bulk Load with the DataDirect data providers are:
One use would be copying data between data sources from the same vendor. For example, after upgrading to a new version of Oracle, you copy inventory data from an Oracle 10g data source to an Oracle 11g data source, as shown in Figure 3-3.
Figure 3-3. Using DataDirect Bulk Load Between Two Data Sources
The .NET application sends a query to the Oracle 10g server using the Oracle data provider. The results of the query are sent in IDataReader using Bulk Copy to the Oracle 11g server.
Figure 3-4 shows an ODBC environment copying data to an ADO.NET database server.
Figure 3-4. Copying Data from ODBC to ADO.NET
An ODBC application sends a query to the Sybase server. A CSV file with the results is sent to the ADO.NET application, which sends the results to an Oracle 11g server using the ADO.NET data provider.
In this figure, the ODBC application includes code to export data to the CSV file, and the ADO.NET application includes code to specify and open the CSV file. Because the DataDirect Connect for ADO.NET data providers and DataDirect Connect for ODBC drivers use a consistent format, interoperability is supported via these standard interfaces.
Bulk Load Data File
Bulk load operations between dissimilar data stores are accomplished by persisting the results of the query in a comma-separated value (CSV) format file, a bulk load data file. The file can be used between any DataDirect Connect for ADO.NET data providers that support bulk load. In addition, the bulk load data file can be used with any DataDirect Connect product that supports the bulk load functionality. For example, the CSV file generated by a DataDirect Connect for ADO.NET data provider can be used by a DataDirect Connect for ODBC driver that supports bulk load.
Example
The Oracle source table GBMAXTABLE contains four columns. The following C# code fragment uses the CsvDataWriter to write the GBMAXTABLE.csv and GBMAXTABLE.xml files. Note that this example uses the DbDataReader class.
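// Query the source table; cmd is assumed to be a command on an open connection.
cmd.CommandText = "SELECT * FROM GBMAXTABLE ORDER BY INTEGERCOL";
DbDataReader reader = cmd.ExecuteReader();
// Write the result set to the bulk load data file and its .xml configuration file.
CsvDataWriter csvWriter = new CsvDataWriter();
// The verbatim string (@"...") keeps the backslashes of the UNC path literal.
csvWriter.WriteToFile(@"\\NC1\net\Oracle\GBMAXTABLE\GBMAXTABLE.csv", reader);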
The bulk load data file GBMAXTABLE.csv contains the results of the query:
1,0x6263,"bc","bc"
2,0x636465,"cde","cde"
3,0x64656667,"defg","defg"
4,0x6566676869,"efghi","efghi"
5,0x666768696a6b,"fghijk","fghijk"
6,0x6768696a6b6c6d,"ghijklm","ghijklm"
7,0x68696a6b6c6d6e6f,"hijklmno","hijklmno"
8,0x696a6b6c6d6e6f7071,"ijklmnopq","ijklmnopq"
9,0x6a6b6c6d6e6f70717273,"jklmnopqrs","jklmnopqrs"
10,0x6b,"k","k"
The GBMAXTABLE.xml file, which provides the format of this bulk load data file, is described in the following section.
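To make the row layout shown above concrete, here is a minimal sketch that splits one row of the sample file and decodes its hexadecimal binary field using only the .NET base class library. It is not the CsvDataReader implementation; the class and method names are hypothetical, and the splitter assumes that no field contains an embedded comma, quote, or line break.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not part of any DataDirect assembly.
public static class BulkRowSketch
{
    // Decodes a "0x..." hexadecimal field into the raw bytes it represents.
    public static byte[] DecodeHex(string field)
    {
        if (!field.StartsWith("0x", StringComparison.Ordinal))
            throw new FormatException("Expected a 0x prefix: " + field);
        string hex = field.Substring(2);
        if (hex.Length % 2 != 0)
            throw new FormatException("Odd number of hex digits: " + field);
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = byte.Parse(hex.Substring(i * 2, 2),
                                  NumberStyles.HexNumber,
                                  CultureInfo.InvariantCulture);
        }
        return bytes;
    }

    // Splits one row on commas and strips the surrounding double quotes.
    // Assumption: no field contains an embedded comma, quote, or line break.
    public static string[] SplitSimpleRow(string line)
    {
        string[] fields = line.Split(',');
        for (int i = 0; i < fields.Length; i++)
            fields[i] = fields[i].Trim('"');
        return fields;
    }

    public static void Main()
    {
        string[] fields = SplitSimpleRow("2,0x636465,\"cde\",\"cde\"");
        byte[] binary = DecodeHex(fields[1]);
        // Prints: 2 3 cde cde
        Console.WriteLine(fields[0] + " " + binary.Length + " " +
                          Encoding.ASCII.GetString(binary) + " " + fields[2]);
    }
}
```

In the sample data, the bytes of the binary column happen to be the ASCII codes of the text in the character columns, which is why the decoded value prints as readable text.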
Bulk Load Configuration File
A bulk load configuration file is created when the CsvDataWriter.WriteToFile method is called. This file has the same name as the bulk load data file, but with an .xml extension.
The bulk load configuration file defines in its metadata the names and data types of the columns in the bulk load data file. The file defines these names and data types based on the table or result set created by the query that exported the data.
It also defines other data properties, such as length for character and binary data types, the character encoding code page for character types, precision and scale for numeric types, and nullability for all types.
Example
The format of the bulk load data file shown in the previous section is defined by the bulk load configuration file, GBMAXTABLE.xml. The file describes the data type and other information about each of the four columns in the table.
<?xml version="1.0" encoding="utf-8"?>
<table codepage="UTF-16LE" xsi:noNamespaceSchemaLocation="http://www.datadirect.com/ns/bulk/BulkData.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<column datatype="DECIMAL" precision="38" scale="0" nullable="false">INTEGERCOL</column>
<column datatype="VARBINARY" length="10" nullable="true">VARBINCOL</column>
<column datatype="VARCHAR" length="10" sourcecodepage="Windows-1252"
externalfilecodepage="Windows-1252" nullable="true">VCHARCOL</column>
<column datatype="VARCHAR" length="10" sourcecodepage="Windows-1252"
externalfilecodepage="Windows-1252" nullable="true">UNIVCHARCOL</column>
</row>
</table>
See “Sample Bulk Data Configuration File” for a more complex example of a bulk format data configuration file.
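Because the configuration file is plain XML, its column metadata can be inspected with standard .NET XML classes. The following sketch uses LINQ to XML on an abbreviated copy of the GBMAXTABLE.xml example (two columns, schema attributes omitted). The class and method names are hypothetical, and this is not how the data provider itself reads the file.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of any DataDirect assembly.
public static class BulkConfigSketch
{
    // Abbreviated copy of the configuration file shown in the example.
    public const string SampleXml = @"<table codepage=""UTF-16LE"">
  <row>
    <column datatype=""DECIMAL"" precision=""38"" scale=""0"" nullable=""false"">INTEGERCOL</column>
    <column datatype=""VARBINARY"" length=""10"" nullable=""true"">VARBINCOL</column>
  </row>
</table>";

    // Returns one summary line per <column> element.
    public static List<string> DescribeColumns(string xml)
    {
        var summaries = new List<string>();
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement column in doc.Root.Element("row").Elements("column"))
        {
            string name = column.Value.Trim();
            string dataType = (string)column.Attribute("datatype");
            string nullable = (string)column.Attribute("nullable");
            summaries.Add(name + ": " + dataType + ", nullable=" + nullable);
        }
        return summaries;
    }

    public static void Main()
    {
        // Prints:
        // INTEGERCOL: DECIMAL, nullable=false
        // VARBINCOL: VARBINARY, nullable=true
        foreach (string line in DescribeColumns(SampleXml))
            Console.WriteLine(line);
    }
}
```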
If the bulk data file cannot be created or does not comply with the schema described in the XML configuration file, an exception is thrown. See “XML Schema Definition for a Bulk Data Configuration File” for the use of an XML schema definition.
If a bulk load data file that does not have a configuration file is read, the following defaults are assumed:
The bulk load configuration file describes the bulk data file and is supported by an underlying XML Schema defined at:
http://www.datadirect.com/ns/bulk/BulkData.xsd.
Determining the Bulk Load Protocol
Bulk operations can be performed using a dedicated bulk protocol, that is, the data provider uses the protocol of the underlying database. In some cases, the dedicated bulk protocol is not available, for example, when the data to be loaded is in a data type that the dedicated bulk protocol does not support. In that case, the data provider automatically uses a non-bulk method such as array binding to perform the bulk operation, maintaining optimal application uptime.
For the Oracle data provider, you can set the Bulk Load Protocol connection string option to determine the behavior when the dedicated bulk load protocol is not available. For example, the data provider can use bulk load protocol and fail if bulk load protocol is not possible, or use only array binding for bulk loading.
Character Set Conversions
It is most performance-efficient to transfer data between databases that use the same character sets. At times, however, you might need to bulk load data between databases that use different character sets. You can do this by choosing a character set for the bulk load data file that can accommodate all data.
For the DataDirect Connect for ADO.NET data providers, the default source character data, that is, the output from the CsvDataReader and the input to the CsvDataWriter, is in Unicode (UTF-16) format. The source character data is always transliterated to the code page of the CSV file. If the threshold is exceeded and data is written to the external overflow file, the source character data is transliterated to the code page specified by the externalfilecodepage attribute defined in the bulk configuration XML schema (see “XML Schema Definition for a Bulk Data Configuration File”). If the configuration file does not define a value for externalfilecodepage, the CSV file code page is used. To avoid unnecessary transliteration, it's best for the CSV and external file character data to be stored in Unicode (UTF-16).
You might want your applications to store the data in another code page in one of the following scenarios:
The configuration file may optionally define a second code page for each character column. When character data exceeds the value defined by the CharacterThreshold property and is stored in a separate file (see “External Overflow File”), the value defines the code page for that file.
If the value is omitted or if the code page defined by the source column is unknown, the code page defined for the CSV file is used.
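When choosing a code page for the bulk load data file, it can help to test whether a candidate code page can represent all of the character data. The sketch below does this with the standard .NET Encoding class and an exception fallback. The class and method names are hypothetical, and the check is independent of the DataDirect classes.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of any DataDirect assembly.
public static class CodePageSketch
{
    // Returns true if every character in text can be represented in the
    // named encoding without substitution.
    public static bool CanEncode(string encodingName, string text)
    {
        // The exception fallbacks make the encoder throw instead of
        // silently replacing characters it cannot represent.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string sample = "d\u00E9j\u00E0";                    // contains non-ASCII letters
        Console.WriteLine(CanEncode("us-ascii", "cde"));     // True
        Console.WriteLine(CanEncode("us-ascii", sample));    // False
        Console.WriteLine(CanEncode("utf-16", sample));      // True
    }
}
```

A code page that passes this check for all character columns avoids data loss during transliteration. UTF-16 passes for any well-formed string, which is consistent with the recommendation to store the data in Unicode.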
External Overflow File
In addition to the bulk load data file, DataDirect Bulk Load can store bulk data in external overflow files. These overflow files are located in the same directory as the bulk load data file. Whether or not to use external overflow files is a performance consideration. For example, binary data is stored as hexadecimal-encoded character strings in the main bulk load data file, which increases the size of the file per unit of data stored. External files do not store binary data as hex character strings, and, therefore, require less space. On the other hand, more overhead is required to access external files than to access a single bulk load data file, so each bulk load situation must be considered individually.
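The space cost of hexadecimal encoding can be seen with a small calculation. The sketch below formats binary data in the 0x-prefixed style used in the sample data file and compares the sizes. The class and method names are hypothetical, and the formatting is only an approximation of what the CsvDataWriter produces.

```csharp
using System;

// Hypothetical helper for illustration; not part of any DataDirect assembly.
public static class HexSizeSketch
{
    // Formats bytes as a lowercase hexadecimal string with a 0x prefix.
    public static string ToHexField(byte[] data)
    {
        // BitConverter.ToString yields "63-64-65"; remove the dashes.
        return "0x" + BitConverter.ToString(data).Replace("-", "").ToLowerInvariant();
    }

    public static void Main()
    {
        byte[] small = { 0x63, 0x64, 0x65 };
        // Prints: 0x636465
        Console.WriteLine(ToHexField(small));

        byte[] large = new byte[1000];
        // Prints: 1000 bytes -> 2002 characters
        Console.WriteLine(large.Length + " bytes -> " +
                          ToHexField(large).Length + " characters");
    }
}
```

Each byte becomes two characters, so hex-encoded binary data takes at least twice the space of the raw bytes, and more if the CSV file uses a multi-byte character encoding such as UTF-16.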
If the size of the data exceeds the value of the BinaryThreshold or CharacterThreshold property of the CsvDataWriter object, separate overflow files are generated to store the binary or character data.
If the overflow file contains character data, the character set of the file is governed by the character set specified in the CSV bulk configuration file.
The overflow filename consists of the CSV filename, a six-digit increment, and a ".lob" extension (for example, CSV_filename_nnnnnn.lob). Increments start at _000001.lob. These files exist in the same location as the CSV file.
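The naming pattern can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and the sketch assumes that the ".csv" extension is dropped before the increment is appended, which the text above does not state explicitly.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of any DataDirect assembly.
public static class OverflowNameSketch
{
    // Builds an overflow file path next to the CSV file.
    // Assumption: the .csv extension is removed before the increment is added.
    public static string OverflowPath(string csvPath, int increment)
    {
        string directory = Path.GetDirectoryName(csvPath) ?? "";
        string stem = Path.GetFileNameWithoutExtension(csvPath);
        // "D6" pads the increment with leading zeros to six digits.
        return Path.Combine(directory, stem + "_" + increment.ToString("D6") + ".lob");
    }

    public static void Main()
    {
        // Prints: GBMAXTABLE_000001.lob
        Console.WriteLine(OverflowPath("GBMAXTABLE.csv", 1));
        // Prints: GBMAXTABLE_000002.lob
        Console.WriteLine(OverflowPath("GBMAXTABLE.csv", 2));
    }
}
```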
Bulk Copy Operations and Transactions
By default, bulk copy operations are performed as isolated operations and are not part of a transaction. This means there is no opportunity for rolling the operation back if an error occurs.
Some database servers, such as Oracle and DB2, allow bulk copy operations to take place within an existing transaction. You can define the bulk copy operation to be part of a transaction that occurs in multiple steps. Using this approach enables you to perform more than one bulk copy operation within the same transaction, and commit or roll back the entire transaction.
Refer to the Microsoft online help topic "Transaction and Bulk Copy Operations (ADO.NET)" for information about rolling back all or part of the bulk copy operation when an error occurs.
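The transaction pattern described in that Microsoft topic looks roughly like the sketch below. It uses the Microsoft SqlBulkCopy class for SQL Server rather than a DataDirect provider-specific class, because the DataDirect class names and constructors are not shown in this section; treat it as an illustration of the pattern, not of the DataDirect API. It assumes the System.Data.SqlClient assembly (or package) is referenced. The class name, method name, and parameters are hypothetical.

```csharp
using System.Data;
using System.Data.SqlClient;

// Illustrates the SqlBulkCopy transaction pattern with Microsoft's classes.
// Hypothetical wrapper; the DataDirect equivalents are not shown here.
public static class BulkTransactionSketch
{
    public static void CopyInTransaction(string destinationConnectionString,
                                         IDataReader source,
                                         string destinationTable)
    {
        using (var connection = new SqlConnection(destinationConnectionString))
        {
            connection.Open();
            using (SqlTransaction transaction = connection.BeginTransaction())
            {
                try
                {
                    // Enlist the bulk copy in the existing transaction.
                    using (var bulkCopy = new SqlBulkCopy(
                        connection, SqlBulkCopyOptions.Default, transaction))
                    {
                        bulkCopy.DestinationTableName = destinationTable;
                        bulkCopy.WriteToServer(source);
                    }
                    // Further bulk copies or commands could run here
                    // before the whole unit of work is committed.
                    transaction.Commit();
                }
                catch
                {
                    // Undo every row written by the bulk copy.
                    transaction.Rollback();
                    throw;
                }
            }
        }
    }
}
```

Because the bulk copy is enlisted in the transaction, an exception during WriteToServer leaves the destination table unchanged once the rollback completes.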