This chapter describes the methods of transforming data that the Data Warehouse Center provides. It describes how to use SQL steps and warehouse transformers to transform data.
In many cases, there are multiple ways that you can transform data.
For example, if you want to clean data, you have the following choices:
Table 16. Methods of cleaning data
Method | Description | For more information, see: |
---|---|---|
WHERE clause | Build an SQL WHERE clause to limit the rows that you extract from the source table. | Filtering data |
Formulas and expressions | Use formulas and expressions to eliminate the data that you do not need and to create the data that you do need. Use the Expression window in SQL Assist to specify formulas, constants, and tokens. | Adding calculated columns |
Clean Data transformer | Use the Clean Data transformer to perform rules-based find and replace operations on a table. | Cleaning data |
Warehouse programs | Use warehouse programs to perform any functions on the source data that are not available from the previous methods. | Chapter 11, Extending the Data Warehouse Center with user-defined programs |
If you select multiple sources, you also need to specify how the rows of these sources are to be associated in the join of the tables. You can use SQL Assist to define the join.
You must link the sources to the step before you define the join.
To join source tables:
SQL Assist opens.
The column moves from the Available columns list to the Selected columns list.
Click >> to add all the columns from the Available columns list to the Selected columns list.
If the columns have compatible data types, a gray line is displayed, connecting the columns, and the Join button is available.
If the columns do not have compatible data types, an error message is displayed in the status area at the bottom of the window.
SQL Assist draws a red line between the selected columns, which indicates that the tables are joined on that column.
To remove a join:
SQL Assist closes.
Tip: The source tables must exist before you can use the Test push button on the SQL Statement page. If you specified that the Data Warehouse Center is to create the tables, you must promote the steps that link to those tables as target tables to test mode so that the tables are created.
The Step Properties notebook closes.
In many production environments, the source data includes encoded information. For example, you might use codes to refer to part numbers throughout your database. You might also have a table that relates the part numbers to serial numbers and descriptions of each part. In these situations, you want the information in your warehouse to include the name and description of each part. To do this, you must combine the decoding table with the source data that contains the encoded part numbers.
First, you need to define the decoding table and the encoded part numbers table as part of a warehouse source. Then, select those tables as source tables for a step. You then click Join on the Joins page of SQL Assist to join the tables.
Another approach is to use a CASE statement to decode the data. For example, data in a month column is coded numerically, and you want to transform the data to character strings that contain the abbreviation of the month. You use the following CASE expression:
CASE TBC.ORDER_HISTORY.ORDERMONTH
   WHEN 1 THEN 'Jan'
   WHEN 2 THEN 'Feb'
   WHEN 3 THEN 'Mar'
   WHEN 4 THEN 'Apr'
   WHEN 5 THEN 'May'
   WHEN 6 THEN 'Jun'
   WHEN 7 THEN 'Jul'
   WHEN 8 THEN 'Aug'
   WHEN 9 THEN 'Sep'
   WHEN 10 THEN 'Oct'
   WHEN 11 THEN 'Nov'
   WHEN 12 THEN 'Dec'
END
By default, a join is assumed to be an inner join. You can also request other types of joins by clicking Join Type on the Joins page of SQL Assist. The following types of joins are available:
- Left outer join
- Right outer join
- Full outer join
- Star join
If your database supports the OUTER JOIN keywords, you can extend the inner join to add rows from one table that have no matching rows in the other table.
For example, you want to join two tables to get the last name of the manager for each department. The first table is a Department table that lists the employee number of each department manager. The second table is an Employee table that lists the employee number and last name of each employee. However, some departments do not have a manager; in these cases, the employee number of the department manager is null. To include all departments, regardless of whether they have a manager, along with the last name of the manager if one exists, you generate a left outer join. The left outer join includes every row of the first table; rows that have no match in the second table receive nulls for the columns of the second table. The resulting SQL statement is as follows:
SELECT DEPTNO, DEPTNAME, EMPNO, LASTNAME
FROM DEPARTMENT LEFT OUTER JOIN EMPLOYEE
ON MGRNO = EMPNO
A right outer join is the same as a left outer join, except that it includes every row of the second table; rows that have no match in the first table receive nulls for the columns of the first table. A full outer join includes the matching rows plus the unmatched rows from both tables.
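For comparison, here are minimal sketches of the corresponding right outer join and full outer join for the same Department and Employee example:

SELECT DEPTNO, DEPTNAME, EMPNO, LASTNAME
FROM DEPARTMENT RIGHT OUTER JOIN EMPLOYEE
ON MGRNO = EMPNO

SELECT DEPTNO, DEPTNAME, EMPNO, LASTNAME
FROM DEPARTMENT FULL OUTER JOIN EMPLOYEE
ON MGRNO = EMPNO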
For example, you have two tables, Table 1 and Table 2, with the following data:

Table 1:

Column A | Column B |
---|---|
1 | A |
2 | B |
3 | C |

Table 2:

Column C | Column D |
---|---|
2 | X |
4 | 2 |
You specify a join condition of Column A = Column C. The result tables for the different types of joins are as follows:
Inner join:

Column A | Column B | Column C | Column D |
---|---|---|---|
2 | B | 2 | X |

Left outer join:

Column A | Column B | Column C | Column D |
---|---|---|---|
1 | A | null | null |
2 | B | 2 | X |
3 | C | null | null |

Right outer join:

Column A | Column B | Column C | Column D |
---|---|---|---|
2 | B | 2 | X |
null | null | 4 | 2 |

Full outer join:

Column A | Column B | Column C | Column D |
---|---|---|---|
1 | A | null | null |
2 | B | 2 | X |
3 | C | null | null |
null | null | 4 | 2 |
If you specify VALUE(A, C) in the select list of the full outer join, you obtain a single column with the following result:

VALUE(A, C) |
---|
1 |
2 |
3 |
4 |
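A statement such as the following produces that result, assuming the sample tables are named TABLE1 and TABLE2 with columns COLA, COLB, COLC, and COLD (the names are illustrative; VALUE returns the first non-null argument):

SELECT VALUE(COLA, COLC)
FROM TABLE1 FULL OUTER JOIN TABLE2
ON COLA = COLC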
You can generate a star join, which is a join of source tables that are defined in a star schema. A star schema is a specialized design that consists of the following types of tables:
- Dimension tables, which describe the aspects of a business (in the following example, customers, books, catalogs, and fiscal years)
- A fact table, which contains the detailed facts that relate the dimension tables to one another
For example, if you have a mail-order business that sells books, some dimension tables are Customers, Books, Catalogs, and Fiscal_Years. The fact table contains information about the books that were ordered from each catalog by each customer during the fiscal year.
Each dimension table contains a primary key, which is one or more columns that you select to identify a row in the table. The fact table contains foreign keys that correspond to the primary keys in the dimension tables. A foreign key is a column in a table whose allowable values must exist as primary key values in another table.
When you request a star join, the Data Warehouse Center joins the primary keys of the dimension tables with the foreign keys of the fact table. In the previous example, the Customers table has a primary key of Customer Number, and the Books table has a primary key of Book Number (ISBN). Each order in the fact table contains the foreign keys Customer Number and Book Number. The star join combines information about the customers and books with the orders.
For information about defining primary and foreign keys in the Data Warehouse Center, see Defining a warehouse target. For information about defining a star schema and exporting it to OLAP Integration Server, see Chapter 13, Creating a star schema from within the Data Warehouse Center.
In most cases, when you create a step, you want only a subset of the source data. You might want to extract only the rows that meet certain criteria. You can use the Data Warehouse Center to build an SQL WHERE clause to limit the rows that you extract from the source table.
For example, you can define a step that selects rows from the most recent edition of the source table:
WHERE TBC.ORDER_HISTORY.RUN_ID = &cur_edtn.IWHDATA.TBC.ORDER_HISTORY
The RUN_ID column contains information about the step edition. The &cur_edtn token represents the current step edition. Therefore, this WHERE clause selects rows in which the step edition equals the current edition.
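For context, a minimal sketch of a generated SELECT statement that uses this clause (the column list is illustrative):

SELECT ORDERMONTH, RUN_ID
FROM TBC.ORDER_HISTORY
WHERE TBC.ORDER_HISTORY.RUN_ID = &cur_edtn.IWHDATA.TBC.ORDER_HISTORY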
To build the WHERE clause, use the Conditions page of SQL Assist.
To exclude duplicate rows, select the Exclude duplicate rows (SELECT DISTINCT) check box. This action eliminates all but one of each set of duplicate rows in the result set.
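For example, with this check box selected, the generated statement takes a form such as the following (the table and columns are from the earlier Department example):

SELECT DISTINCT DEPTNO, DEPTNAME
FROM DEPARTMENT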
To specify search conditions:
SQL Assist opens.
If you selected the Between operator in the Operators list, you must enter two values per line in the Values list. The values are separated by an ampersand (&).
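For example, entering 10000 & 20000 on a line for a SALARY column (the column name is illustrative) produces a search condition such as:

SALARY BETWEEN 10000 AND 20000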
You can search for appropriate values by clicking Find.
You can specify host variables in the Values list. To do this, click on a line in the Values list, then click Add Variable. The Add Variable window opens. Type the variable name, then click OK. The Add Variable window closes, and the variable is added to the Values list on the Conditions page.
You can specify parameters in the Values list. If a parameter is specified, its value is used in the search condition. A parameter is specified in the format :parm, where parm is the parameter name. For example, :empid is a valid specification for a parameter named empid.
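For example, a search condition on an EMPID column (the column name is illustrative) that uses this parameter might appear as follows; the value of empid is substituted when the step runs:

EMPID = :empid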
You can edit the text in the Conditions field. You can remove search conditions by highlighting the portion of the condition that you want to remove in the Conditions field, and pressing the Delete key on your keyboard.
To build complex conditions, click Advanced Expression. The Expression Builder - Conditions window opens. For help on using the Expression Builder, see "Building expressions" in the online help.
You can also define columns that are calculated from the values of other columns. For example, suppose that you need only the month in which an item was ordered. You can use the SQL DATE function to convert the order date to the DATE data type. Then, you can use the MONTH function to return the month portion of the date. The SQL expression for the calculated column is as follows:
MONTH(DATE(TBC.ORDERS_MONTH.OrderDate))
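For context, a sketch of this expression used as a calculated column in a SELECT statement (the additional OrderKey column is illustrative):

SELECT TBC.ORDERS_MONTH.OrderKey,
       MONTH(DATE(TBC.ORDERS_MONTH.OrderDate)) AS OrderMonth
FROM TBC.ORDERS_MONTH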
You can also use calculated columns to summarize data. In many situations, your source data contains far more detail than you want to replicate into your warehouse. All you need from the source data is a summary of some kind. You might need an average, a sum, or a count of the elements in the source database, but you do not need all the data.
The Data Warehouse Center lets you easily and accurately define steps that are summaries of source data. You can use standard SQL aggregation functions (AVG, COUNT, MAX, MIN, and SUM) and the SQL GROUP BY clause to create steps that summarize the source data.
Summary steps reduce the load on the network. They perform the aggregations on the source data before replicating the data across the network. You can also create composite steps that use summary techniques to summarize other steps. Summarizing reduces the size of the target warehouse that you create.
To create a step with this type of summary, click the SUM function in the Functions field of the Expression Builder window of SQL Assist.
For example, a step summarizes all the items that are sold in a month and expresses the amount in thousands of dollars:
SUM(TBC.ITEMS_MONTH.Amount)/1000
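For context, a sketch of a summary statement that uses this expression with a GROUP BY clause to produce one row per month (the grouping column OrderMonth is illustrative):

SELECT OrderMonth,
       SUM(TBC.ITEMS_MONTH.Amount)/1000 AS AmountInThousands
FROM TBC.ITEMS_MONTH
GROUP BY OrderMonth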
To define a calculated column:
SQL Assist opens.
The Expression Builder window opens.
You can type your expression in the Expression field, or you can use the fields and controls in the Expression Builder to build your expression. To build an expression:
Use warehouse transformers to perform the following basic data transformations:
- Clean data
- Generate key tables
- Generate period tables
- Invert data
- Pivot data
- Format dates
Use the Clean Data transformer to perform rules-based find and replace operations on a table. The transformer finds values that you specify in the data columns of the source table that your step accesses. Then the transformer updates corresponding columns with replacement values that you specify in the table to which your step writes. You can select multiple columns from the input table to carry over to the output table. The Clean Data transformer does not define rules or parameters for the carry over columns.
Use the Clean Data transformer to clean and standardize data values after load or import, as part of a process. Do not use this transformer as a general-purpose data column editor.
You can use the Clean Data transformer to perform the following tasks:
You can use the Clean Data transformer only if your source table and target table are in the same database. The source table must be a single warehouse table. The target table is the default target table.
You can choose to ignore case and white space when locating strings, and you can specify a tolerance value for numeric data.
You can make changes to the step only when the step is in development mode.
Each clean transformation that you specify uses one of four clean types:
- Find and replace
- Discretize
- Clip
- Carry over
Prerequisite: Before you can use the Clean Data transformer, you must create a rules table for your clean type. A rules table designates the values that the Clean Data transformer will use during the find and replace process. The rules table must be in the same database as the source table and target table.
A rules table must contain at least two columns. One column contains find values. The other column contains replace values. The rows in each column correspond to each other.
For example, Column 1 and Column 2 in a rules table have the values shown here:

Column 1 | Column 2 |
---|---|
Desk | Chair |
Table | Lamp |
Suppose that Column 1 contains the find values, and Column 2 contains the replace values. When you run the step, the Clean Data transformer searches your source column for the value Desk. Wherever it finds the value Desk, it writes the value Chair in the corresponding field of the target column.
The Clean Data transformer copies values that are not listed in the find column directly to the target table. In the example, the value Stool is not listed in the column that contains the find values. If the selected source column contains the value Stool, the Clean Data transformer writes Stool to the corresponding field in the target column.
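A minimal sketch of such a rules table, using illustrative table and column names (the Data Warehouse Center does not require these particular names):

CREATE TABLE TBC.CLEAN_RULES
   (FIND_VALUE    VARCHAR(30),
    REPLACE_VALUE VARCHAR(30));
INSERT INTO TBC.CLEAN_RULES VALUES ('Desk', 'Chair');
INSERT INTO TBC.CLEAN_RULES VALUES ('Table', 'Lamp');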
The following table describes the columns that must be included in the rules table for each clean type:

Clean type | Minimum number of columns in rules table | Purpose of columns |
---|---|---|
Find and replace | 2 - 3 | The first column contains find values. The second column contains replace values. The third column, if specified, contains rule-ordering information and is not required. |
Discretize | 3 - 4 | The first column contains the low end of the find value range. The second column contains the high end of the find value range. The third column contains the replacement value. The fourth column, if specified, contains rule-ordering information and is not required. |
Clip | 3 - 5 | The first column contains the low end of the find value range. The second column contains the high end of the find value range. The third column contains the low end of the replacement value. The fourth column contains the high end of the replacement value. The fifth column, if specified, contains rule-ordering information and is not required. Tip: When you use the Clip clean type, you can run a find and replace procedure that searches outside of value ranges. |
Carry over | None allowed | Rules table not used. |
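For example, a minimal sketch of a Discretize rules table (the table and column names are illustrative) that replaces amounts from 0 through 99 with 1 and amounts from 100 through 999 with 2:

CREATE TABLE TBC.DISCRETIZE_RULES
   (LOW_VALUE     DECIMAL(9,2),
    HIGH_VALUE    DECIMAL(9,2),
    REPLACE_VALUE DECIMAL(9,2));
INSERT INTO TBC.DISCRETIZE_RULES VALUES (0, 99, 1);
INSERT INTO TBC.DISCRETIZE_RULES VALUES (100, 999, 2);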
You can reorder the output columns using the Step notebook. You can change column names on the Column Mapping page of the Step notebook.
To define a Clean Data transformer:
To copy your input column to the target table without making any changes, click Carry Over. The Rules list and Parameter list are not available. Proceed to step 7.
Restrictions:
Note: You can use character string or numeric data types. The comparison is performed in the locale-specific order for the character set that you are using.
Note: You can use the Clip clean type for numeric data types only. To avoid lower or upper bound clipping, specify the database null value in the rules table. The database null value can also be used as the replacement value.
Use the controls in the String Parameters window to customize how the Clean Data transformer performs the find and replace tasks. To specify information about string parameters:
Use the controls in the Numeric Parameters window to specify a numeric tolerance for your transformation. The default value is 0.0. A numeric tolerance is a number that you apply to numeric find values. The numeric tolerance allows the replacement of a value that is just outside a find value or a range of find values. For example, suppose that you use the Discretize clean type to replace values in the find range 23.50 - 24.50 with 1000. If you specify a tolerance of 0.10, then input values of 23.40 and 24.60 are also replaced by 1000.
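In effect, the tolerance widens each find range, as in this illustrative predicate (the column name INPUT_VALUE is hypothetical):

INPUT_VALUE BETWEEN (23.50 - 0.10) AND (24.50 + 0.10)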
To specify a numeric tolerance:
Use the Generate Key Table transformer to add a unique key to a warehouse table.
You can change values in the key column in two ways:
- Update values in the key column based on a column in a selected step
- Replace all of the key column values in the existing key column
The Generate Key Table transformer uses a warehouse target table as a source. The transformer writes to a table on the warehouse target. Before you define this step, link the warehouse target to the step in the Process Model window, with the arrow pointing towards the step. You can make changes to the step only when the step is in development mode.
Column mapping is not available for this step.
To define a Generate Key table transformer:
To update values in the key column based on a column in a selected step:
To replace all key column values in the existing key column:
Use the Generate Period Table transformer to create a period table that contains columns of date information that you can use when evaluating other data, such as determining sales volume within a certain period of time.
You can use a period table in the following ways:
The Generate Period Table transformer works only on target tables. To use the transformer successfully, you must connect the transformer to a target.
You can make changes to the step definition only when the step is in development mode.
To define a Generate Period Table transformer:
Use the Create Rows field as an alternative to setting an end date or time for your period table. In this field, you enter the number of period rows that the transformer will generate in the Period Table. This field is optional.
If you decide not to use this field, then the following conditions are true:
If you do not enter a value in the Create Rows field, you must enter a value or values in the Date and Time fields in the End section.
In these fields, you specify the date or time (or both) that you want to be the last date or time value entered into the period table, based on the following conditions:
Column | Data Type | Notes |
---|---|---|
Julian Day | Integer | Dates start at January 1, 4712 B.C. |
Hour in Day | Integer | 1 - 24 |
Day in Week | Integer | 1 - 7 (varies by locale; default locale used) |
Day in Month | Integer | 1 - 31 |
Day in Year | Integer | 1 - 366 |
Week in Month | Integer | 1 - 5 (weeks in the month begin on Sundays) |
Week in Year | Integer | 1 - 53 |
Month in Year | Integer | 1 - 12 |
Quarter in Year | Integer | 1 - 4 |
Year | Integer | nnnn (four-digit year) |
Day Name | Varchar | Default locale used |
Month Name | Varchar | Default locale used |
Sequence Number | Integer | This column requires your input in the Sequence number section. If you select this option, the Sequence number controls become available. |
Periodic Number | Integer | This column requires your input in the Periodic number section. If you select this option, the Periodic number controls become available. |
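A minimal sketch of the kind of table that results when a subset of these columns is selected (the table and column names are illustrative; the transformer creates the columns that you choose):

CREATE TABLE TBC.PERIOD_TABLE
   (JULIAN_DAY      INTEGER,
    DAY_IN_WEEK     INTEGER,
    DAY_IN_MONTH    INTEGER,
    MONTH_IN_YEAR   INTEGER,
    QUARTER_IN_YEAR INTEGER,
    CALENDAR_YEAR   INTEGER,
    DAY_NAME        VARCHAR(20),
    MONTH_NAME      VARCHAR(20));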
Use the Invert Data transformer to invert the order of the rows and columns in a table. When you use the Invert Data transformer, the rows in the source table become columns in the output table, and the columns in the input table become rows in the output table. The order of data among the columns, from top to bottom, is maintained and placed in rows, from left to right.
For example, consider the input table as a matrix. This transformer swaps the data in the table around a diagonal line that extends from the upper left of the table to the lower right. Then the transformer writes the transformed data to the target table.
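An illustrative example, assuming that no pivot column is selected, so that output column names are generated from the column numbers as described later in this section (the source column names are illustrative):

Source table:

COL1 | COL2 | COL3 |
---|---|---|
10 | 20 | 30 |
40 | 50 | 60 |

Inverted target table:

1 | 2 |
---|---|
10 | 40 |
20 | 50 |
30 | 60 |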
You can specify an additional column that contains ordinal data that starts at the number 1. This column helps you identify the rows after the transformer inverts the table.
You can also specify a column in the source table to be used as column names in the output table. This column is called the pivot column.
Columnar data in each pivot group must be either the same data type or data types that are related to each other through automatic promotion. For more information about automatic promotion of data types, see the IBM DB2 Universal Database: SQL Reference.
Prerequisite: Before you begin this task, you must connect a source table from the warehouse database to the step. You can also specify a target table that the step will write to, or you can designate that the step creates the target table.
The Invert Data transformer drops the existing database table and recreates it during each run. Each time that you run a step using this transformer, the existing data is replaced, but the table space and table index names are preserved.
A step that uses the Invert Data transformer must be promoted to production mode before you can see the actual data that is produced.
This step does not use the Column Mapping page.
To define an Invert Data transformer:
Each value in the pivot column must be fewer than 18 characters. Alphabetic characters are changed to uppercase on DB2 systems that require object names in uppercase.
If you do not select a column, the transformer assumes that the table contains raw data only, all of which it should invert. The transformer then creates column names for the target table based on the column number: the first column is named 1, the second column is named 2, and so on.
Use the Pivot Data transformer to group related data from selected columns in the source table, which are called pivot columns, into a single column, called a pivot group column, in the target table. You can create more than one pivot group column.
You can select multiple columns from the source table to carry over to the output table. The data in these columns is not changed by the Pivot Data transformer.
You can specify an additional column that contains ordinal data that starts at the number 1. This column helps you identify the rows after the transformer pivots the table.
Columnar data in each pivot group must be either the same data type or data types that are related to each other through automatic promotion. For more information about automatic promotion of data types, see the IBM DB2 Universal Database: SQL Reference.
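An illustrative example of the kind of reshaping that the Pivot Data transformer performs, assuming that DEPT is carried over and that the pivot columns Q1 and Q2 form a single pivot group named AMOUNT (all names are illustrative):

Source table:

DEPT | Q1 | Q2 |
---|---|---|
A | 10 | 20 |
B | 30 | 40 |

Target table:

DEPT | AMOUNT |
---|---|
A | 10 |
A | 20 |
B | 30 |
B | 40 |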
Before you begin this task, connect a warehouse source table to the step in the Process Model window. The Pivot Data transformer uses an existing target table in the same database or creates a target table in the same database that contains the warehouse source. You can change the step definition only when the step is in development mode.
To define a Pivot Data transformer:
Use the table in this window to define your pivot groups. The number of rows that are displayed in this table depends on the number of pivot columns that you selected and the number of pivot groups that you specified on the Parameters page. For example, if you selected 6 pivot columns and 2 pivot groups, 3 rows are displayed.
If you make changes to the selected pivot columns in the Pivot Data Transformer window, the data group and pivot group values that you specified in this window will be reset to blank.
Use the FormatDate transformer to change the format of a date field in your source table that your step is copying to the default target table. You can run this transformer with any other transformer or warehouse program.
The FormatDate transformer provides several standard date formats that you can specify for the input and output columns. If a date in the input column does not match the specified format, the transformer writes a null value to the output table.
If the format that you want to use is not displayed in the Format list, you can type a format in the Format string field of the transformer window. For example, type MMM D, YY if the dates in your input column have a structure like Mar 2, 96 or Jul 15, 83.
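For example, the format string MMM D, YY matches the sample values in the following way:

Mar 2, 96  : MMM = Mar, D = 2, YY = 96
Jul 15, 83 : MMM = Jul, D = 15, YY = 83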
The data type of the column that is selected in the Input column field determines the data type for the output column. For example:
To use the FormatDate transformer:
The Expression Builder opens.
The Function Arguments - FormatDate window opens.
The Function Arguments - FormatDate window closes, and the expression is displayed in the Expression Builder window.