Data Warehouse Center Administration Guide

Chapter 8. Calculating statistics

Use statistical transformers to perform the following statistical functions:


ANOVA transformer

Use the Analysis of Variance (ANOVA) transformer to produce statistical calculations in two tables based on a small number of parameters. There are three types of ANOVA: one-way ANOVA, two-way ANOVA, and three-way ANOVA.

For more information about the ANOVA types, see the online help.

The ANOVA transformer obtains two independent estimates of variance. The first estimate is based on variability between groups. The second estimate is based on variability within groups. After the ANOVA transformer computes these estimates, it calculates their ratio. A family of distributions, the Fisher-F distributions, describes the significance of this ratio.

This transformer also calculates a p-value. The p-value is the probability that the means of the two groups are equal. A small p-value leads to the conclusion that the means are different. For example, a p-value of 0.02 means that there is a 2% chance that the sample means are equal. Likewise, a large p-value leads to the conclusion that the means of the two groups are not different.

You can use this step only with tables that exist in the same database. Use a warehouse source or target table as a source for the ANOVA transformer and up to two warehouse target tables as targets for the ANOVA statistical calculations. If you do not want to select a target table for the ANOVA transformation, you can specify that the ANOVA transformer creates tables on the target database. The Parameters page will not be available for this step subtype until you link the step to a source in the Process Model window.


Figure db2db037 not displayed.

Each time you run a step using this transformer, the existing data is replaced. The ANOVA transformer drops the existing database table and recreates it during each run.

You can make changes to the step only when the step is in development mode.

To define an ANOVA transformer:

  1. Open the step notebook for the ANOVA transformer.
  2. Specify general information about the transformer. For more information, see Providing general information about a step.
  3. On the Parameters page, select the column or columns from the Available Columns field that will be the Grouping columns for the ANOVA transformer. Depending on the type of ANOVA transformer that you want to create, select one, two, or three columns, and click > next to the Grouping columns field.


    Figure db2db038 not displayed.

  4. From the Available Columns field, click a column name, and click > next to the Data column field. This column cannot be used as a grouping column. This column must contain numeric data.
  5. Select target tables for your transformer:
  6. Optional: On the Column Mapping page, you can view the mappings between the output columns that result from the transformations that you defined on the Parameters page and the columns on your target table. You cannot change these mappings. If the output table is not used by any steps that are in test or production mode, you can rename target columns. To rename a target column, double-click the column name and type the new name.
  7. On the Processing Options page, provide information about how your step processes. For more information, see Defining processing options.
  8. Click OK to save your changes and close the step notebook.

Calculate Statistics transformer

Use the Calculate Statistics transformer to calculate the following descriptive statistics on any number of data columns from a single table:

To use the Calculate Statistics transformer, connect the step to a warehouse source and a warehouse target that exist in the same database. Or, you can connect the step to a warehouse source and specify that the step create a target table for you in the same database.


Figure db2db039 not displayed.

You can make changes to the step only when the step is in development mode.

To define a Calculate Statistics transformer step:

  1. Open the step notebook for the Calculate statistics transformer.
  2. Specify general information about the transformer. For more information, see Providing general information about a step.
  3. Click the Parameters tab.


    Figure db2db040 not displayed.

  4. Optional: From the Available columns list, select any columns that you want to use as grouping columns, and click >. Grouping columns can contain character or numeric data.
  5. Define statistics calculations:
    1. Right-click anywhere on the Transformer definition table, and click Add. A row is added to the table.
    2. Under the Data column heading in the row that you just created, click and select a column for which to you want to calculate statistics.
    3. Double-click the Statistics field. The ... push button is available.
    4. Click the ... push button. The Calculate Statistics - Select Statistics window opens.


      Figure db2db041 not displayed.

    5. From the Available statistics list, click one or more statistics to run on the selected column. Then, click >. The type of statistics that you can select from will depend on the data type of the input column. For character data types, only the Count function is available.
    6. Repeat steps 5a-e, as appropriate.
    7. Click OK. The Elementary Statistics - Select Statistics window closes.

      After you define a statistics calculation, it cannot be changed. You must delete unwanted definitions and redefine them. The Calculate Statistics transformer supports partial data. For example, if you select a column to define statistics for, but you don't select the statistics for it, the Calculate Statistics transformer will save your column selection. However, you cannot map columns for a row that has a partial data selection, nor can you successfully run a step that has a partial data selection.

  6. On the Column Mapping page, map the output columns that result from your statistical calculations to columns in your target table. The column names for your statistical calculations are based on the data column that you select on the Parameters page and the statistic that you select for it. A column is created for each statistic that is selected for a data column. For example, if your data column, Sales, has the statistics "Sum" and "Average" defined to it, the columns Sales_sum and Sales_average will be displayed on the Column Mappings page.

    Output columns are listed on the left side of the page, under the heading Source Columns. Target columns from the output table linked to the step are listed on the right side of the page.

    Use the Column Mapping page to perform the following tasks:

    If the Parameters page produces no output columns, or if this step is not linked to a target table and you have not specified automatic generation of a default table in the Parameters page, you will not be able to use this page to map your columns. Some steps will not allow you to change the column mapping.

  7. On the Processing Options page, provide information about how your step processes. For more information, see Defining processing options.
  8. Click OK to save your changes and close the step notebook.

Calculate Subtotals transformer

Use the Calculate subtotals transformer to calculate the running subtotal for a set of numeric values grouped by a period of time, either weekly, semimonthly, monthly, quarterly, or annually. For example, for accounting purposes, it is often necessary to produce subtotals of numeric values for basic periods of time. This is most frequently encountered in payroll calculations where companies are required to produce month-to-date and year-to-date subtotals for various types of payroll data.

The Calculate Subtotals transformer uses a warehouse target table as a source. The table that you use as a source must contain a primary key. The transformer writes to a table on the same database. Before you define this step, link the warehouse target to the step in the Process Model window, with the arrow pointing towards the step. The Parameters page will not be available for this step subtype until you link it to a source in the Process Model window. You can make changes to the step only when the step is in development mode.


Figure db2db042 not displayed.

To define a Calculate Subtotals transformer:

  1. Open the step notebook for the Calculate subtotals transformer.
  2. Specify general information about the transformer. For more information, see Providing general information about a step.
  3. Click the Parameters tab.


    Figure db2db043 not displayed.

  4. Optional: Click columns from the Available columns list to order the output data. Then, click > next to the Order by columns list.
  5. From the Available columns list, click the date column in the input table. The column data type must be DATE or TIMESTAMP. Then, click > next to the Date column field.
  6. Define a subtotal calculation:
    1. Right-click anywhere on the Transformer definition table, and click Add. A row is added to the table.
    2. Under the Input column heading in the row you just added, left-click and select the name of the column on which to calculate a subtotal. Only available numeric columns are listed. A numeric column is not listed if it is being written to by another row or if it is selected as a grouping column. If you use a column as an input in two or more rows, you cannot use the column as an output column.
    3. Under the Period column heading, left-click and select the period for which to calculate.
    4. Repeat steps 6a-c to define additional subtotal calculations.
  7. Use the Column Mapping page to map output columns from your transformer to columns on your target table.

    Columns that result from your transformation are listed on the left side of the page, under the heading Source Columns. These columns are named after the selections you made for the Input column and the Period column on the Parameters page. For example, if you selected SALARY as the input column and Month for the period on the Parameters page, you will see "SALARY_Month" displayed on the left side of the Column Mapping page. Columns in your target table are listed on the right side of the page.

    Restrictions:

    The following restrictions apply to a single instance of a step. For example, you have step 1 and step 2. In step 1, you can use column B as a source column. In step 2, you can use column B as a target column.

    Use the Column Mapping page to perform the following tasks:

    If the Parameters page produces no output columns, or if this step is not linked to a target table and you have not specified automatic generation of a default table in the Parameters page, you will not be able to use this page to map your columns. Some steps will not allow you to change the column mapping.

  8. On the Processing Options page, provide information about how your step processes. For more information, see Defining processing options.
  9. Click OK to save your changes and close the step notebook.

Chi-square transformer

Use the Chi-Square transformer to perform the chi-square test and the chi-square goodness-of-fit test on columns of numerical data. These tests are nonparametric tests.

You can use the statistical results of these tests to make the following determinations:

Use these tests with small sample sizes or when the variables that you are considering might not be normally distributed. Both the chi-square test and the chi-square goodness-of-fit test make the best use of data that cannot be precisely measured.

The Chi-square transformer optionally produces an additional output table called the Expected value output table. You can select a table to be used as the Expected value output table, or you can specify that the table will not be produced.

When you set up this process in the Process Model window, link the Chi-Square step to a warehouse target table. If you want the step to produce the Expected value output table, link the step to a second warehouse target table in the same database.


Figure db2db044 not displayed.

You can change the step definition only when the step is in development mode.

To define a Chi-square transformer:

  1. Open the step notebook.
  2. Specify general information about the transformer. For more information, see Providing general information about a step.
  3. In the Available columns list on the Parameters page, click a column. Then, click > next to the Column of row definition field. This field is required for both goodness-of-fit calculation and chi-square calculations. If you want your step to run as a chi-square calculation, go to step 4. Otherwise, go to step 5. Your step will run as a goodness-of-fit calculation.


    Figure db2db045 not displayed.

  4. To define a chi-square calculation, click a column in the Available columns list, and click > next to the Column of column names field.
  5. In the Available columns list, click a column that contains the observed frequencies data. This column must be of numeric type. Then, click > next to the Observed frequencies column field.
  6. In the Available columns list, click a column that contains expected frequency data. This column must be of numeric type. Then, click > next to the Expected frequencies column field. This field is required for goodness-of-fit calculations and optional for chi-square calculations. If you specify an expected frequencies column, you cannot use the expected values output table.
  7. In the Expected values output table list, select a target table for the expected values output table. Depending on certain conditions, this field is either optional or required:
  8. Optional: On the Column Mapping page, you can view the mappings between the output columns that result from the transformations that you defined on the Parameters page and the columns on your target table. You cannot change these mappings. If the output table is not used by any steps that are in test or production mode, you can rename target columns. To rename a target column, double-click the column name and type the new name.
  9. On the Processing Options page, provide information about how your step processes. For more information, see Defining processing options.
  10. Click OK to save your changes and close the step notebook.

Correlation transformer

Use the Correlation transformer to determine the extent to which changes in the value of an attribute (such as length of employment) are associated with changes in another attribute (such as salary). The data for a correlation analysis consists of two input columns. Each column contains values for one of the attributes of interest. The Correlation transformer can calculate various measures of association between the two input columns. You can select more than one statistic to calculate for a given pair of input columns.

The data in the input columns also can be treated as a sample obtained from a larger population, and the Correlation transformer can be used to test whether the attributes are correlated in the population. In this context, the null hypothesis asserts that the two attributes are not correlated, and the alternative hypothesis asserts that the attributes are correlated.

The Correlation transformer calculates any of the following correlation-related statistics on one or more pairs of columns:

Correlation coefficient r
The correlation coefficient r is a measure of the linear relationship between two attributes or columns of data. The correlation coefficient is also known as the Pearson product-moment correlation coefficient. The value of r can range from -1 to +1 and is independent of the units of measurement. A value of r near 0 indicates little correlation between attributes; a value near +1 or -1 indicates a high level of correlation.

When two attributes have a positive correlation coefficient, an increase in the value of one attribute indicates a likely increase in the value of the second attribute. A correlation coefficient of less than 0 indicates a negative correlation. That is, when one attribute shows an increase in value, the other attribute tends to show a decrease.

Consider two variables x and y:

Covariance
Covariance is a measure of the linear relationship between two attributes or columns of data. The value of the covariance can range from -infinity to +infinity. However, if the value of the covariance is too small or too large to be represented by a number, the value is represented by NULL.

Unlike the correlation coefficient, the covariance is dependent on the units of measurement. For example, measuring values of two attributes in inches rather than feet increases the covariance by a factor of 144.

T-value
T-value is the observed value of the T-statistic that is used to test the hypothesis that two attributes are correlated. The T-value can range between -infinity and +infinity. A T-value near 0 is evidence for the null hypothesis that there is no correlation between the attributes. A T-value far from 0 (either positive or negative) is evidence for the alternative hypothesis that there is correlation between the attributes.

The definition of T-statistic is:

T = r * SQRT((n-2) / (1 - r*r))

where r is the correlation coefficient, n is the number of input value pairs, and SQRT is the square root function.

If the correlation coefficient r is either -1 or +1, the T-value is represented by NULL. If the T-value is too small or too large to be represented by a number, the value is represented by NULL.

P-value
P-value is the probability, when the null hypothesis is true, that the absolute value of the T-statistic would equal or exceed the observed value (T-value). A small P-value is evidence that the null hypothesis is false and the attributes are, in fact, correlated.

Your source table and target table must exist in the warehouse database. This transformer can create a target table in the same warehouse database that contains the source, if you want it to. You can change the step only when the step is in development mode.


Figure db2db046 not displayed.

To define a Correlation transformer step:

  1. Open the step notebook.
  2. Specify general information about the transformer. For more information, see Providing general information about a step.
  3. Click the Parameters tab.


    Figure db2db047 not displayed.

  4. Optional: Click columns that you want to use as grouping columns and click >. Grouping columns can contain character or numeric data.
  5. Define correlation statistics:
    1. Right-click anywhere on the Transformer definition table, and click Add. A row is added to the table.
    2. Under the Data Column 1 heading in the row that you just added, click and select the first column for the calculation. Only columns of numeric type are listed.
    3. Under the Data Column 2 heading, click and select the second column for the calculation. Only columns of numeric type are listed. You cannot use the same column in Data Column 2 that you use in Data Column 1.
    4. Double-click under the Statistics heading in the row you just created. The ... push button is available.
    5. Click the ... push button. The Correlation - Select Statistics window opens.
    6. Click one or more statistics from the Available statistics list. Then, click >. The statistics are moved to the Selected statistics list.
    7. Repeat steps 5a-f, as appropriate.
    8. Click OK. The Correlation - Select Statistics window closes.

      The Correlation transformer supports partial data. For example, if you select a column to define statistics for, but don't select the statistics for it, the Correlation transformer will save your column selection. However, you cannot map columns for a row that has a partial data selection, nor can you successfully run a step that has a partial data selection.

  6. On the Column Mapping page, map the columns that result from your correlation statistics to columns in your target table.

    The column names for your correlation statistics are based on the data column entries that you select on the Parameters page and the statistic that you select for it. A column is created for each statistic that is selected and its corresponding data columns. For example, if your data columns, Salary and Employment, have the correlation statistics Covariance and T-value defined to them, the columns Covariance_Salary_Employment and T-value_Salary_Employment will be displayed on the Column Mappings page. Output columns are listed on the left side of the page, under the heading Source Columns. Target columns from the output table linked to the step are listed on the right side of the page. Use the Column Mapping page to perform the following tasks:

    If the target column is mapped to a source column, the mapping remains intact. If the Parameters page produces no output columns, or if this step is not linked to a target table and you have not specified automatic generation of a default table in the Parameters page, you will not be able to use this page to map your columns. Some steps will not allow you to change the column mapping.

  7. On the Processing Options page, provide information about how your step processes. For more information, see Defining processing options.
  8. Click OK to save your changes and close the step notebook.

Moving Average transformer

Simple and exponentially smoothed moving averages can often predict the future course of a time-related series of values. Moving averages are widely used in time-series analysis in business and financial forecasting. Rolling sums have other widely used financial uses.

You can use the Moving Average transformer to calculate the following values:

An exponential moving average is also known as an exponentially smoothed moving average.

Moving averages redistribute events that occur briefly over a wider period of time. This redistribution serves to remove noise, random occurrences, large peaks, and valleys from time-series data. You can apply the moving average method to a time-series data set to:

The Moving Average transformer uses a warehouse target table as a source. The table that you use as a source must contain a primary key. The transformer writes to a table on the warehouse target. Before you define this step, link the warehouse target to the step in the Process Model window, with the arrow pointing towards the step.


Figure db2db048 not displayed.

To define a Moving Average transformer:

  1. Open the step notebook.
  2. Specify general information about the transformer. For more information, see Providing general information about a step.
  3. On the Parameters page, in the Available Columns field, click columns that you want to use as grouping columns, then click >. The columns are moved to the Order by columns field. Grouping columns can contain character or numeric data.


    Figure db2db049 not displayed.

  4. For each column listed in the Order by columns field, click a sort order from the Sort list. Select Ascending for ascending order. Select Descending for descending order.
  5. Define moving average calculations:
    1. Right-click anywhere on the Transformer definition table, and click Add. A row is added to the table.
    2. Under the Input column heading in the row that you just added, left-click and select a column for which to calculate a moving average. Only numeric columns are listed. A numeric column is not listed if it is being written to by another row, or if it is selected as a grouping column. If you use a column as an input in 2 or more rows, you cannot use the column as an output column.
    3. Under the Statistics column heading, left-click and select the type of moving average to calculate.

      In the Moving Average transformer, you can select one of three statistics:

      • A simple moving average. A simple moving average (SMA) is defined as:

        SMA [row i] = SUM (data values for last N rows) / N

      • An exponential moving average. An exponential moving average (EMA) is defined as:

        EMA [row i] = ( Value [row i] * K ) + ( EMA [row i-1] * ( 1-K ) ) where: K = 2/(N+1)

        For example: For a four-day EMA, N = 4, K = 2/5 and 1-K = 3/5. An exponential moving average is also known as an exponentially smoothed moving average.

      • A rolling sum for N periods of data, where N is specified by the user. A rolling sum (RS) is defined as:

        RS [row i] = SUM ( data values for the last N rows)

    4. Under the Period column heading, double-click the field, then type the number of days for which the moving average will be calculated. For example, if you want to calculate the moving average for one week, type 7. You can only type digits. Other keystrokes are ignored.
    5. Repeat steps 5a-d, as appropriate.
  6. On the Column Mapping page, map output columns from your transformer to columns on your target table.

    Columns that result from your transformation are listed on the left side of the page. These columns are named after the selections you made for the Input column, the Statistics column, and the Period column on the Parameters page. For example, if you selected SALARY as the input column, EMA as the calculation, and 7 for the period on the Parameters page, you will see SALARY_EMA_7 displayed on the left side of the Column Mapping page .

    Restrictions:

    These restrictions apply to a single instance of a step. For example, you have step 1 and step 2. In step 1, you can use column B as a source column. In step 2, you can use column B as a target column.

    Use the Column Mapping page to perform the following tasks:

    If the Parameters page produces no output columns, or if this step is not linked to a target table and you have not specified automatic generation of a default table in the Parameters page, you will not be able to use this page to map your columns. Some steps will not allow you to change the column mapping.

  7. On the Processing Options page, provide information about how your step processes. For more information, see Defining processing options.
  8. Click OK to save your changes and close the step notebook.

Regression transformer

Use the Regression transformer to identify the relationships between a dependent variable and one or more independent variables, and to show how closely they are correlated. You can use this transformer to show the effect of a change in pricing on demand for a product, to show the effect of location on the response to advertising, or to show how closely two seemingly random sets of data are related.

This transformer performs a backward, full-model regression. This method starts with all independent variables in a model, but removes the least-important independent variables one at a time until only significant independent variables remain in the model.

The Regression transformer produces two additional output tables: the ANOVA summary table and the Equation variable table.

Before you begin this task, you must link this step in the Process Model window to a warehouse source table and three warehouse target tables. Or, you can link the step to a source and specify that the step create the target tables. The tables must exist in the same database. The Regression transformer writes the results from the Regression transformation to a table on one warehouse target, and creates the ANOVA summary table and the Equation variable table on the second and third targets. You can make changes to the step only when the step is in development mode.


Figure db2db050 not displayed.

To define a Regression transformer:

  1. Open the step notebook.
  2. Specify general information about the transformer. For more information, see Providing general information about a step.
  3. On the Parameters page, select columns from the Available columns list that contain independent variable data used for prediction in the regression calculations. Then, click > next to the Predictor columns list. The columns are added to the Predictor columns list. Only columns of numeric data type are available.


    Figure db2db051 not displayed.

  4. From the Available columns list, click one column that contains the dependent variable data to use for criteria in the regression calculations. Then, click > next to the Criterion column field. Only columns of numeric data type are available.
  5. In the Summary table list, select a target table to be your ANOVA summary table.
  6. In the Equation variable table list, select a target table to be your Equation variable table.
  7. Optional: On the Column Mapping page, you can view the mappings between the output columns that result from the transformations that you defined on the Parameters page and the columns on your target table. You cannot change these mappings. If the output table is not used by any steps that are in test or production mode, you can rename target columns. To rename a target column, double-click the column name and type the new name.
  8. On the Processing Options page, provide information about how your step processes. For more information, see Defining processing options.
  9. Click OK to save your changes and close the step notebook.


[ Top of Page | Previous Page | Next Page ]