>>-+-REGR_AVGX----------+--(----expression1--,--expression2--)-->< +-REGR_AVGY----------+ +-REGR_COUNT---------+ +-+-REGR_INTERCEPT-+-+ | '-REGR_ICPT------' | +-REGR_R2------------+ +-REGR_SLOPE---------+ +-REGR_SXX-----------+ +-REGR_SXY-----------+ '-REGR_SYY-----------'
The schema is SYSIBM.
The regression functions support the fitting of an ordinary-least-squares regression line of the form y = a * x + b to a set of number pairs. The first element of each pair (expression1) is interpreted as a value of the dependent variable (i.e., a "y value"). The second element of each pair (expression2 ) is interpreted as a value of the independent variable (i.e., an "x value").
The function REGR_COUNT returns the number of non-null number pairs used to fit the regression line (see below).
The function REGR_INTERCEPT (the short form is REGR_ICPT) returns the y-intercept of the regression line ("b" in the above equation)
The function REGR_R2 returns the coefficient of determination (also called "R-squared" or "goodness-of-fit") for the regression.
The function REGR_SLOPE returns the slope of the line (the parameter "a" in the above equation).
The functions REGR_AVGX, REGR_AVGY, REGR_SXX, REGR_SYY, and REGR_SXY return quantities that can be used to compute various diagnostic statistics needed for the evaluation of the quality and statistical validity of the regression model (see below).
The argument values must be numbers.
The data type of the result of REGR_COUNT is integer. For the remaining functions, the data type of the result is double-precision floating point. The result can be null. When not null, the result of REGR_R2 is between 0 and 1 and the result of both REGR_SXX and REGR_SYY is non-negative.
Each function is applied to the set of (expression1, expression2) pairs derived from the argument values by the elimination of all pairs for which either expression1 or expression2 is null.
If the set is not empty and VARIANCE(expression2) is positive, REGR_COUNT returns the number of non-null pairs in the set, and the remaining functions return results that are defined as follows:
REGR_SLOPE(expression1,expression2) = COVARIANCE(expression1,expression2)/VARIANCE(expression2)
REGR_INTERCEPT(expression1, expression2) = AVG(expression1) - REGR_SLOPE(expression1, expression2) * AVG(expression2)
REGR_R2(expression1, expression2) = POWER(CORRELATION(expression1, expression2), 2) if VARIANCE(expression1)>0 REGR_R2(expression1, expression2) = 1 if VARIANCE(expression1)=0
REGR_AVGX(expression1, expression2) = AVG(expression2)
REGR_AVGY(expression1, expression2) = AVG(expression1)
REGR_SXX(expression1, expression2) = REGR_COUNT(expression1, expression2) * VARIANCE(expression2)
REGR_SYY(expression1, expression2) = REGR_COUNT(expression1, expression2) * VARIANCE(expression1)
REGR_SXY(expression1, expression2) = REGR_COUNT(expression1, expression2) * COVARIANCE(expression1, expression2)
If the set is not empty and VARIANCE(expression2) is equal to zero, then the regression line either has infinite slope or is undefined. In this case, the functions REGR_SLOPE, REGR_INTERCEPT, and REGR_R2 each return a null value, and the remaining functions return values as defined above. If the set is empty, REGR_COUNT returns zero and the remaining functions return a null value.
The order in which the values are aggregated is undefined, but every intermediate result must be within the range of the result data type.
The regression functions are all computed simultaneously during a single pass through the data. In general, it is more efficient to use the regression functions to compute the statistics needed for a regression analysis than to perform the equivalent computations using ordinary column functions such as AVERAGE, VARIANCE, COVARIANCE, and so forth.
The usual diagnostic statistics that accompany a linear-regression analysis can be computed in terms of the above functions. For example:
Example:
SELECT REGR_SLOPE(BONUS,SALARY), REGR_INTERCEPT(BONUS,SALARY), REGR_R2(BONUS,SALARY), REGR_COUNT(BONUS,SALARY), REGR_AVGX(BONUS,SALARY), REGR_AVGY(BONUS,SALARY), REGR_SXX(BONUS,SALARY), REGR_SYY(BONUS,SALARY), REGR_SXY(BONUS,SALARY) INTO :SLOPE, :ICPT, :RSQR, :CNT, :AVGSAL, :AVGBONUS, :SXX, :SYY, :SXY FROM EMPLOYEE WHERE WORKDEPT = 'A00'
When using the sample table, the host variables are set to the following approximate values:
SLOPE: +1.71002671916749E-002 ICPT: +1.00871888623260E+002 RSQR: +9.99707928128685E-001 CNT: 3 AVGSAL: +4.28333333333333E+004 AVGBONUS: +8.33333333333333E+002 SXX: +2.96291666666667E+008 SYY: +8.66666666666667E+004 SXY: +5.06666666666667E+006