idaLm {ibmdbR}R Documentation

Linear regression

Description

This function performs linear regression on the contents of an ida.data.frame.

Usage


idaLm(form, idadf, id = "id", modelname = NULL, dropModel = TRUE, limit = 25)

## S3 method for class 'idaLm'
print(x, ...)
## S3 method for class 'idaLm'
predict(object, newdata, id, outtable = NULL, ...)
## S3 method for class 'idaLm'
plot(x, names = TRUE, max_forw = 50, max_plot = 15, order = NULL,
lmgON = FALSE, backwardON = FALSE, ...)

Arguments

form

A formula object that specifies both the name of the column that contains the continuous target variable and either a list of columns separated by plus symbols or a single period (to specify that all other columns in the ida.data.frame are to be used as predictors). The specified columns can contain continuous or categorical values. The specified formula cannot contain transformations.

idadf

An ida.data.frame that contains the input data for the function.

id

The name of the column that contains a unique ID for each row of the input data. An id column needs to be specified, if a model contains categorical values, more than 41 columns or when dropModel is set to FALSE. If no valid id column was specified, a temporary id column will be used (not for DB2 for z/OS).

modelname

Name of the model that will be created in the database.

dropModel

logical: If TRUE the in database model will be dropped after the calculation.

limit

The maximum number of levels for a categorical column. Its default value is 25. This parameter only exists for consistency with older version of idaLm.

x

An object of the class idaLm.

object

An object of the class idaLm

newdata

An ida.data.frame that contains data that will be predicted.

outtable

The name of the table where the results will be written in.

names

logical: If set to TRUE then the plot will contain the names of the attributes instead of numbers.

max_forw

integer: The maximum number of iterations the heuristic forward/backward will be calculated.

max_plot

integer: The maximum number of attributes that will appear in the plot. It must be bigger than 0.

order

Vector of attribute names. The method will calculate the value of the models with the attributes in the order of the vector and plot the value for each of it.

lmgON

logical: If set TRUE the method will calculate the importance metric lmg. This method has exponential runningtime and is not supported for more than 15 attributes

backwardON

logical: If set TRUE the method will calculate the backward heuristic. By default (FALSE) it will do the forward heuristic.

...

Additional parameters.

Details

The idaLm function computes a linear regression model by extracting a covariance matrix and computing its inverse. This implementation is optimized for problems that involve a large number of samples and a relatively small number of predictors. The maximum number of columns is 78.

Missing values in the input table are ignored when calculating the covariance matrix. If this leads to undefined entries in the covariance matrix, the function fails. If the inverse of the covariance matrix cannot be computed (for example, due to correlated predictors), the Moore-Penrose generalized inverse is used instead.

The output of the idaLm function has the following attributes:

$coefficients is a vector with two values. The first value is the slope of the line that best fits the input data; the second value is its y-intercept.

$RSS is the root sum square (that is, the square root of the sum of the squares).

$effects is not used and can be ignored.

$rank is the rank.

$df.residuals is the number of degrees of freedom associated with the residuals.

$coefftab is a is a vector with four values:

$Loglike is the log likelihood ratio.

$AIC is the Akaike information criterion. This is a measure of the relative quality of the model.

$BIC is the Bayesian information criterion. This is used for model selection.

$CovMat the Matrix used in the calculation ("Covariance Matrix"). This matrix is necessary for the Calculation in plot.idaLm and the statistics.

$card the number of dummy variables created for categorical columns and 1 for numericals.

$model the in database modelname of the idaLm object.

$numrow the number of rows of the input table that do not contain NAs.

$sigma the residual standard error.

The plot.idaLm function uses R^2 as a measure of quality of a linear model. R^2 compares the variance of the predicted values and the variance of the actual values of the target variable.

$First: Returns the R^2 value of the linear model for each attribute alone.

$Usefulness: Returns the R^2 value reduction of the linear model with all attributes to the linear model with one attribute taken away.

$Forward_Values: Is only calculated if backwardON=FALSE. This is a heuristic that adds in each step the attribute which has the most R^2 increase.

$LMG: Is only calculated if lmgON=TRUE. It returns the increase of R^2 of each attribute averaged over every possible permutation. By grouping some of the permutations we only need to average over every possible subset. For n attributes there are 2^n subsets. So LMG is an algorithm with exponential runningtime and is not recommended for more than 15 attributes.

$Backward_Values: Is only calculated if backwardON=TRUE. Similar to the forward heuristic. This time we choose in each step of the algorithm that has minimal R^2 reduction when taking it out of the model, starting with all attributes.

$Model_Values: Is only calculated if order is a vector of attributes. In this case the function calculates the R^2 value for the models that we get when we add one attribute of order in each step.

RelImpPlot.png: If lmgON=FALSE. This plot shows a stackplot of the values Usefulness,First and the Model_Value of the heuristic. Note that usually Usefulness<First<Model_Value and that the bars overlap each other. If lmgON=TRUE. This plot shows the LMG values of the attributes in the order of the heuristic forward, backward or order.

Value

The procedure returns a linear regression model of class idaLm.

Examples

## Not run: 
#Create a pointer to table IRIS
idf <- ida.data.frame("IRIS")

#Calculate linear model in-db
lm1 <- idaLm(SepalLength~., idf)

library(ggplot2)
plot(lm1)

#Calculating linear models with categorical values requires an id column
lm1 <- idaLm(SepalLength~., idf, id="ID")

## End(Not run)

[Package ibmdbR version 1.51.0 Index]