idaLm {ibmdbR} | R Documentation |
Linear regression
Description
This function performs linear regression on the contents of an ida.data.frame.
Usage
idaLm(form, idadf, id = "id", modelname = NULL, dropModel = TRUE, limit = 25)
## S3 method for class 'idaLm'
print(x, ...)
## S3 method for class 'idaLm'
predict(object, newdata, id, outtable = NULL, ...)
## S3 method for class 'idaLm'
plot(x, names = TRUE, max_forw = 50, max_plot = 15, order = NULL,
lmgON = FALSE, backwardON = FALSE, ...)
Arguments
form |
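A formula object that specifies the target column and the predictor columns, for example SepalLength~. to use all other columns as predictors. |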
idadf |
An ida.data.frame that contains the input data for the function. |
id |
The name of the column that contains a unique ID for each row of the input data. An id column must be specified if the model contains categorical values or more than 41 columns, or if dropModel is set to FALSE. If no valid id column is specified, a temporary id column is used (not for DB2 for z/OS). |
modelname |
Name of the model that will be created in the database. |
dropModel |
logical: If TRUE, the in-database model is dropped after the calculation. |
limit |
The maximum number of levels for a categorical column. Its default value is 25. This parameter only exists for consistency with older versions of idaLm. |
x |
An object of the class idaLm. |
object |
An object of the class idaLm. |
newdata |
An ida.data.frame that contains the data for which predictions are made. |
outtable |
The name of the table to which the results are written. |
names |
|
max_forw |
|
max_plot |
|
order |
Vector of attribute names. The method calculates the R^2 value of the models obtained by adding the attributes in the order given by the vector, and plots the value for each of them. |
lmgON |
|
backwardON |
|
... |
Additional parameters. |
Details
The idaLm function computes a linear regression model by extracting a covariance matrix and computing its inverse. This implementation is optimized for problems that involve a large number of samples and a relatively small number of predictors. The maximum number of columns is 78.
Missing values in the input table are ignored when calculating the covariance matrix. If this leads to undefined entries in the covariance matrix, the function fails. If the inverse of the covariance matrix cannot be computed (for example, due to correlated predictors), the Moore-Penrose generalized inverse is used instead.
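The covariance-based approach described above can be illustrated locally in base R. This is only a sketch under stated assumptions: it uses the built-in iris data set instead of an ida.data.frame, the helper pinv is hypothetical and not part of ibmdbR, and the exact matrix that idaLm builds in the database may differ.

```r
# Local illustration only; idaLm performs the equivalent work in the database.
X <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
y <- iris$Sepal.Length

# Covariance among the predictors, and between the predictors and the target.
Sxx <- cov(X)
Sxy <- cov(X, y)

# Hypothetical helper: Moore-Penrose generalized inverse computed via SVD.
# Singular values close to zero are dropped, so it also works for
# (nearly) singular matrices, for example with correlated predictors.
pinv <- function(A, tol = sqrt(.Machine$double.eps)) {
  s <- svd(A)
  keep <- s$d > tol * max(s$d)
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

# Slopes from the (generalized) inverse, intercept from the column means.
slopes <- pinv(Sxx) %*% Sxy
intercept <- mean(y) - sum(colMeans(X) * slopes)
beta <- c(Intercept = intercept, slopes[, 1])
```

For a well-conditioned problem such as this one, the result should agree with coef(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)) up to numerical precision.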
The output of the idaLm function has the following attributes:
$coefficients is a vector with two values. The first value is the slope of the line that best fits the input data; the second value is its y-intercept.
$RSS is the root sum square (that is, the square root of the sum of the squares).
$effects is not used and can be ignored.
$rank is the rank.
$df.residuals is the number of degrees of freedom associated with the residuals.
$coefftab is a vector with four values:
The slope and y-intercept of the line that best fits the input data
The standard error
The t-value
The p-value
$Loglike is the log likelihood ratio.
$AIC is the Akaike information criterion. This is a measure of the relative quality of the model.
$BIC is the Bayesian information criterion. This is used for model selection.
$CovMat is the matrix used in the calculation ("covariance matrix"). This matrix is needed for the calculations in plot.idaLm and for the statistics.
$card is the number of dummy variables created for each categorical column, and 1 for each numerical column.
$model is the in-database model name of the idaLm object.
$numrow is the number of rows of the input table that do not contain NAs.
$sigma is the residual standard error.
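For orientation, the statistics listed above have close analogues in base R's lm. The following sketch is an assumption-laden comparison, not a description of how idaLm computes its values: it fits a local model on the built-in iris data set, and the numbers returned by idaLm for an in-database table are not guaranteed to be identical.

```r
# Local base R fit, used only to show the analogous quantities.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

# Estimate, standard error, t-value and p-value per coefficient
# (compare with $coefftab).
coef_table <- summary(fit)$coefficients

# Residual degrees of freedom and residual standard error
# (compare with $df.residuals and $sigma).
resid_df <- df.residual(fit)
resid_se <- sigma(fit)

# Sum of squared residuals; the residual standard error is sqrt(rss / resid_df).
rss <- sum(residuals(fit)^2)

# Log-likelihood and information criteria (compare with $Loglike, $AIC, $BIC).
info <- c(logLik = as.numeric(logLik(fit)), AIC = AIC(fit), BIC = BIC(fit))
```

Lower AIC and BIC values indicate a better trade-off between fit and model size when comparing models fitted to the same data.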
The plot.idaLm function uses R^2 as a measure of the quality of a linear model. R^2 compares the variance of the predicted values with the variance of the actual values of the target variable.
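The relationship between R^2 and the two variances can be checked locally. This is a minimal sketch that assumes an ordinary least-squares fit with an intercept on the built-in iris data set; it uses base R's lm rather than idaLm.

```r
# Local least-squares fit with an intercept.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

# R^2 as the ratio of the variance of the predicted values
# to the variance of the actual target values.
r2_from_variances <- var(fitted(fit)) / var(iris$Sepal.Length)

# R^2 as reported by summary(); for a model with an intercept
# the two values coincide up to numerical precision.
r2_reported <- summary(fit)$r.squared
```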
$First: Returns the R^2 value of the linear model for each attribute alone.
$Usefulness: Returns the reduction in R^2 from the linear model with all attributes to the linear model with one attribute taken away.
$Forward_Values: Is only calculated if backwardON=FALSE. This is a heuristic that adds, in each step, the attribute that yields the largest increase in R^2.
$LMG: Is only calculated if lmgON=TRUE. It returns the increase in R^2 of each attribute, averaged over every possible permutation. By grouping some of the permutations, the average only needs to be taken over every possible subset. For n attributes there are 2^n subsets, so LMG has exponential running time and is not recommended for more than 15 attributes.
$Backward_Values: Is only calculated if backwardON=TRUE. Similar to the forward heuristic, but starting with all attributes: each step removes the attribute whose removal causes the smallest reduction in R^2.
$Model_Values: Is only calculated if order is a vector of attributes. In this case the function calculates the R^2 value for the models obtained by adding one attribute of order in each step.
RelImpPlot.png: If lmgON=FALSE, this plot shows a stacked plot of the values Usefulness, First and the Model_Value of the heuristic. Note that usually Usefulness<First<Model_Value and that the bars overlap each other. If lmgON=TRUE, this plot shows the LMG values of the attributes in the order of the heuristic (forward, backward or order).
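The First, Usefulness and forward-heuristic measures described above can be sketched locally as follows. This is an illustration under assumptions, not the plot.idaLm implementation: it refits local lm models on the built-in iris data set, whereas plot.idaLm is documented to work from the stored covariance matrix, and the helper r2 and all variable names are hypothetical.

```r
# Hypothetical helper: R^2 of a local lm() fit of `target` on `attrs`.
r2 <- function(attrs, target, data) {
  if (length(attrs) == 0) return(0)
  summary(lm(reformulate(attrs, response = target), data = data))$r.squared
}

target <- "Sepal.Length"
attrs <- c("Sepal.Width", "Petal.Length", "Petal.Width")

# First: R^2 of the model that uses each attribute alone.
first <- sapply(attrs, function(a) r2(a, target, iris))

# Usefulness: R^2 lost when one attribute is removed from the full model.
full <- r2(attrs, target, iris)
usefulness <- sapply(attrs, function(a) full - r2(setdiff(attrs, a), target, iris))

# Forward heuristic: in each step add the attribute with the largest R^2 gain
# and record the R^2 of the resulting model.
chosen <- character(0)
forward_values <- numeric(0)
while (length(chosen) < length(attrs)) {
  remaining <- setdiff(attrs, chosen)
  gains <- sapply(remaining, function(a) r2(c(chosen, a), target, iris))
  best <- remaining[which.max(gains)]
  chosen <- c(chosen, best)
  forward_values[best] <- max(gains)
}
```

The names of forward_values give the order in which the heuristic added the attributes, and the last entry equals the R^2 of the full model. The backward heuristic would follow the same pattern, starting from all attributes and removing the one with the smallest R^2 loss in each step.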
Value
The procedure returns a linear regression model of class idaLm.
Examples
## Not run:
# Create a pointer to table IRIS
idf <- ida.data.frame("IRIS")
# Calculate the linear model in-database
lm1 <- idaLm(SepalLength~., idf)
# Load ggplot2 before plotting the model
library(ggplot2)
plot(lm1)
# Calculating linear models with categorical values requires an id column
lm1 <- idaLm(SepalLength~., idf, id="ID")
## End(Not run)