rlars {robustHD} | R Documentation |
Robust least angle regression
Description
Robustly sequence candidate predictors according to their predictive content and find the optimal model along the sequence.
Usage
rlars(x, ...)
## S3 method for class 'formula'
rlars(formula, data, ...)
## Default S3 method:
rlars(
x,
y,
sMax = NA,
centerFun = median,
scaleFun = mad,
winsorize = FALSE,
const = 2,
prob = 0.95,
fit = TRUE,
s = c(0, sMax),
regFun = lmrob,
regArgs = list(),
crit = c("BIC", "PE"),
splits = foldControl(),
cost = rtmspe,
costArgs = list(),
selectBest = c("hastie", "min"),
seFactor = 1,
ncores = 1,
cl = NULL,
seed = NULL,
model = TRUE,
tol = .Machine$double.eps^0.5,
...
)
Arguments
x |
a matrix or data frame containing the candidate predictors. |
... |
additional arguments to be passed down. For the default
method, additional arguments to be passed down to
|
formula |
a formula describing the full model. |
data |
an optional data frame, list or environment (or object coercible
to a data frame by |
y |
a numeric vector containing the response. |
sMax |
an integer giving the number of predictors to be sequenced. If
it is |
centerFun |
a function to compute a robust estimate for the center
(defaults to median). |
scaleFun |
a function to compute a robust estimate for the scale
(defaults to mad). |
winsorize |
a logical indicating whether to clean the full data set by
multivariate winsorization, i.e., to perform data cleaning RLARS instead of
plug-in RLARS (defaults to FALSE). |
const |
numeric; tuning constant to be used in the initial correlation estimates based on adjusted univariate winsorization (defaults to 2). |
prob |
numeric; probability for the quantile of the
|
fit |
a logical indicating whether to fit submodels along the sequence
( |
s |
an integer vector of length two giving the first and last step
along the sequence for which to compute submodels. The default is to start
with a model containing only an intercept (step 0) and iteratively add all
variables along the sequence (step sMax). |
regFun |
a function to compute robust linear regressions along the
sequence (defaults to lmrob). |
regArgs |
a list of arguments to be passed to regFun. |
crit |
a character string specifying the optimality criterion to be
used for selecting the final model. Possible values are "BIC" (the default) and "PE" (prediction error). |
splits |
an object giving data splits to be used for prediction error
estimation (see |
cost |
a cost function measuring prediction loss (see
|
costArgs |
a list of additional arguments to be passed to the
prediction loss function cost. |
selectBest , seFactor |
arguments specifying a criterion for selecting
the best model (see |
ncores |
a positive integer giving the number of processor cores to be
used for parallel computing (the default is 1 for no parallelization). If
this is set to |
cl |
a parallel cluster for parallel computing as generated by
|
seed |
optional initial seed for the random number generator (see
|
model |
a logical indicating whether the model data should be included in the returned object. |
tol |
a small positive numeric value. This is used in bivariate winsorization to determine whether the initial estimate from adjusted univariate winsorization is close to 1 in absolute value. In this case, bivariate winsorization would fail since the points form almost a straight line, and the initial estimate is returned. |
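The roles of centerFun, scaleFun and const can be illustrated with a minimal base R sketch. It is not the package's implementation: it uses plain univariate winsorization rather than the adjusted variant named above, and the helper winsorize_univariate and the simulated data are hypothetical names introduced only for this illustration.

```r
# Plain (not adjusted) univariate winsorization, for illustration only.
# winsorize_univariate() is a hypothetical helper, not part of robustHD.
winsorize_univariate <- function(v, centerFun = median, scaleFun = mad,
                                 const = 2) {
  z <- (v - centerFun(v)) / scaleFun(v)  # robust standardization
  pmin(pmax(z, -const), const)           # shrink values beyond +/- const
}

set.seed(1234)
# 95 clean observations with positive correlation, plus 5 outliers
x1 <- c(rnorm(95), rep(10, 5))
x2 <- c(x1[1:95] + rnorm(95, sd = 0.5), rep(-10, 5))

# classical correlation is dominated by the outliers
cor_classical <- cor(x1, x2)
# correlation of the winsorized variables is much less affected
cor_winsorized <- cor(winsorize_univariate(x1), winsorize_univariate(x2))

c(classical = cor_classical, winsorized = cor_winsorized)
```

A smaller const shrinks more observations towards the center, which limits the influence of outliers at the cost of also modifying more of the clean data.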
Value
If fit is FALSE, an integer vector containing the indices of the sequenced predictors.
Else if crit is "PE", an object of class "perrySeqModel" (inheriting from class "perrySelect", see perrySelect). It contains information on the prediction error criterion, and includes the final model as component finalModel.
Otherwise an object of class "rlars" (inheriting from class "seqModel") with the following components:
active
an integer vector containing the indices of the sequenced predictors.
s
an integer vector containing the steps for which submodels along the sequence have been computed.
coefficients
a numeric matrix in which each column contains the regression coefficients of the corresponding submodel along the sequence.
fitted.values
a numeric matrix in which each column contains the fitted values of the corresponding submodel along the sequence.
residuals
a numeric matrix in which each column contains the residuals of the corresponding submodel along the sequence.
df
an integer vector containing the degrees of freedom of the submodels along the sequence (i.e., the number of estimated coefficients).
robust
a logical indicating whether a robust fit was computed (TRUE for "rlars" models).
scale
a numeric vector giving the robust residual scale estimates for the submodels along the sequence.
crit
an object of class "bicSelect" containing the BIC values and indicating the final model (only returned if argument crit is "BIC" and argument s indicates more than one step along the sequence).
muX
a numeric vector containing the center estimates of the predictors.
sigmaX
a numeric vector containing the scale estimates of the predictors.
muY
numeric; the center estimate of the response.
sigmaY
numeric; the scale estimate of the response.
x
the matrix of candidate predictors (if model is TRUE).
y
the response (if model is TRUE).
w
a numeric vector giving the data cleaning weights (if winsorize is TRUE).
call
the matched function call.
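As a hedged sketch of how the return values described above might be inspected, the following fits a small simulated data set (the data and object names are made up for illustration) once with fit = FALSE and once with the defaults. It assumes the robustHD package is installed and that the components can be accessed with the usual list operator.

```r
library("robustHD")

# small simulated data set (illustrative only)
set.seed(1234)
n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- c(x[, 1:3] %*% rep(1, 3) + 0.5 * rnorm(n))

# fit = FALSE: only the indices of the sequenced predictors are returned
active <- rlars(x, y, sMax = 5, fit = FALSE)
active

# defaults (fit = TRUE, crit = "BIC"): an object of class "rlars"
fit <- rlars(x, y, sMax = 5)
class(fit)
fit$active             # indices of the sequenced predictors
fit$s                  # steps for which submodels were computed
dim(fit$coefficients)  # one column per submodel
fit$df                 # degrees of freedom per submodel
```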
Author(s)
Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar
References
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299. doi:10.1198/016214507000000950
See Also
coef, fitted, plot, predict, residuals, rstandard, lmrob
Examples
## generate data
# example is not high-dimensional to keep computation time low
library("robustHD")
library("mvtnorm")
set.seed(1234) # for reproducibility
n <- 100 # number of observations
p <- 25 # number of variables
beta <- rep.int(c(1, 0), c(5, p-5)) # coefficients
sigma <- 0.5 # controls signal-to-noise ratio
epsilon <- 0.1 # contamination level
Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p))
x <- rmvnorm(n, sigma=Sigma) # predictor matrix
e <- rnorm(n) # error terms
i <- 1:ceiling(epsilon*n) # observations to be contaminated
e[i] <- e[i] + 5 # vertical outliers
y <- c(x %*% beta + sigma * e) # response
x[i,] <- x[i,] + 5 # bad leverage points
## fit robust LARS model
rlars(x, y, sMax = 10)
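The formula method listed under Usage can be sketched in the same way. The data frame dat and its column names are hypothetical, and the calls to coef, fitted and residuals rely on the accessors listed under See Also; their output is assumed here to refer to the selected submodel.

```r
library("robustHD")

# simulated data frame (illustrative only)
set.seed(1234)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n), x6 = rnorm(n))
dat$y <- dat$x1 + dat$x2 + 0.5 * rnorm(n)

# sequence at most 4 predictors from the full model y ~ .
fit <- rlars(y ~ ., data = dat, sMax = 4)

# extract information with the accessor functions
coef(fit)
head(fitted(fit))
head(residuals(fit))
```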