bestsetNoise {DAAG}  R Documentation 
Best Subset Selection Applied to Noise
Description
Best subset selection applied to completely random noise. These functions demonstrate how variable selection techniques in regression can often err by including explanatory variables that are indistinguishable from noise.
Usage
bestsetNoise(m = 100, n = 40, method = "exhaustive", nvmax = 3,
X = NULL, y=NULL, intercept=TRUE,
print.summary = TRUE, really.big = FALSE, ...)
bestset.noise(m = 100, n = 40, method = "exhaustive", nvmax = 3,
X = NULL, y=NULL, intercept=TRUE,
print.summary = TRUE, really.big = FALSE, ...)
bsnCV(m = 100, n = 40, method = "exhaustive", nvmax = 3,
X = NULL, y=NULL, intercept=TRUE, nfolds = 2,
print.summary = TRUE, really.big = FALSE)
bsnOpt(X = matrix(rnorm(25 * 10), ncol = 10), y = NULL, method = "exhaustive",
nvmax = NULL, nbest = 1, intercept = TRUE, criterion = "cp",
tcrit = NULL, print.summary = TRUE, really.big = FALSE,
...)
bsnVaryNvar(m = 100, nvar = nvmax:50, nvmax = 3, method = "exhaustive",
intercept=TRUE,
plotit = TRUE, xlab = "# of variables from which to select",
ylab = "p-values for t-statistics", main = paste("Select 'best'",
nvmax, "variables"),
details = FALSE, really.big = TRUE, smooth = TRUE, ...)
Arguments
m 
the number of observations to be simulated, ignored if X is supplied. 
n 
the number of predictor variables in the simulated model, ignored if X is supplied. 
method 
Use 
nvmax 
Number of explanatory variables in model. 
X 
Use columns from this matrix. Alternatively, X may be a
data frame, in which case a model matrix will be formed from it.
If not 
y 
If not supplied, random normal noise will be generated. 
nbest 
Number of models, for each choice of number of columns
of explanatory variables, to return ( 
intercept 
Should an intercept be added? 
nvar 
range of number of candidate variables ( 
nfolds 
For splitting the data into training and test sets, the number of folds. 
criterion 
Criterion to use in choosing between models with
different numbers of explanatory variables ( 
tcrit 
Consider only those models for which the minimum absolute
t-statistic is greater than 
print.summary 
Should summary information be printed? 
plotit 
Plot a graph? ( 
xlab 
xlabel for graph ( 
ylab 
ylabel for graph ( 
main 
main title for graph ( 
details 
Return detailed output list ( 
really.big 
Set to 
smooth 
Fit smooth to graph? ( 
... 
Additional arguments, to be passed through to

Details
If X is not supplied, and in any case for bsnVaryNvar, a set of n
predictor variables is simulated as independent standard normal,
i.e. N(0,1), variates. Additionally, an N(0,1) response variable is
simulated. The function bsnOpt selects the ‘best’ model with nvmax
or fewer explanatory variables, where the argument criterion
specifies the criterion used to choose between models with different
numbers of explanatory columns. The other functions select the
‘best’ model with nvmax explanatory columns. In every case, the
selection is made using the regsubsets() function from the leaps
package. (The leaps package must be installed for these functions
to work.)
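The simulate-then-select step described above can be sketched in base R. This is only an illustration, not the package's implementation: it swaps regsubsets() for a brute-force search with combn(), and the helper name best_subset, the seed, and the sizes m, n and k are all assumptions chosen to keep the search small.

```r
# Hypothetical helper (not part of DAAG or leaps): exhaustive search over
# all subsets of k columns, keeping the one with the smallest residual
# sum of squares from a least-squares fit with an intercept.
best_subset <- function(X, y, k) {
  combos <- combn(ncol(X), k)              # each column is one candidate subset
  rss <- apply(combos, 2, function(idx) {
    sum(lm.fit(cbind(1, X[, idx, drop = FALSE]), y)$residuals^2)
  })
  combos[, which.min(rss)]
}

set.seed(1)                                # arbitrary seed, for reproducibility
m <- 100; n <- 20; k <- 3                  # assumed sizes, smaller than the defaults
X <- matrix(rnorm(m * n), nrow = m)        # n independent N(0,1) predictors
y <- rnorm(m)                              # N(0,1) response, unrelated to X

chosen <- best_subset(X, y, k)             # indices of the 'best' k columns
fit <- lm(y ~ X[, chosen, drop = FALSE])
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
print(chosen)
print(pvals)
```

The p-values printed here come from a model whose columns were chosen after looking at y, so they should not be read as ordinary p-values; that is the effect these functions are designed to expose.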
The function bsnCV splits the data (randomly) into nfolds (2 or
more) parts. Each part is set aside in turn and used to fit the
model (effectively, as test data), while the remaining data are used
to select the variables that will be fitted. One model fit is
returned for each of the nfolds parts.
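A rough base R sketch of that split follows, under the same assumptions as any brute-force stand-in: best_subset is a hypothetical helper replacing regsubsets(), and the seed and sizes are arbitrary. Variables are selected on the data outside the fold, and the model is then fitted on the part that was set aside.

```r
# Hypothetical helper (not part of DAAG or leaps): exhaustive search for
# the k columns giving the smallest residual sum of squares.
best_subset <- function(X, y, k) {
  combos <- combn(ncol(X), k)
  rss <- apply(combos, 2, function(idx) {
    sum(lm.fit(cbind(1, X[, idx, drop = FALSE]), y)$residuals^2)
  })
  combos[, which.min(rss)]
}

set.seed(2)                                # arbitrary seed
m <- 60; n <- 10; k <- 3; nfolds <- 2      # assumed sizes
X <- matrix(rnorm(m * n), nrow = m)
y <- rnorm(m)
fold <- sample(rep(seq_len(nfolds), length.out = m))  # random fold labels

fits <- lapply(seq_len(nfolds), function(f) {
  held <- fold == f
  # Select columns using only the data outside the current fold ...
  chosen <- best_subset(X[!held, , drop = FALSE], y[!held], k)
  # ... then fit those columns on the part that was set aside.
  dat <- data.frame(y = y[held], X[held, chosen, drop = FALSE])
  lm(y ~ ., data = dat)
})

# One coefficient table per fold
lapply(fits, function(fit) summary(fit)$coefficients)
```

Because the fitted data played no part in the selection, the t-statistics in these tables should behave much more like those of an ordinary regression on noise.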
The function bsnVaryNvar makes repeated calls to bestsetNoise.
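The idea of repeating the selection over a range of candidate counts can be sketched as below. This is a base R approximation rather than the package code: best_subset is a hypothetical brute-force helper, and the seed, m, k and the range nvar are assumptions kept small so the search stays fast.

```r
# Hypothetical helper (not part of DAAG or leaps): exhaustive search for
# the k columns giving the smallest residual sum of squares.
best_subset <- function(X, y, k) {
  combos <- combn(ncol(X), k)
  rss <- apply(combos, 2, function(idx) {
    sum(lm.fit(cbind(1, X[, idx, drop = FALSE]), y)$residuals^2)
  })
  combos[, which.min(rss)]
}

set.seed(4)                                # arbitrary seed
m <- 50; k <- 3; nvar <- 3:8               # assumed sizes

# For each number of candidate columns, simulate fresh noise, pick the
# 'best' k columns, and record the p-values of their coefficients.
pval <- t(sapply(nvar, function(n) {
  X <- matrix(rnorm(m * n), nrow = m)
  y <- rnorm(m)
  chosen <- best_subset(X, y, k)
  fit <- lm(y ~ X[, chosen, drop = FALSE])
  unname(summary(fit)$coefficients[-1, "Pr(>|t|)"])
}))
rownames(pval) <- nvar                     # one row per number of candidates
print(pval)
```

With a single simulation per row the pattern is noisy, which is presumably why the packaged function offers a smooth when plotting.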
Value
bestsetNoise returns the lm model object for the "best" model with
nvmax explanatory columns.
bsnCV returns as many models as there are folds.
bsnVaryNvar silently returns either (details=FALSE) a matrix of
p-values of the coefficients for the ‘best’ choice of model for
each different number of candidate variables, or (details=TRUE)
a list with elements:
coef 
A matrix of sets of regression coefficients 
SE 
A matrix of standard errors 
pval 
A matrix of p-values 
Matrices have one row for each choice of nvar. The statistics
returned are for the ‘best’ model with nvmax explanatory variables.
bsnOpt silently returns a list with elements:
u1 
‘best’ model ( 
tcrit 
For each model, the minimum of the absolute values of the t-statistics. 
regsubsets_obj 
The object returned by the call to 
Note
These functions are primarily designed to demonstrate the biases
that can be expected, relative to theoretical estimates of standard
errors of parameters and other fitted model statistics, when there
is prior selection of the columns that are to be included in the
model. With the exception of bsnVaryNvar, they can also be used
with an X and y for actual data. In that case, the p-values should
be compared with those obtained from repeated use of the function
where y is random noise, as a check on the extent of selection
effects.
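One way such a comparison might be set up is sketched below. Several things here are assumptions made for illustration: best_subset and min_pval are hypothetical base R helpers standing in for the packaged functions, the smallest coefficient p-value is used as the summary statistic, 200 noise replicates is an arbitrary choice, and the "actual" data are themselves simulated with one genuine effect.

```r
# Hypothetical helper (not part of DAAG or leaps): exhaustive search for
# the k columns giving the smallest residual sum of squares.
best_subset <- function(X, y, k) {
  combos <- combn(ncol(X), k)
  rss <- apply(combos, 2, function(idx) {
    sum(lm.fit(cbind(1, X[, idx, drop = FALSE]), y)$residuals^2)
  })
  combos[, which.min(rss)]
}

# Hypothetical summary: smallest coefficient p-value in the 'best' model.
min_pval <- function(X, y, k) {
  chosen <- best_subset(X, y, k)
  fit <- lm(y ~ X[, chosen, drop = FALSE])
  min(summary(fit)$coefficients[-1, "Pr(>|t|)"])
}

set.seed(3)                                # arbitrary seed
m <- 40; n <- 8; k <- 2                    # assumed sizes
X <- matrix(rnorm(m * n), nrow = m)
y <- 0.5 * X[, 1] + rnorm(m)               # stand-in for actual data

observed <- min_pval(X, y, k)

# Reference distribution: same X, same selection, but y is pure noise.
noise <- replicate(200, min_pval(X, rnorm(m), k))

# Share of noise runs whose smallest p-value is at least as small as
# the one observed for the actual response.
print(observed)
print(mean(noise <= observed))
```

If a large share of the noise runs match or beat the observed p-value, the apparent significance is about what selection alone would produce.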
Author(s)
J.H. Maindonald
See Also
Examples
leaps.out <- try(require(leaps, quietly=TRUE))
leaps.out.log <- is.logical(leaps.out)
if ((leaps.out.log==TRUE)&(leaps.out==TRUE)){
bestsetNoise(20,6) # `best' 3-variable regression for 20 simulated observations
# on 7 unrelated variables (including the response)
bsnCV(20,6) # `best' 3-variable regressions (one for each fold) for 20
# simulated observations on 7 unrelated variables
# (including the response)
bsnVaryNvar(m = 50, nvar = 3:6, nvmax = 3, method = "exhaustive",
plotit=FALSE, details=TRUE)
bsnOpt()
}