booteval.relimp {relaimpo} | R Documentation |
Functions to Bootstrap Relative Importance Metrics
Description
These functions provide bootstrap confidence intervals for relative importances.
boot.relimp
uses the R package boot to do the actual bootstrapping of requested metrics
(which may take quite a while),
while booteval.relimp
evaluates the results and provides confidence intervals.
Output from booteval.relimp
is printed with a tailored print method, and a plot method
produces bar plots with confidence indication of the relative importance metrics.
Usage
## generic function
boot.relimp(object, ...)
## default S3 method
## Default S3 method:
boot.relimp(object, x = NULL, ..., b = 1000, type = "lmg",
rank = TRUE, diff = TRUE, rela = FALSE, always = NULL,
groups = NULL, groupnames = NULL, fixed=FALSE,
weights = NULL, design = NULL)
## S3 method for formula objects
## S3 method for class 'formula'
boot.relimp(formula, data, weights, na.action, ..., subset = NULL)
## S3 method for objects of class lm
## S3 method for class 'lm'
boot.relimp(object, type = "lmg", groups = NULL, groupnames=NULL, always = NULL,
..., b=1000)
## function for evaluating bootstrap results
booteval.relimp(bootrun, bty = "perc", level = 0.95,
sort = FALSE, norank = FALSE, nodiff = FALSE,
typesel = c("lmg", "pmvd", "last", "first", "betasq", "pratt", "genizi", "car"))
Arguments
object |
cf. |
formula |
cf. |
x |
cf. |
b |
is the number of bootstrap runs requested on boot.relimp (default: |
type |
cf. |
rank |
is a logical requesting bootstrapping of ranks ( |
diff |
is a logical requesting bootstrapping of pairwise differences in relative importance ( |
rela |
cf. |
always |
cf. |
groups |
cf. |
groupnames |
cf. |
weights |
cf. |
design |
cf. For a description of the bootstrap method's treatment of designs, see the details section. In the current version, using a design in bootstrapping must be considered EXPERIMENTAL. |
fixed |
is a logical requesting bootstrapping for a fixed design matrix (if TRUE). The default is bootstrapping for randomly drawn samples (fixed = FALSE). |
data |
cf. |
subset |
cf. |
na.action |
cf. |
... |
usable for further arguments, particularly all arguments of the default method can be given to all other methods |
bootrun |
is an object of class |
bty |
is the type of bootstrap interval requested (a character string),
as handed over to the function |
level |
is a single confidence level or a numeric vector of confidence levels. |
sort |
is a logical requesting output sorted by size of relative contribution ( |
norank |
is a logical that suppresses of rank letters ( |
nodiff |
is a logical that suppresses output of confidence intervals for differences ( |
typesel |
provides the metrics that are to be reported. Default: all available ones
(intersection of those available in object |
Details
Calculations of metrics are based on the function calc.relimp
.
Bootstrapping is done with the R package boot
,
resampling the full observation vectors by default (combinations of response, weights and regressors, cf. Fox (2002)).
If fixed=TRUE
is specified, bootstrapping is based on residuals rather than full observations,
keeping the X-variables fixed.
If the weights
option is used, weights are resampled together with the full observations,
and weighted contributions are calculated for each resample (no re-normalization is done within the resamples.)
Please note that usage of weights in linear models can have very different reasons.
One motivation is a different variability of different observations, where weights are the inverse variances.
This is the way weights
are treated in function lm
and also in calc.relimp
and in boot.relimp
,
if a vector of weights is given with the weights
option. Specifically, weights do not affect the
resampling probability in bootstrapping, i.e. each observation has the same probability to be included in resamples.
If the weights in a data frame represent the multiplicity of each observation (i.e. there are several units with
identical combination of values in the data, and the weights represent the number of units with exactly this value
pattern for each row of the data frame), they can also be directly used as weights in calc.relimp
for
calculating the metrics. However, such frequency weights cannot be appropriately accomodated in boot.relimp
;
instead, the data frame with frequencies has to be expanded to include one row for each unit before using the
resampling routine (e.g. using function untable
from package reshape or function expand.table
from package epitools.
In survey situations, weights are used to generate a more representative picture of the population:
an observation's weight is typically the number of units of the population that this single observed unit represents.
In this situation, there is no reason to consider observations with higher weights as less variable than observations
with lower weights; thus, while estimates can again be obtained treating the weights in the same way as mentioned before,
their usage in estimation of standard errors and in bootstrapping is different.
In order to appropriately accomodate survey weights for these purposes, it is not sufficient to only provide the
weights vector; instead, it is necessary to provide a design generated with package survey or an object of
class svyglm
(produced by function svyglm
) that includes the appropriate design information.
Clusters are a way to take care of dependency structures like in longitudinal data.
Thus, while relaimpo does not (currently) cover linear mixed models (e.g. produced by function lme
), it is possible to
accomodate clustered data by applying function svyglm
with linear link function and gaussian distribution
to a design that contains clusters. The bootstrapping approach subsequently takes care of the dependence by considering
clusters as sampling units. Users who want to use this approach can mimic the second example below.
If the design
option is used (experimental), resampling is done within package survey,
and the resampled contributions are also calculated within package survey. The results from these calculations
are plugged into an object from package boot, and confidence interval calculation is subsequently handled in
boot. The approach is considered experimental: so far no simulation studies have been conducted for
complex survey designs, and because of limited experience (in spite of thorough testing) it is not unlikely that
bugs will be found by users who are routinely using complex survey designs.
The output provides results for all selected relative importance metrics.
The output object can be printed and plotted (description of syntax: classesmethods.relaimpo
).
Printed output: In addition to the standard output of calc.relimp
(one row for each regressor, one column for each bootstrapped metric),
there is a table of confidence intervals for each selected metric
(one row per combination of regressor and metric).
This table is enhanced by information on rank confidence intervals,
if ranks have been bootstrapped (rank=TRUE
) and norank=FALSE
.
In addition, if differences have been bootstrapped (diff=TRUE
) and nodiff=FALSE
,
there is a table of estimated pairwise differences with confidence intervals.
Graphical output: Application of the plot method to the object created by booteval.relimp
yields barplot representations for all bootstrapped metrics (all in one graphics window). Confidence level (lev=
)
and number of characters in variable names to be used (names.abbrev=
) can be modified.
Confidence bounds are indicated on the graphs by added vertical lines.
par()
options can be used for modifying output (exceptions: mfrow
, oma
and
mar
are overridden by the plot method).
Value
The value of boot.relimp
is of class relimplmboot
. It is designed to be useful as input for booteval.relimp
and is not further described here.
booteval.relimp
returns an object of class relimplmbooteval
, the items of which can be accessed by
the $
or the @
extractors.
In addition to the items described for function calc.relimp
, which are also available here,
the following items may be of interest for further calculations:
metric.lower |
matrix of lower confidence bounds for “metric”: one row for each confidence level,
one column for each element of “metric”. “metric” can be any of |
metric.upper |
matrix of upper confidence bounds for “metric”: one row for each confidence level, one column for each element of “metric” |
metric.boot |
matrix of bootstrap results for “metric”: one row for each bootstrap run,
one column for each element of “metric”.
Here, “metric” can be chosen as any of the above-mentioned and also as |
nboot |
number of bootstrap runs underlying the evaluations |
level |
confidence levels |
Warning
The bootstrap confidence intervals should be used for exploratory purposes only. They can be somewhat liberal: Limited simulations for percentile intervals have shown that non-coverage probabilities can be up to twice the nominal probabilities. More investigations are needed.
Be aware that the method itself needs some computing time in case of many regressors. Hence, bootstrapping should be used with awareness of computing time issues.
relaimpo
is a package for univariate linear models.
Using relaimpo
on objects that inherit from class lm
but are not univariate linear model objects
may produce nonsensical results without warning. Objects of class mlm
or glm
with link functions other than identity
or family other than gaussian lead to an error message.
Note
There are two versions of this package. The version on CRAN is globally licensed under GPL version 2 (or later).
There is an extended version with the interesting additional metric pmvd
that is licensed according to GPL version 2
under the geographical restriction "outside of the US" because of potential issues with US patent 6,640,204. This version can be obtained
from Ulrike Groempings website (cf. references section). Whenever you load the package, a display tells you, which version you are loading.
Author(s)
Ulrike Groemping, BHT Berlin
References
Chevan, A. and Sutherland, M. (1991) Hierarchical Partitioning. The American Statistician 45, 90–96.
Darlington, R.B. (1968) Multiple regression in psychological research and practice. Psychological Bulletin 69, 161–182.
Feldman, B. (2005) Relative Importance and Value. Manuscript (Version 1.1, March 19 2005), downloadable at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2255827
Fox, J. (2002) Bootstrapping regression models. An R and S-PLUS Companion to Applied Regression: A web appendix to the book. https://socialsciences.mcmaster.ca/jfox/Books/Companion-2E/appendix/Appendix-Bootstrapping.pdf.
Genizi, A. (1993) Decomposition of R2 in multiple regression with correlated regressors. Statistica Sinica 3, 407–420. Downloadable at https://www3.stat.sinica.edu.tw/statistica/password.asp?vol=3&num=2&art=10
Groemping, U. (2006) Relative Importance for Linear Regression in R: The Package relaimpo Journal of Statistical Software 17, Issue 1. Downloadable at https://www.jstatsoft.org/v17/i01
Lindeman, R.H., Merenda, P.F. and Gold, R.Z. (1980) Introduction to Bivariate and Multivariate Analysis, Glenview IL: Scott, Foresman.
Zuber, V. and Strimmer, K. (2010) Variable importance and model selection by decorrelation. Preprint, downloadable at https://arxiv.org/abs/1007.5516
Go to https://prof.bht-berlin.de/groemping/relaimpo/ for further information and references.
See Also
relaimpo, calc.relimp
, mianalyze.relimp
,
classesmethods.relaimpo
Examples
#####################################################################
### Example: relative importance of various socioeconomic indicators
### for Fertility in Switzerland
### Fertility is first column of data set swiss
#####################################################################
data(swiss)
# bootstrapping
bootswiss <- boot.relimp(swiss, b = 100,
type = c("lmg", "last", "first", "pratt"),
rank = TRUE, diff = TRUE, rela = TRUE)
# for demonstration purposes only 100 bootstrap replications
#alternatively, use formula interface
bootsub <- boot.relimp(Fertility~Education+Catholic+Infant.Mortality, swiss,
subset=Catholic>40, b = 100, type = c("lmg", "last", "first", "pratt"),
rank = TRUE, diff = TRUE)
# for demonstration purposes only 100 bootstrap replications
#default output (percentily intervals, as of Version 2 of the package)
booteval.relimp(bootswiss)
plot(booteval.relimp(bootswiss))
#sorted printout, chosen confidence levels, chosen interval method
#store as object
result <- booteval.relimp(bootsub, bty="bca",
sort = TRUE, level=c(0.8,0.9))
#because of only 100 bootstrap replications,
#default bca intervals produce warnings
#output driven by print method
result
#result plotting with default settings
#(largest confidence level, names abbreviated to length 4)
plot(result)
#result plotting with modified settings (chosen confidence level,
#names abbreviated to chosen length)
plot(result, level=0.8,names.abbrev=5)
#result plotting with longer names shown vertically
par(las=2)
plot(result, level=0.9,names.abbrev=6)
#plot does react to options set with par()
#exceptions: mfrow, mar and oma are set within the plot routine itself
#####################################################################
### Example: bootstrapping clustered data
### data taken from example.lmm of package lmm
### y is change in pulse (heart beats per minute)
### 15 min (occ 1 to 3) and 90 min (occ 4 to 6) after
### treatment with Placebo (occ 1 or 4) low (occ 2 or 5)
### or high (occ 3 or 6) dose of marihuana
### each of 9 subjects is observed under most or all
### of the 6 possible conditions
#####################################################################
## create example data from package lmm
y <- c(16,20,16,20,-6,-4,
12,24,12,-6,4,-8,
8,8,26,-4,4,8,
20,8,20,-4,
8,4,-8,22,-8,
10,20,28,-20,-4,-4,
4,28,24,12,8,18,
-8,20,24,-3,8,-24,
20,24,8,12)
occ <- c(1,2,3,4,5,6,
1,2,3,4,5,6,
1,2,3,4,5,6,
1,2,5,6,
1,2,3,5,6,
1,2,3,4,5,6,
1,2,3,4,5,6,
1,2,3,4,5,6,
2,3,4,5)
subj <- c(1,1,1,1,1,1,
2,2,2,2,2,2,
3,3,3,3,3,3,
4,4,4,4,
5,5,5,5,5,
6,6,6,6,6,6,
7,7,7,7,7,7,
8,8,8,8,8,8,
9,9,9,9)
### manual creation of dummies
### reference category placebo after 90min (occ=4)
dumm1 <- as.numeric(occ==1)
dumm2 <- as.numeric(occ==2)
dumm3 <- as.numeric(occ==3)
dumm5 <- as.numeric(occ==5)
dumm6 <- as.numeric(occ==6)
## create data frame
dat <- data.frame(y,dumm1,dumm2,dumm3,dumm5,dumm6,subj)
### create design with clusters
des <- svydesign(id=~subj,data=dat)
### apply bootstrapping
### using the design with subjects as clusters implies that
### clusters are generally kept in or excluded as a whole
### of course, b=100 is too small, only chosen for speed of package creation
bt <- boot.relimp(y~dumm1+dumm2+dumm3+dumm5+dumm6,data=dat,
design=des,b=100,type=c("lmg","first","last","betasq","pratt"))
### calculate and display results
booteval.relimp(bt,lev=0.9,nodiff=TRUE)