lmInfl {reverseR} | R Documentation |
Checks and analyzes leave-one-out (LOO) p-values in linear regression
Description
This function calculates leave-one-out (LOO) p-values for all data points and identifies those resulting in "significance reversal", i.e. in the p-value of the model's slope traversing the user-defined α-level.
Usage
lmInfl(model, alpha = 0.05, method = c("pearson", "spearman"), verbose = TRUE, ...)
Arguments
model |
the linear model of class lm. |
alpha |
the α-level to use as the significance threshold. |
method |
select either parametric ("pearson") or rank-based ("spearman") statistics. |
verbose |
logical. If |
... |
other arguments to |
Details
The algorithm
1) calculates the p-value of the full model (all points),
2) calculates a LOO-p-value for each point removed,
3) checks for significance reversal in all data points and
4) returns all models as well as classical influence.measures
with LOO-p-values, p-values, slopes and standard errors attached.
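The four steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the helper slopeP and all variable names are hypothetical, and treating p <= alpha as "significant" is an assumption about how the threshold is handled.

```r
# Sketch of LOO p-values for the slope of a simple linear regression.
# Data as in Example #1 of this page.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
dat <- data.frame(a = a, b = b)
alpha <- 0.05

# Hypothetical helper: p-value of the slope for a given data set.
slopeP <- function(d) {
  summary(lm(b ~ a, data = d))$coefficients["a", "Pr(>|t|)"]
}

# 1) p-value of the full model
origP <- slopeP(dat)

# 2) one LOO p-value per removed point
looP <- sapply(seq_len(nrow(dat)), function(i) slopeP(dat[-i, ]))

# 3) significance reversal: full and LOO p-values on opposite sides of alpha
#    (assumption: p <= alpha counts as significant)
reversal <- (origP <= alpha) != (looP <= alpha)

# 4) indices of the points whose removal reverses significance
sel <- which(reversal)
```

Here sel plays the role of the influencer indices; the classical measures mentioned in step 4 are available separately through stats::influence.measures.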
If method = "spearman", p-values are based on Spearman rank correlation, and the values given in the last column of the result matrix are Spearman's ρ.
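A rank-based variant of the LOO calculation might look like the sketch below. It assumes that the p-values and ρ come from stats::cor.test with method = "spearman"; whether lmInfl uses this call internally is not stated here, and the variable names are hypothetical.

```r
# Sketch of LOO p-values and Spearman's rho, based on stats::cor.test.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)

# Full-data rank correlation test
full <- cor.test(a, b, method = "spearman")
origP <- full$p.value
origRho <- unname(full$estimate)

# Leave each point out in turn and repeat the test
looP <- sapply(seq_along(a), function(i) {
  cor.test(a[-i], b[-i], method = "spearman")$p.value
})
looRho <- sapply(seq_along(a), function(i) {
  unname(cor.test(a[-i], b[-i], method = "spearman")$estimate)
})
```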
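The idea of p-value influencers was first introduced by Belsley, Kuh & Welsch and described as an influence measure pertaining directly to the change in t-statistics that will "show whether the conclusions of hypothesis testing would be affected", termed dfstat in [1, 2, 3] or dfstud in [4]:
$$\mathrm{dfstat}_{ij} \equiv \frac{\hat{\beta}_j}{s\sqrt{(X'X)^{-1}_{jj}}} - \frac{\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X'_{(i)}X_{(i)})^{-1}_{jj}}}$$
where $\hat{\beta}_j$ is the j-th estimate, s is the residual standard error, X is the design matrix and (i) denotes the i-th observation deleted.
dfstat, which for the regression's slope $\beta_1$ is the difference of t-statistics
$$\Delta t = t_{\beta_1} - t_{\beta_1(i)} = \frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)} - \frac{\hat{\beta}_{1(i)}}{\mathrm{s.e.}(\hat{\beta}_{1(i)})}$$
is inextricably linked to the change in p-value $\Delta p$, calculated from
$$\Delta p = p_{\beta_1} - p_{\beta_1(i)} = 2\left(1 - P_t(|t_{\beta_1}|, \nu)\right) - 2\left(1 - P_t(|t_{\beta_1(i)}|, \nu - 1)\right)$$
where $P_t$ is the Student's t cumulative distribution function with $\nu$ degrees of freedom, and where significance reversal is attained when $\alpha$ lies between $p_{\beta_1}$ and $p_{\beta_1(i)}$.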
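As an illustration of the link between the change in t-statistics and the change in p-value, the sketch below computes both for the slope by hand in base R. The helper slopeStats and the names deltaT and deltaP are hypothetical; the two-sided p-value is taken as 2 * (1 - pt(|t|, df)), with the residual degrees of freedom dropping by one when a point is removed.

```r
# Sketch: change in slope t-statistic and p-value under leave-one-out.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)

# Hypothetical helper: t-statistic, two-sided p-value and residual
# degrees of freedom for the slope of y ~ x.
slopeStats <- function(x, y) {
  fit <- lm(y ~ x)
  cf <- summary(fit)$coefficients
  tval <- cf["x", "Estimate"] / cf["x", "Std. Error"]
  nu <- df.residual(fit)
  pval <- 2 * (1 - pt(abs(tval), nu))
  c(t = tval, p = pval, nu = nu)
}

full <- slopeStats(a, b)
loo <- t(sapply(seq_along(a), function(i) slopeStats(a[-i], b[-i])))

# Differences between the full model and each LOO model
deltaT <- unname(full["t"]) - loo[, "t"]
deltaP <- unname(full["p"]) - loo[, "p"]
```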
Interestingly, the seemingly mandatory check of how single data points influence statistical inference in linear regression has largely been forgotten: apart from [1-4], there is, to the best of our knowledge, no reference to dfstat or to the change in p-value (Δp) in the current literature on influence measures.
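The influence output also includes the more recent Hadi's measure (column "hadi"):
$$H_i = \frac{p_{ii}}{1 - p_{ii}} + \frac{k}{1 - p_{ii}} \cdot \frac{d_i^2}{1 - d_i^2}$$
where $p_{ii}$ are the diagonals of the hat matrix (leverages), $k = 2$ in univariate linear regression and $d_i = e_i / \sqrt{\mathrm{SSE}}$.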
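Hadi's measure can be computed from a fitted model with a few lines of base R. The sketch below assumes the commonly cited form of the measure, with leverages p_ii, residuals normalized by the square root of the residual sum of squares, and k equal to the number of coefficients; the variable names are hypothetical and the result is not guaranteed to match the "hadi" column digit for digit.

```r
# Sketch of Hadi's measure for a simple linear regression.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
fit <- lm(b ~ a)

p_ii <- hatvalues(fit)       # diagonals of the hat matrix (leverages)
e <- residuals(fit)          # ordinary residuals
d <- e / sqrt(sum(e^2))      # residuals normalized by sqrt(SSE)
k <- length(coef(fit))       # 2 here: intercept and slope

# Leverage term plus scaled residual term
hadi <- p_ii / (1 - p_ii) + k / (1 - p_ii) * d^2 / (1 - d^2)
```

Large values flag points that are unusual in leverage, in residual, or in both.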
Value
A list with the following items:
origModel |
the original model with all data points. |
finalModels |
a list of final models with the influencer(s) removed. |
infl |
a matrix with the original data, classical |
sel |
a vector with the influencers' indices. |
alpha |
the selected α-level. |
origP |
the original model's p-value. |
stab |
the stability measure, see stability. |
Author(s)
Andrej-Nikolai Spiess
References
For dfstat / dfstud :
1. Regression diagnostics: Identifying influential data and sources of collinearity.
Belsley DA, Kuh E, Welsch RE.
John Wiley, New York, USA (2004).
2. Econometrics, 5ed.
Baltagi B.
Springer-Verlag Berlin, Germany (2011).
3. Growth regressions and what the textbooks don't tell you.
Temple J.
Bull Econom Res, 52, 2000, 181-205.
4. Robust Regression and Outlier Detection.
Rousseeuw PJ & Leroy AM.
John Wiley & Sons, New York, NY (1987).
Hadi's measure:
A new measure of overall potential influence in linear regression.
Hadi AS.
Comp Stat & Data Anal, 14, 1992, 1-27.
Examples
## Example #1 with single influencers and insignificant model (p = 0.115).
## Removal of #18 results in p = 0.0227!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM1 <- lm(b ~ a)
res1 <- lmInfl(LM1)
lmPlot(res1)
pvalPlot(res1)
inflPlot(res1)
slsePlot(res1)
stability(res1)
## Example #2 with multiple influencers and significant model (p = 0.0269).
## Removal of #2, #17, #18 or #20 results in crossing p = 0.05!
set.seed(125)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM2 <- lm(b ~ a)
res2 <- lmInfl(LM2)
lmPlot(res2)
pvalPlot(res2)
inflPlot(res2)
slsePlot(res2)
stability(res2)
## Large Example #3 with top 10 influencers and significant model (p = 6.72E-8).
## Not possible to achieve a crossing of alpha with any point despite strong noise.
set.seed(123)
a <- 1:100
b <- 5 + 0.08 * a + rnorm(100, 0, 5)
LM3 <- lm(b ~ a)
res3 <- lmInfl(LM3)
lmPlot(res3)
stability(res3)
## Example #4 with replicates and single influencer (p = 0.114).
## Removal of #58 results in p = 0.039.
set.seed(123)
a <- rep(1:20, each = 3)
## Note: the 20 noise values are recycled across the 60 observations.
b <- 5 + 0.08 * a + rnorm(20, 0, 2)
LM4 <- lm(b ~ a)
res4 <- lmInfl(LM4)
lmPlot(res4)
pvalPlot(res4)
inflPlot(res4)
slsePlot(res4)
stability(res4)
## As Example #1, but with weights.
## Removal of #18 results in p = 0.04747.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM5 <- lm(b ~ a, weights = 1:20)
res5 <- lmInfl(LM5)
lmPlot(res5)
stability(res5)