lmInfl {reverseR}    R Documentation
Checks and analyzes leave-one-out (LOO) p-values in linear regression
Description
This function calculates leave-one-out (LOO) p-values for all data points and identifies those resulting in "significance reversal", i.e. in the p-value of the model's slope traversing the user-defined \alpha-level.
Usage
lmInfl(model, alpha = 0.05, method = c("pearson", "spearman"), verbose = TRUE, ...)
Arguments
model: the linear model of class lm.
alpha: the \alpha-level to use as the threshold for significance reversal.
method: select either parametric ("pearson") or rank-based ("spearman") statistics.
verbose: logical. If TRUE, results are displayed on the console.
...: other arguments to lm.
Details
The algorithm
1) calculates the p-value of the full model (all points),
2) calculates a LOO p-value for each point removed,
3) checks for significance reversal in all data points, and
4) returns all models as well as classical influence.measures with LOO p-values, \Delta p-values, slopes and standard errors attached, as sketched below.
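A minimal sketch of this LOO scheme for a simple (unweighted) lm fit is the following; the helper looP is a hypothetical name and this is not the reverseR implementation:

## Hypothetical LOO p-value helper: refit the model n times, each time
## with one observation removed, and compare slope p-values against alpha.
looP <- function(model, alpha = 0.05) {
  data  <- model.frame(model)
  pFull <- summary(model)$coefficients[2, 4]      # slope p-value, all points
  pLOO  <- sapply(seq_len(nrow(data)), function(i) {
    fit <- update(model, data = data[-i, ])       # leave out point i
    summary(fit)$coefficients[2, 4]               # slope LOO p-value
  })
  sel <- which((pFull < alpha) != (pLOO < alpha)) # significance reversers
  list(pFull = pFull, pLOO = pLOO, sel = sel)
}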
If method = "spearman"
, p-values are based on Spearman Rank correlation, and the values given in the last column of the result matrix are Spearman's \rho
.
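As an illustration of the rank-based variant (made-up data, not the package internals), LOO Spearman p-values can be obtained with cor.test():

## Sketch of rank-based LOO p-values via cor.test(); x and y are made-up.
set.seed(1)
x <- 1:20
y <- 5 + 0.08 * x + rnorm(20)
pLOO <- sapply(seq_along(x), function(i)
  cor.test(x[-i], y[-i], method = "spearman")$p.value)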
The idea of p-value influencers was first introduced by Belsley, Kuh & Welsch, and described as an influence measure pertaining directly to the change in t-statistics that will "show whether the conclusions of hypothesis testing would be affected", termed dfstat in [1, 2, 3] or dfstud in [4]:
\rm{dfstat}_{ij} \equiv \frac{\hat{\beta}_j}{s\sqrt{(X'X)^{-1}_{jj}}}-\frac{\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X'_{(i)}X_{(i)})^{-1}_{jj}}}
where \hat{\beta}_j is the j-th estimate, s is the residual standard error, X is the design matrix, and (i) denotes deletion of the i-th observation.
dfstat, which for the regression's slope \beta_1 is the difference of t-statistics

\Delta t = t_{\beta_1} - t_{\beta_1(i)} = \frac{\beta_1}{\rm{s.e.}(\beta_1)} - \frac{\beta_{1(i)}}{\rm{s.e.}(\beta_{1(i)})}

is inextricably linked to the change in p-value, \Delta p, calculated from

\Delta p = p_{\beta_1} - p_{\beta_1(i)} = 2\left(1 - P_t(t_{\beta_1}, \nu)\right) - 2\left(1 - P_t(t_{\beta_1(i)}, \nu - 1)\right)

where P_t is the Student's t cumulative distribution function with \nu degrees of freedom, and where significance reversal is attained when \alpha \in [p_{\beta_1}, p_{\beta_1(i)}].
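For illustration, \Delta p can be evaluated directly with base R's pt(); the t-statistics and degrees of freedom below are assumed values, not package output:

## Worked illustration of the \Delta p formula with pt().
t1  <- 2.0                          # t-statistic of the slope, all points (assumed)
t1i <- 1.5                          # t-statistic with point i removed (assumed)
nu  <- 18                           # residual degrees of freedom of the full model
p1  <- 2 * (1 - pt(t1, nu))         # p-value, full model
p1i <- 2 * (1 - pt(t1i, nu - 1))    # LOO p-value (one degree of freedom less)
deltaP <- p1 - p1i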
Interestingly, in linear regression the seemingly mandatory check of the influence of single data points on statistical inference has largely fallen into oblivion: apart from [1-4], there is, to the best of our knowledge, no reference to dfstat or \Delta p in the current literature on influence measures.
The influence output also includes the more recent Hadi's measure (column "hadi"):
H_i = \frac{p_{ii}}{1 - p_{ii}} + \frac{k}{1 - p_{ii}}\frac{d_i^2}{(1-d_i^2)}
where p_{ii} are the diagonal elements of the hat matrix (the leverages), k = 2 in univariate linear regression, and d_i = e_i/\sqrt{\rm{SSE}}.
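A minimal sketch transcribing this formula for a univariate lm fit follows; hadiMeasure is a hypothetical name and not part of reverseR:

## Sketch of Hadi's measure, following the formula above with k = 2.
hadiMeasure <- function(model) {
  pii <- hatvalues(model)           # leverages p_ii
  e   <- residuals(model)
  di2 <- e^2 / sum(e^2)             # d_i^2 = e_i^2 / SSE
  k   <- 2
  pii / (1 - pii) + (k / (1 - pii)) * di2 / (1 - di2)
}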
Value
A list with the following items:
origModel: the original model with all data points.
finalModels: a list of final models with the influencer(s) removed.
infl: a matrix with the original data, classical influence.measures, and the LOO p-values, \Delta p-values, slopes and standard errors.
sel: a vector with the influencers' indices.
alpha: the selected \alpha-level.
origP: the original model's p-value.
stab: the stability measure, see stability.
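A short sketch of inspecting these items on made-up data, assuming reverseR is attached:

## Inspecting the returned list (made-up data).
library(reverseR)
set.seed(123)
x <- 1:20
y <- 5 + 0.08 * x + rnorm(20)
res <- lmInfl(lm(y ~ x))
res$origP        # p-value of the full model
res$sel          # indices of significance reversers
head(res$infl)   # per-point LOO statistics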
Author(s)
Andrej-Nikolai Spiess
References
For dfstat / dfstud:
1. Regression diagnostics: Identifying influential data and sources of collinearity.
Belsley DA, Kuh E, Welsch RE.
John Wiley, New York, USA (2004).
2. Econometrics, 5ed.
Baltagi B.
Springer-Verlag Berlin, Germany (2011).
3. Growth regressions and what the textbooks don't tell you.
Temple J.
Bull Econom Res, 52, 2000, 181-205.
4. Robust Regression and Outlier Detection.
Rousseeuw PJ & Leroy AM.
John Wiley & Sons, New York, NY (1987).
Hadi's measure:
A new measure of overall potential influence in linear regression.
Hadi AS.
Comp Stat & Data Anal, 14, 1992, 1-27.
Examples
## Example #1 with single influencers and insignificant model (p = 0.115).
## Removal of #18 results in p = 0.0227!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM1 <- lm(b ~ a)
res1 <- lmInfl(LM1)
lmPlot(res1)
pvalPlot(res1)
inflPlot(res1)
slsePlot(res1)
stability(res1)
## Example #2 with multiple influencers and significant model (p = 0.0269).
## Removal of #2, #17, #18 or #20 results in crossing p = 0.05!
set.seed(125)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM2 <- lm(b ~ a)
res2 <- lmInfl(LM2)
lmPlot(res2)
pvalPlot(res2)
inflPlot(res2)
slsePlot(res2)
stability(res2)
## Large Example #3 with top 10 influencers and significant model (p = 6.72E-8).
## Not possible to achieve a crossing of alpha with any point despite strong noise.
set.seed(123)
a <- 1:100
b <- 5 + 0.08 * a + rnorm(100, 0, 5)
LM3 <- lm(b ~ a)
res3 <- lmInfl(LM3)
lmPlot(res3)
stability(res3)
## Example #4 with replicates and single influencer (p = 0.114).
## Removal of #58 results in p = 0.039.
set.seed(123)
a <- rep(1:20, each = 3)
b <- 5 + 0.08 * a + rnorm(20, 0, 2) # the 20 noise values are recycled over the 60 points
LM4 <- lm(b ~ a)
res4 <- lmInfl(LM4)
lmPlot(res4)
pvalPlot(res4)
inflPlot(res4)
slsePlot(res4)
stability(res4)
## As Example #1, but with weights.
## Removal of #18 results in p = 0.04747.
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM5 <- lm(b ~ a, weights = 1:20)
res5 <- lmInfl(LM5)
lmPlot(res5)
stability(res5)