mvinfluence {mvinfluence}    R Documentation

Influence Measures and Diagnostic Plots for Multivariate Linear Models


Functions in this package compute regression deletion diagnostics for multivariate linear models following methods proposed by Barrett & Ling (1992) and provide some associated diagnostic plots.


The design goal for this package is that, as an extension of standard methods for univariate linear models, you should be able to fit a linear model with a multivariate response,

  mymlm <- lm( cbind(y1, y2, y3) ~ x1 + x2 + x3, data=mydata)

and then get useful diagnostics and plots with

  influencePlot(mymlm, ...)  

The diagnostic measures include hat-values (leverages), generalized Cook's distance and generalized squared 'studentized' residuals. Several types of plots to detect influential observations are provided.

In addition, the functions provide diagnostics for deletion of subsets of observations of size m>1. This case is theoretically interesting because pairs (m=2) of influential observations can sometimes mask each other and can sometimes have joint influence far exceeding their individual effects, among other phenomena described by Lawrence (1995). Associated methods for the case m>1 are still under development in this package.

The main function in the package is the S3 method, influence.mlm, a simple wrapper for mlm.influence, which does the actual computations. This design was dictated by that used in the stats package, which provides the generic method influence and methods influence.lm and influence.glm. The car package extends this to include influence.lme for models fit by lme.

The following sections describe the notation and measures used in the calculations.


Let $\mathbf{X}$ be the model matrix in the multivariate linear model, $\mathbf{Y}_{n \times p} = \mathbf{X}_{n \times r} \mathbf{\beta}_{r \times p} + \mathbf{E}_{n \times p}$. The usual least squares estimate of $\mathbf{\beta}$ is given by $\mathbf{B} = (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T} \mathbf{Y}$.
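As a quick check of the least squares formula, the following sketch computes B from the normal equations and compares it with the coefficients returned by lm(). The iris data and the particular responses and predictors are illustrative assumptions, not something taken from the package.

```r
# Illustrative model on the built-in iris data (an assumed example, not from the package)
fit <- lm(cbind(Sepal.Length, Sepal.Width) ~ Petal.Length + Petal.Width,
          data = iris)

X <- model.matrix(fit)                            # n x r model matrix (with intercept)
Y <- as.matrix(model.response(model.frame(fit)))  # n x p response matrix

# B = (X'X)^{-1} X'Y, obtained by solving the normal equations
B <- solve(crossprod(X), crossprod(X, Y))

# TRUE when the hand-computed B matches the coefficients from lm()
agree <- isTRUE(all.equal(B, coef(fit), check.attributes = FALSE))
```

Solving the normal equations with solve(A, b) avoids forming the explicit inverse, although lm() itself uses a QR decomposition.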

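Then let

  $\mathbf{X}_I$ be the submatrix of $\mathbf{X}$ whose $m$ rows are indexed by $I$,

  $\mathbf{X}_{(I)}$ be the complement, the submatrix of $\mathbf{X}$ with the $m$ rows in $I$ deleted.

Matrices $\mathbf{Y}_I$, $\mathbf{Y}_{(I)}$ are defined similarly.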

In the calculation of regression coefficients, $\mathbf{B}_{(I)} = (\mathbf{X}_{(I)}^{T} \mathbf{X}_{(I)})^{-1} \mathbf{X}_{(I)}^{T} \mathbf{Y}_{(I)}$ are the estimated coefficients when the cases indexed by $I$ have been removed. The corresponding residuals are $\mathbf{E}_{(I)} = \mathbf{Y}_{(I)} - \mathbf{X}_{(I)} \mathbf{B}_{(I)}$.
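The deletion quantities can be sketched in a few lines of base R. The index set I below is a hypothetical choice, and the iris model is again only an illustration; the refit on the reduced data serves as a cross-check.

```r
# Illustrative model (assumed example, not from the package)
form <- cbind(Sepal.Length, Sepal.Width) ~ Petal.Length + Petal.Width
fit  <- lm(form, data = iris)

X <- model.matrix(fit)
Y <- as.matrix(model.response(model.frame(fit)))

I <- c(1, 2)                        # hypothetical index set of deleted cases, m = 2
X_del <- X[-I, , drop = FALSE]      # X_(I): rows in I removed
Y_del <- Y[-I, , drop = FALSE]      # Y_(I): rows in I removed

# Coefficients and residuals with the cases in I deleted
B_del <- solve(crossprod(X_del), crossprod(X_del, Y_del))
E_del <- Y_del - X_del %*% B_del

# Cross-check against an ordinary refit on the reduced data
refit <- lm(form, data = iris[-I, ])
same_coef <- isTRUE(all.equal(B_del, coef(refit), check.attributes = FALSE))
```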


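The influence measures defined by Barrett & Ling (1992) are functions of two matrices $\mathbf{H}_I$ and $\mathbf{Q}_I$ defined as follows:

  For the full data set, the "hat matrix" $\mathbf{H}$ is given by $\mathbf{H} = \mathbf{X} (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T}$.

  $\mathbf{H}_I$ is the $m \times m$ submatrix of $\mathbf{H}$ corresponding to the index set $I$, $\mathbf{H}_I = \mathbf{X}_I (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}_I^{T}$.

  $\mathbf{Q}$ is the analog of $\mathbf{H}$ defined for the residual matrix $\mathbf{E}$, that is, $\mathbf{Q} = \mathbf{E} (\mathbf{E}^{T} \mathbf{E})^{-1} \mathbf{E}^{T}$, with corresponding submatrix $\mathbf{Q}_I = \mathbf{E}_I (\mathbf{E}^{T} \mathbf{E})^{-1} \mathbf{E}_I^{T}$.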
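A minimal sketch of the two matrices for one index set, assuming the forms H_I = X_I (X'X)^{-1} X_I' and Q_I = E_I (E'E)^{-1} E_I', where X_I and E_I hold the rows of the model matrix and of the residual matrix indexed by I. The index set and the iris model are illustrative assumptions.

```r
# Illustrative model (assumed example, not from the package)
fit <- lm(cbind(Sepal.Length, Sepal.Width) ~ Petal.Length + Petal.Width,
          data = iris)

X <- model.matrix(fit)   # n x r model matrix
E <- residuals(fit)      # n x p residual matrix

I  <- c(1, 2)            # hypothetical index set, m = 2
XI <- X[I, , drop = FALSE]
EI <- E[I, , drop = FALSE]

# m x m blocks of the hat matrix and of its residual analog
H_I <- XI %*% solve(crossprod(X)) %*% t(XI)
Q_I <- EI %*% solve(crossprod(E)) %*% t(EI)
```

The diagonal of H_I holds the ordinary leverages of the cases in I, so for m = 1 it reduces to the familiar hat-value.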

Cook's distance

In these terms, Cook's distance is defined for a univariate response by

$$D_I = (\mathbf{b} - \mathbf{b}_{(I)})^T (\mathbf{X}^T \mathbf{X}) (\mathbf{b} - \mathbf{b}_{(I)}) / (p s^2) \; ,$$

a measure of the squared distance between the coefficients $\mathbf{b}$ for the full data set and those $\mathbf{b}_{(I)}$ obtained when the cases in $I$ are deleted.
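For a single deleted case, the univariate definition can be checked by brute force against cooks.distance() from the stats package. This is a sketch under the assumption that p counts the coefficients in the model (written k in the code) and that s^2 is the usual residual variance; the model is illustrative.

```r
# Illustrative univariate model (assumed example, not from the package)
form <- Sepal.Length ~ Petal.Length + Petal.Width
fit1 <- lm(form, data = iris)

X  <- model.matrix(fit1)
k  <- ncol(X)                   # assumed meaning of p: number of coefficients
s2 <- summary(fit1)$sigma^2     # residual variance s^2

# Cook's distance for deleting case i, straight from the definition
cook_one <- function(i) {
  d <- coef(fit1) - coef(lm(form, data = iris[-i, ]))   # b - b_(i)
  drop(t(d) %*% crossprod(X) %*% d) / (k * s2)
}

D_manual  <- sapply(1:5, cook_one)
D_builtin <- unname(cooks.distance(fit1)[1:5])
same <- isTRUE(all.equal(D_manual, D_builtin))
```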

In the multivariate case, Cook's distance is obtained by replacing the vector of coefficients $\mathbf{b}$ by $\mathrm{vec} (\mathbf{B})$, the result of stringing out the coefficients for all responses in a single $r \times p$-length vector.

$$D_I = \frac{1}{p} [\mathrm{vec} (\mathbf{B} - \mathbf{B}_{(I)})]^T (\mathbf{S}^{-1} \otimes \mathbf{X}^T \mathbf{X}) \mathrm{vec} (\mathbf{B} - \mathbf{B}_{(I)}) \; ,$$

where $\otimes$ is the Kronecker (direct) product and $\mathbf{S} = \mathbf{E}^T \mathbf{E} / (n-p)$ is the covariance matrix of the residuals.
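The multivariate formula can be evaluated directly by refitting without the cases in I. The sketch below is not the package's implementation; it assumes that p in the divisor and in n - p is the number of columns of the model matrix (written k in the code), and the index set is hypothetical.

```r
# Illustrative model (assumed example, not from the package)
form <- cbind(Sepal.Length, Sepal.Width) ~ Petal.Length + Petal.Width
fit  <- lm(form, data = iris)

X <- model.matrix(fit)
E <- residuals(fit)
n <- nrow(X)
k <- ncol(X)                       # assumed meaning of p in the formula

S <- crossprod(E) / (n - k)        # residual covariance matrix
W <- solve(S) %x% crossprod(X)     # Kronecker product of S^{-1} and X'X

# Generalized Cook's distance for deleting the cases in I
cook_subset <- function(I) {
  B_del <- coef(lm(form, data = iris[-I, ]))
  d <- c(coef(fit) - B_del)        # vec(B - B_(I)), stacked column by column
  drop(crossprod(d, W %*% d)) / k
}

D_pair <- cook_subset(c(1, 2))     # hypothetical index set, m = 2
```

Because c() stacks a matrix column by column, the coefficients for each response stay together, which matches the block structure of the Kronecker product.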

Leverage and residual components

For a univariate response, and when m = 1, Cook's distance can be re-written as a product of leverage and residual components as

$$D_i = \left(\frac{n-p}{p} \right) \frac{h_{ii} q_{ii}}{(1 - h_{ii})^2} \;.$$

Then we can define a leverage component $L_i$ and residual component $R_i$ as

$$L_i = \frac{h_{ii}}{1 - h_{ii}} \quad\quad R_i = \frac{q_{ii}}{1 - h_{ii}} \;.$$

$R_i$ is the studentized residual, and $D_i \propto L_i \times R_i$.
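The product form D_i = ((n-p)/p) L_i R_i can be verified numerically for a univariate model. In this sketch p is assumed to be the number of coefficients (k in the code), and q_ii reduces to the squared residual divided by the residual sum of squares; the model is illustrative.

```r
# Illustrative univariate model (assumed example, not from the package)
fit1 <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)

X <- model.matrix(fit1)
e <- residuals(fit1)
n <- nrow(X)
k <- ncol(X)                      # assumed meaning of p: number of coefficients

h <- rowSums(qr.Q(qr(X))^2)       # leverages h_ii from the QR decomposition
q <- e^2 / sum(e^2)               # q_ii for a single response

L <- h / (1 - h)                  # leverage component
R <- q / (1 - h)                  # residual component
D <- (n - k) / k * L * R          # Cook's distance as a product

matches <- isTRUE(all.equal(unname(D), unname(cooks.distance(fit1))))
```

Plotting log(L) against log(R) places cases with equal Cook's distance on lines of slope -1, which is one reason the decomposition is useful for diagnostics.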

In the general, multivariate case there are analogous matrix expressions for $\mathbf{L}$ and $\mathbf{R}$. When m > 1, the quantities $\mathbf{H}_I$, $\mathbf{Q}_I$, $\mathbf{L}_I$, and $\mathbf{R}_I$ are $m \times m$ matrices. Where scalar quantities are needed, the package functions apply a function, FUN, either det() or tr(), to calculate a measure of “size”, as in

  H <- sapply(x$H, FUN)
  Q <- sapply(x$Q, FUN)
  L <- sapply(x$L, FUN)
  R <- sapply(x$R, FUN)
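To see what this reduction does, here is a small stand-alone sketch that builds a list of 2 x 2 matrices H_I by hand and summarizes each by its trace and determinant. The list is constructed here rather than taken from a package object, the subsets are an arbitrary choice, and tr() is defined locally because base R does not provide it.

```r
tr <- function(M) sum(diag(M))     # trace helper defined for this sketch

# Illustrative model (assumed example, not from the package)
fit <- lm(cbind(Sepal.Length, Sepal.Width) ~ Petal.Length + Petal.Width,
          data = iris)
X <- model.matrix(fit)
XtX_inv <- solve(crossprod(X))

subsets <- combn(5, 2)             # all pairs among the first five cases

# One m x m matrix H_I per subset, assuming H_I = X_I (X'X)^{-1} X_I'
H_list <- lapply(seq_len(ncol(subsets)), function(j) {
  XI <- X[subsets[, j], , drop = FALSE]
  XI %*% XtX_inv %*% t(XI)
})

H_tr  <- sapply(H_list, tr)        # sum of the leverages in each subset
H_det <- sapply(H_list, det)       # shrinks toward 0 for near-duplicate cases
```

The trace adds up the individual leverages, while the determinant also reflects how similar the cases in a subset are, so the two choices of FUN can rank subsets differently.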


Barrett, B. E. and Ling, R. F. (1992). General Classes of Influence Measures for Multivariate Regression. Journal of the American Statistical Association, 87(417), 184-191.

Barrett, B. E. (2003). Understanding Influence in Multivariate Regression. Communications in Statistics – Theory and Methods, 32(3), 667-680.

Lawrence, A. J. (1995). Deletion Influence and Masking in Regression. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 181-189.

[Package mvinfluence version 0.9.0 Index]