High Dimensional Discriminant Analysis


Performs Linear Discriminant Analysis in High Dimensional problems based on reliable covariance estimators for problems with (many) more variables than observations. Includes routines for classifier training, prediction, cross-validation and variable selection.


HiDimDA is a package for High-Dimensional Discriminant Analysis aimed at problems with many variables, possibly many more than the number of available observations. Its core consists of the four Linear Discriminant Analysis routines:

Dlda: Diagonal Linear Discriminant Analysis
Slda: Shrunken Linear Discriminant Analysis
Mlda: Maximum-uncertainty Linear Discriminant Analysis
RFlda: Factor-model Linear Discriminant Analysis

and the variable selection routine:

SelectV: High-Dimensional variable selection for supervised classification

that selects variables to be used in a Discriminant classification rule by ranking them according to two-sample t-scores (problems with two groups), or ANOVA F-scores (problems with more than two groups), and discarding those with scores below a threshold defined by the Higher Criticism (HC) approach of Donoho and Jin (2008), the Expanded Higher Criticism scheme proposed by Duarte Silva (2011), False Discovery Rate (Fdr) control as suggested by Benjamini and Hochberg (1995), the FAIR approach of Fan and Fan (2008), or simply by fixing the number of retained variables to some pre-defined constant.
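The simplest of these options, ranking by two-sample t-scores and retaining a fixed number of variables, can be sketched as follows. This is an illustrative Python sketch and not the SelectV implementation: the function names `t_scores` and `select_top` are hypothetical, and the use of a pooled-variance t-statistic is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def t_scores(group_a, group_b):
    """Absolute pooled-variance two-sample t-score of each variable.

    group_a, group_b: lists of observations, each a list of p values.
    Assumes every variable has non-zero pooled variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    scores = []
    for j in range(len(group_a[0])):
        a = [row[j] for row in group_a]
        b = [row[j] for row in group_b]
        pooled = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
        scores.append(abs(mean(a) - mean(b)) / sqrt(pooled * (1 / n_a + 1 / n_b)))
    return scores


def select_top(scores, m):
    """Indices of the m highest-scoring variables, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:m])


# Toy data: only the first variable separates the two groups clearly.
group_a = [[1.0, 5.0, 0.2], [1.2, 4.0, 0.1], [0.9, 6.0, 0.3]]
group_b = [[3.0, 5.5, 0.2], [3.1, 4.5, 0.4], [2.9, 5.0, 0.1]]
scores = t_scores(group_a, group_b)
kept = select_top(scores, 1)  # keeps variable 0
```

The data-driven thresholds (HC, Expanded HC, Fdr, FAIR) would replace the fixed `m` with a cut-off computed from the scores themselves.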

All four discriminant routines, ‘Dlda’, ‘Slda’, ‘Mlda’ and ‘RFlda’, compute Linear Discriminant Functions, by default after a preliminary variable selection step, based on alternative estimators of a within-groups covariance matrix that lead to reliable allocation rules in problems where the number of selected variables is close to, or larger than, the number of available observations.

Consider a Discriminant Analysis problem with $k$ groups, $p$ selected variables, a training sample consisting of $N = \sum_{g=1}^{k} n_g$ observations with group and overall means, $\bar{X}_g$ and $\bar{X}_.$, and a between-groups scatter (scaled by degrees of freedom) matrix, $S_B = \frac{1}{N-k} \sum_{g=1}^{k} n_g (\bar{X}_g - \bar{X}_.)(\bar{X}_g - \bar{X}_.)^T$.

Following the two main classical approaches to Linear Discriminant Analysis, the Discriminant Functions returned by HiDimDA discriminant routines are either based on the canonical linear discriminants given by the normalized eigenvectors

$LD_j = Egvct_j (S_B \hat{\Sigma}_W^{-1})$

$j = 1,...,r = \min(p, k-1)$

$[LD_1, ..., LD_r]^T \hat{\Sigma}_W [LD_1, ..., LD_r] = I_r$

or the classification functions

$CF_g = (\bar{X}_g - \bar{X}_1) \hat{\Sigma}_W^{-1}$

$g = 2,...,k$

where $\hat{\Sigma}_W^{-1}$ is an estimate of the inverse within-groups covariance.

It is well known that these two approaches are equivalent, in the sense that classification rules that assign new observations to the group with the closest (according to the Euclidean distance) centroid in the space of the canonical variates, $Z = [LD_1 ... LD_r]^T X$, give the same results as the rule that assigns a new observation to group 1 if all classification scores, $Clscr_g = CF_g^T X - CF_g^T \frac{(\bar{X}_1 + \bar{X}_g)}{2}$, are negative, and to the group with the highest classification score otherwise.
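The allocation rule based on classification scores can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not HiDimDA code: the function `classify` and its helpers are hypothetical names, the inverse covariance estimate is assumed to be symmetric and supplied as a list of rows, groups are indexed from 0 (so index 0 plays the role of group 1), and a score of exactly zero is treated like a negative one.

```python
def mat_vec(matrix, vec):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def classify(x, means, inv_cov):
    """Return the 0-based index of the group assigned to observation x.

    means[0] is the reference group (group 1 in the text).
    """
    ref = means[0]
    best_group, best_score = 0, 0.0
    for g in range(1, len(means)):
        diff = [a - b for a, b in zip(means[g], ref)]
        cf = mat_vec(inv_cov, diff)  # classification function CF_g
        midpoint = [(a + b) / 2 for a, b in zip(ref, means[g])]
        score = dot(cf, x) - dot(cf, midpoint)  # classification score Clscr_g
        if score > best_score:  # only positive scores can displace group 1
            best_group, best_score = g, score
    return best_group


# Three groups in two dimensions, identity inverse covariance.
means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
near_first = classify([0.5, 0.5], means, identity)   # all scores negative
near_second = classify([3.5, 0.5], means, identity)  # score of second group is highest
near_third = classify([0.5, 3.0], means, identity)   # score of third group is highest
```

With an identity inverse covariance the rule reduces to nearest-centroid allocation, which is what the three toy calls show.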

The discriminant routines of HiDimDA compute canonical linear discriminant functions by default, and classification functions when the argument ‘ldafun’ is set to “classification”. However, unlike traditional linear discriminant analysis where $\Sigma_W^{-1}$ is estimated by the inverse of the sample covariance, which is not well-defined when $p \geq N-k$ and is unreliable if $p$ is close to $N-k$, the routines of HiDimDA use four alternative well-conditioned estimators of $\Sigma_W^{-1}$ that lead to reliable classification rules if $p$ is larger than, or close to, $N-k$.

In particular, ‘Dlda’ estimates $\Sigma_W^{-1}$ by the diagonal matrix of inverse sample variances, ‘Slda’ by the inverse of an optimally shrunken Ledoit and Wolf's (2004) covariance estimate with the targets and optimal target intensity estimators proposed by Fisher and Sun (2011), ‘Mlda’ uses a regularized inverse covariance that deemphasizes the importance given to the last eigenvectors of the sample covariance (see Thomaz, Kitani and Gillies (2006) for details), and ‘RFlda’ uses a factor model estimate of the true inverse correlation (or covariance) matrix based on the approach of Duarte Silva (2011).
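The diagonal estimator is the easiest of the four to illustrate. The Python sketch below is not the ‘Dlda’ implementation: the function names are hypothetical, and it assumes that the sample variances are pooled within groups with divisor N - k. It returns only the diagonal of the estimated inverse covariance, which stays well defined for any number of variables as long as every pooled variance is positive.

```python
def pooled_variances(groups):
    """Pooled within-group variance of each variable (divisor N - k).

    groups: list of k groups, each a list of observations (lists of p values).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    p = len(groups[0][0])
    pooled = []
    for j in range(p):
        ss = 0.0
        for g in groups:
            col = [row[j] for row in g]
            m = sum(col) / len(col)
            ss += sum((v - m) ** 2 for v in col)  # within-group sum of squares
        pooled.append(ss / (n_total - k))
    return pooled


def diagonal_inverse(variances):
    """Diagonal of the inverse covariance estimate: one reciprocal per variable."""
    return [1.0 / v for v in variances]


# Two groups, two observations each, two variables.
groups = [
    [[1.0, 10.0], [3.0, 14.0]],
    [[5.0, 20.0], [7.0, 28.0]],
]
variances = pooled_variances(groups)      # [2.0, 20.0]
inv_diag = diagonal_inverse(variances)    # [0.5, 0.05]
```

Because no off-diagonal terms are estimated, the cost and storage grow linearly in the number of variables rather than quadratically.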

The HiDimDA package also includes predict methods for all discriminant routines implemented, a routine (‘DACrossVal’) for assessing the quality of the classification results by k-fold cross-validation, and utilities for storing, extracting and efficiently handling specialized high-dimensional covariance and inverse covariance matrix estimates.
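For readers unfamiliar with k-fold cross-validation, the general idea can be sketched as follows. This Python sketch does not reproduce ‘DACrossVal’ or its arguments: all function names are hypothetical, the folds are formed by simple random assignment (no stratification or replication is assumed), and a one-variable nearest-mean classifier stands in for a discriminant rule.

```python
import random


def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds after a seeded shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_error(data, labels, k, fit, predict, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation."""
    errors = 0
    for fold in kfold_indices(len(data), k, seed):
        held_out = set(fold)
        train_x = [data[i] for i in range(len(data)) if i not in held_out]
        train_y = [labels[i] for i in range(len(data)) if i not in held_out]
        model = fit(train_x, train_y)  # train without the held-out fold
        errors += sum(predict(model, data[i]) != labels[i] for i in fold)
    return errors / len(data)


def fit_nearest_mean(xs, ys):
    """Stand-in classifier: store the mean of a single variable per class."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict_nearest_mean(model, x):
    """Assign x to the class whose stored mean is closest."""
    return min(model, key=lambda y: abs(x - model[y]))


# Two well-separated classes, four observations each.
data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
error = cross_val_error(data, labels, 4, fit_nearest_mean, predict_nearest_mean)
```

Each observation is predicted exactly once by a rule trained without it, so the resulting error rate is not biased by reuse of the training data.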


Antonio Pedro Duarte Silva <psilva@porto.ucp.pt>

Maintainer: Antonio Pedro Duarte Silva <psilva@porto.ucp.pt>


