HiDimDA-package {HiDimDA}    R Documentation
High Dimensional Discriminant Analysis
Description
Performs Linear Discriminant Analysis in high-dimensional problems, based on reliable covariance estimators for settings with (many) more variables than observations. Includes routines for classifier training, prediction, cross-validation and variable selection.
Details
Package:  HiDimDA
Type:     Package
Version:  0.2-6
Date:     2024-02-25
License:  GPL-3
LazyLoad: yes
LazyData: yes
HiDimDA is a package for High-Dimensional Discriminant Analysis aimed at problems with many variables, possibly many more than the number of available observations. Its core consists of the four Linear Discriminant Analysis routines:
Dlda:   Diagonal Linear Discriminant Analysis
Slda:   Shrunken Linear Discriminant Analysis
Mlda:   Maximum-uncertainty Linear Discriminant Analysis
RFlda:  Factor-model Linear Discriminant Analysis
and the variable selection routine:
SelectV: High-Dimensional variable selection for supervised classification
that selects variables to be used in a Discriminant classification rule by ranking them according to two-sample t-scores (problems with two groups) or ANOVA F-scores (problems with more than two groups), and discarding those with scores below a threshold defined by the Higher Criticism (HC) approach of Donoho and Jin (2008), the Expanded Higher Criticism scheme proposed by Duarte Silva (2011), False Discovery Rate (Fdr) control as suggested by Benjamini and Hochberg (1995), the FAIR approach of Fan and Fan (2008), or simply by fixing the number of retained variables to some pre-defined constant.
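As an illustration, the alternative thresholding schemes can be requested through SelectV's arguments. The following sketch assumes that the ‘Selmethod’ values (“Fdr”, “fixedp”) and the ‘maxp’ argument carry these names in the installed version; check help("SelectV") for the exact interface.

## Variable selection under different thresholding schemes (argument names assumed)
library(HiDimDA)
log10genes <- log10(AlonDS[,-1])                      # log-transformed gene expression data
SelExpHC <- SelectV(log10genes,AlonDS$grouping)       # default: Expanded Higher Criticism
SelFdr   <- SelectV(log10genes,AlonDS$grouping,Selmethod="Fdr")              # Fdr-based threshold (assumed name)
SelFixed <- SelectV(log10genes,AlonDS$grouping,Selmethod="fixedp",maxp=100)  # keep the 100 top-ranked genes (assumed)
SelExpHC$nvkpt                                        # number of variables kept
SelExpHC$vkpt                                         # indices of the kept variables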
All four discriminant routines, ‘Dlda’, ‘Slda’, ‘Mlda’ and ‘RFlda’, compute Linear Discriminant Functions, by default after a preliminary variable selection step, based on alternative estimators of the within-groups covariance matrix that lead to reliable allocation rules in problems where the number of selected variables is close to, or larger than, the number of available observations.
Consider a Discriminant Analysis problem with k groups, p selected variables, and a training sample of N = \sum_{g=1}^{k} n_g observations, with group means \bar{X}_g, overall mean \bar{X}_., and between-groups scatter matrix (scaled by degrees of freedom)

S_B = \frac{1}{N-k} \sum_{g=1}^{k} n_g (\bar{X}_g - \bar{X}_.)(\bar{X}_g - \bar{X}_.)^T .
Following the two main classical approaches to Linear Discriminant Analysis, the Discriminant Functions returned by the HiDimDA discriminant routines are either the canonical linear discriminants, given by the normalized eigenvectors

LD_j = Egvct_j (S_B \hat{\Sigma}_W^{-1}) ,  j = 1, ..., r = \min(p, k-1) ,
[LD_1, ..., LD_r]^T \hat{\Sigma}_W [LD_1, ..., LD_r] = I_r ,
or the classification functions

CF_g = (\bar{X}_g - \bar{X}_1) \hat{\Sigma}_W^{-1} ,  g = 2, ..., k ,

where \hat{\Sigma}_W^{-1} is an estimate of the inverse within-groups covariance matrix.
It is well known that these two approaches are equivalent: the rule that assigns a new observation to the group with the closest (in Euclidean distance) centroid in the space of the canonical variates, Z = [LD_1, ..., LD_r]^T X, gives the same result as the rule that assigns the observation to group 1 when all classification scores,

Clscr_g = CF_g^T X - CF_g^T \frac{\bar{X}_1 + \bar{X}_g}{2} ,

are negative, and to the group with the highest classification score otherwise.
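The following self-contained sketch (synthetic data, base R only, not part of the package interface) illustrates this equivalence by computing both allocation rules directly from the formulas above:

## Equivalence of the canonical-variate and classification-score rules (illustrative sketch)
set.seed(1)
k <- 3; p <- 5; n <- 20
grp  <- rep(1:k,each=n)
X    <- matrix(rnorm(k*n*p),nrow=k*n) + 0.8*outer(grp,1:p)      # shift the group means apart
Xbar <- t(sapply(1:k,function(g) colMeans(X[grp==g,])))         # k x p matrix of group means
Xdot <- colMeans(X)                                             # overall mean
W    <- Reduce(`+`,lapply(1:k,function(g) (n-1)*cov(X[grp==g,])))/(k*n-k)        # pooled within-groups covariance
SB   <- Reduce(`+`,lapply(1:k,function(g) n*tcrossprod(Xbar[g,]-Xdot)))/(k*n-k)  # between-groups scatter S_B
## Canonical discriminants: eigenvectors of W^-1 S_B, normalized so that LD' W LD = I
eW       <- eigen(W,symmetric=TRUE)
Whalfinv <- eW$vectors %*% diag(1/sqrt(eW$values)) %*% t(eW$vectors)   # W^(-1/2)
eM       <- eigen(Whalfinv %*% SB %*% Whalfinv,symmetric=TRUE)
r        <- min(p,k-1)
LD       <- Whalfinv %*% eM$vectors[,1:r,drop=FALSE]
## Rule 1: nearest group centroid (Euclidean) in the space of the canonical variates Z = LD' x
x      <- X[1,]
Zx     <- drop(crossprod(LD,x))
Zbar   <- Xbar %*% LD
alloc1 <- which.min(rowSums((Zbar - matrix(Zx,k,r,byrow=TRUE))^2))
## Rule 2: group 1 if all classification scores are negative, otherwise the highest-scoring group
Winv   <- solve(W)
scores <- sapply(2:k,function(g) {
  CF <- Winv %*% (Xbar[g,]-Xbar[1,])
  sum(CF*x) - sum(CF*(Xbar[1,]+Xbar[g,])/2)
})
alloc2 <- if (all(scores<0)) 1 else (2:k)[which.max(scores)]
alloc1 == alloc2    # TRUE: both rules allocate the observation to the same group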
The discriminant routines of HiDimDA compute canonical linear discriminant functions by default, and classification functions when the argument ‘ldafun’ is set to “classification”. However, unlike traditional linear discriminant analysis, where \Sigma_W^{-1} is estimated by the inverse of the sample covariance (which is not well-defined when p \geq N-k and is unreliable when p is close to N-k), the routines of HiDimDA use four alternative well-conditioned estimators of \Sigma_W^{-1} that lead to reliable classification rules when p is larger than, or close to, N-k.
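For example, a classification-function version of the diagonal rule can be requested as follows (a minimal sketch relying only on the ‘ldafun’ argument described above; the data preparation mirrors the Examples section, and the name of the returned ‘class’ component is assumed to match the canonical case):

## Diagonal LDA returning classification functions instead of canonical discriminants
library(HiDimDA)
log10genes <- log10(AlonDS[,-1])
DlclRule <- Dlda(log10genes,AlonDS$grouping,ldafun="classification")
predict(DlclRule,log10genes)$class     # in-sample allocations from the classification scores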
Turning to the four estimators: ‘Dlda’ estimates \Sigma_W^{-1} by the diagonal matrix of inverse sample variances; ‘Slda’ by the inverse of an optimally shrunken covariance estimate of the Ledoit and Wolf (2004) type, with the targets and optimal target-intensity estimators proposed by Fisher and Sun (2011); ‘Mlda’ uses a regularized inverse covariance that de-emphasizes the importance given to the last eigenvectors of the sample covariance (see Thomaz, Kitani and Gillies (2006) for details); and ‘RFlda’ uses a factor-model estimate of the true inverse correlation (or covariance) matrix, based on the approach of Duarte Silva (2011).
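As a further sketch, the factor-model rule can be fit with more than one factor; the ‘q’ argument below (number of factors) is an assumption about the RFlda interface and should be checked against help("RFlda"):

## Factor-model LDA with a two-factor covariance estimate (the 'q' argument is assumed)
library(HiDimDA)
log10genes <- log10(AlonDS[,-1])
Fact2Rule <- RFlda(log10genes,AlonDS$grouping,q=2)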
The HiDimDA package also includes predict methods for all implemented discriminant routines, a routine (‘DACrossVal’) for assessing the quality of the classification results by k-fold cross-validation, and utilities for storing, extracting and efficiently handling specialized high-dimensional covariance and inverse covariance matrix estimates.
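A minimal cross-validation sketch is shown below; the ‘TrainAlg’, ‘kfold’ and ‘CVrep’ argument names are assumptions about the DACrossVal interface and should be checked against help("DACrossVal"):

## 5-fold cross-validation of the diagonal rule, replicated twice (argument names assumed)
library(HiDimDA)
log10genes <- log10(AlonDS[,-1])
CVres <- DACrossVal(log10genes,AlonDS$grouping,TrainAlg=Dlda,kfold=5,CVrep=2)
str(CVres)    # inspect the structure of the cross-validation results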
Author(s)
Antonio Pedro Duarte Silva <psilva@porto.ucp.pt>
Maintainer: Antonio Pedro Duarte Silva <psilva@porto.ucp.pt>
References
Benjamini, Y. and Hochberg, Y. (1995) “Controlling the false discovery rate: A practical and powerful approach to multiple testing”, Journal of the Royal Statistical Society B, 57, 289-300.
Donoho, D. and Jin, J. (2008) “Higher criticism thresholding: Optimal feature selection when useful features are rare and weak”, Proceedings of the National Academy of Sciences, USA, 105, 14790-14795.
Fan, J. and Fan, Y. (2008) “High-dimensional classification using features annealed independence rules”, Annals of Statistics, 36 (6), 2605-2637.
Fisher, T.J. and Sun, X. (2011) “Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix”, Computational Statistics and Data Analysis, 55 (1), 1909-1918.
Ledoit, O. and Wolf, M. (2004) “A well-conditioned estimator for large-dimensional covariance matrices”, Journal of Multivariate Analysis, 88 (2), 365-411.
Pedro Duarte Silva, A. (2011) “Two Group Classification with High-Dimensional Correlated Data: A Factor Model Approach”, Computational Statistics and Data Analysis, 55 (1), 2975-2990.
Thomaz, C.E., Kitani, E.C. and Gillies, D.F. (2006) “A maximum uncertainty LDA-based approach for limited sample size problems - with application to face recognition”, Journal of the Brazilian Computer Society, 12 (2), 7-18.
See Also
Dlda, Mlda, Slda, RFlda, predict.canldaRes, predict.clldaRes, AlonDS
Examples
# train the four main classifiers with their default settings
# on Alon's colon data set (after a logarithmic transformation),
# selecting genes by the Expanded HC scheme
# Pre-process and select the genes to be used in the classifiers
log10genes <- log10(AlonDS[,-1])
SelectionRes <- SelectV(log10genes,AlonDS$grouping)
genesused <- log10genes[SelectionRes$vkpt]
# Train classifiers
DiaglldaRule <- Dlda(genesused,AlonDS$grouping)
FactldaRule <- RFlda(genesused,AlonDS$grouping)
MaxUldaRule <- Mlda(genesused,AlonDS$grouping)
ShrkldaRule <- Slda(genesused,AlonDS$grouping)
# Get in-sample classification results
predict(DiaglldaRule,genesused,grpcodes=levels(AlonDS$grouping))$class
predict(FactldaRule,genesused,grpcodes=levels(AlonDS$grouping))$class
predict(MaxUldaRule,genesused,grpcodes=levels(AlonDS$grouping))$class
predict(ShrkldaRule,genesused,grpcodes=levels(AlonDS$grouping))$class
# Compare classifications with true assignments
cat("Original classes:\n")
print(AlonDS$grouping)
# Show set of selected genes
cat("Genes kept in discrimination rule:\n")
print(colnames(genesused))
cat("Number of selected genes =",SelectionRes$nvkpt,"\n")