
SelectV {HiDimDA}

R Documentation

Variable Selection for High-Dimensional Supervised Classification.

Description

Selects variables to be used in a Discriminant Analysis classification rule.

Usage

SelectV(data, grouping, 
Selmethod=c("ExpHC","HC","Fdr","Fair","fixedp"),
NullDist=c("locfdr","Theoretical"), uselocfdr=c("onlyHC","always"), 
minlocfdrp=200, comvar=TRUE, Fdralpha=0.5, 
ExpHCalpha=0.5, HCalpha0=0.1, maxp=ncol(data), tol=1E-12, ...)

Arguments

`data`	Matrix or data frame of observations.
`grouping`	Factor specifying the class for each observation.
`Selmethod`	The method used to choose the number of variables selected. Current alternatives are: ‘ExpHC’ (default) for the Expanded Higher Criticism scheme of Duarte Silva (2011); ‘HC’ for the Higher Criticism (HC) approach of Donoho and Jin (2004, 2008); ‘Fdr’ for the False Discovery Rate control approach of Benjamini and Hochberg (1995); ‘Fair’ for the FAIR (Features Annealed Independence Rules) approach of Fan and Fan (2008), which is only available for two-group classification problems; and ‘fixedp’ for a constant chosen by the user.
`NullDist`	The Null distribution used to compute p-values from t-scores or F-scores. Current alternatives are “Theoretical” for the corresponding theoretical distributions, and “locfdr” for an empirical Null of z-scores estimated by the maximum likelihood approach of Efron (2004).
`uselocfdr`	Flag indicating the statistics for which the Null empirical distribution estimated by the locfdr approach should be used. Current alternatives are “onlyHC” (default) and “always”.
`minlocfdrp`	Minimum number of variables required to estimate empirical Null distributions by the locfdr method. When the number of variables is below ‘minlocfdrp’, theoretical Nulls are always employed.
`comvar`	Boolean flag indicating if a common group variance is to be assumed (default) in the computation of the t-scores used for problems with two groups.
`Fdralpha`	Control level for variable selection based on False Discovery Rate Control (see Benjamini and Hochberg (1995)).
`ExpHCalpha`	Control level for the first step of the Expanded Higher Criticism scheme (see Duarte Silva (2011)).
`HCalpha0`	Proportion of p-values used to compute the HC statistic (see Donoho and Jin (2004, 2008)).
`maxp`	Maximum number of variables to be used in the discriminant rule.
`tol`	Numerical precision for distinguishing p-values from 0 and 1. Computed p-values below ‘tol’ are set to ‘tol’, and those above 1-‘tol’ are set to 1-‘tol’.
`...`	Arguments passed from other methods.
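
The effect of ‘tol’ can be illustrated with a minimal base R sketch. The helper ‘clamp_pvalues’ is a hypothetical name introduced here for illustration and is not part of HiDimDA; the conversion to z-scores with ‘qnorm’ is shown only as one plausible reason why p-values of exactly 0 or 1 are avoided.

```r
# Hypothetical helper (not part of HiDimDA): force p-values into [tol, 1 - tol],
# mirroring the rule described for the 'tol' argument.
clamp_pvalues <- function(p, tol = 1e-12) {
  pmin(pmax(p, tol), 1 - tol)
}

p <- c(0, 1e-20, 0.3, 1)
clamped <- clamp_pvalues(p)
print(clamped)

# Without clamping, qnorm(0) and qnorm(1) are -Inf and Inf;
# after clamping every z-score is finite.
print(is.finite(qnorm(p)))
print(is.finite(qnorm(clamped)))
```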

Details

The function ‘SelectV’ selects variables to be used in a Discriminant classification rule by the Higher Criticism (HC) approach of Donoho and Jin (2004, 2008), the Expanded Higher Criticism scheme proposed by Duarte Silva (2011), False Discovery Rate (Fdr) control as suggested by Benjamini and Hochberg (1995), the FAIR (Features Annealed Independence Rules) approach of Fan and Fan (2008), or simply by fixing the number of selected variables to some pre-defined constant.

The Fdr method is, by default, based on simple p-values derived from t-scores (problems with two groups) or ANOVA F-scores (problems with more than two groups). When the argument ‘NullDist’ is set to “Theoretical” these values are also used in the HC method. Otherwise, the HC p-values are derived from an empirical Null of z-scores estimated by the maximum likelihood approach of Efron (2004).
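
As a rough illustration of the two-group Fdr case with theoretical Nulls, the sketch below computes pooled-variance t-scores, turns them into two-sided p-values, and applies the Benjamini and Hochberg step-up rule through ‘p.adjust’ from base R. The simulated data and the names ‘x’, ‘grp’ and ‘alpha’ are assumptions made for this example; the computations performed inside ‘SelectV’ may differ in detail.

```r
set.seed(1)
n1 <- 20; n2 <- 20; p <- 500

# Simulated data (an assumption for illustration): only the first 10 variables
# have different means in the two groups.
x <- matrix(rnorm((n1 + n2) * p), nrow = n1 + n2, ncol = p)
grp <- factor(rep(c("a", "b"), c(n1, n2)))
x[grp == "b", 1:10] <- x[grp == "b", 1:10] + 3

# Pooled-variance (common group variance) two-sample t-score for each variable
tscores <- unname(apply(x, 2, function(v) {
  t.test(v[grp == "a"], v[grp == "b"], var.equal = TRUE)$statistic
}))

# Two-sided p-values from the theoretical t distribution
pvals <- 2 * pt(-abs(tscores), df = n1 + n2 - 2)

# Benjamini-Hochberg selection at control level alpha
alpha <- 0.5
selected <- which(p.adjust(pvals, method = "BH") <= alpha)
print(length(selected))
```

With more than two groups, the same pattern would apply with ANOVA F-scores and upper-tail p-values from the F distribution in place of the t-scores.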

The variable rankings are based on absolute-value t-scores or ANOVA F-scores.
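
Given such a ranking, a selection method only has to decide how many of the top-ranked variables to keep. The sketch below shows this for one common form of the Higher Criticism objective, evaluated over the smallest ‘alpha0’ fraction of the ordered p-values. The simulated p-values, the variable names and the exact form of the HC formula are assumptions made for illustration and are not taken from the HiDimDA source.

```r
set.seed(2)
p <- 1000

# Simulated p-values (an assumption): 20 very small ones among uniform noise
pvals <- c(runif(20, 0, 1e-4), runif(p - 20))
scores <- abs(qnorm(pvals / 2))  # stand-in for absolute t-scores

# Rank variables from largest to smallest absolute score
ranking <- order(scores, decreasing = TRUE)
sorted_p <- pvals[ranking]       # p-values in ascending order

# HC objective over the first alpha0 * p ordered p-values
alpha0 <- 0.1
k_max <- floor(alpha0 * p)
i <- seq_len(k_max)
hc <- sqrt(p) * (i / p - sorted_p[i]) / sqrt((i / p) * (1 - i / p))

# Keep the top-ranked variables up to the position that maximizes HC
k_hc <- which.max(hc)
kept <- ranking[seq_len(k_hc)]
print(k_hc)
```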

Value

A list with two components:

`nvkpt`	the number of variables to be used in the Discriminant rule
`vkptInd`	the indices of the variables to be used in the Discriminant rule
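
A typical use of the returned list is to restrict the data to the selected columns before fitting a classification rule. The sketch below uses a hand-made list named ‘res’ as a hypothetical stand-in for a ‘SelectV’ result, together with a simulated matrix, so that it runs without the package; the values in ‘res’ are made up.

```r
# Hypothetical stand-in for the list returned by SelectV()
res <- list(nvkpt = 3L, vkptInd = c(7L, 2L, 11L))

# Simulated data matrix with named columns (an assumption for illustration)
set.seed(3)
x <- matrix(rnorm(10 * 15), nrow = 10, ncol = 15,
            dimnames = list(NULL, paste0("V", 1:15)))

# Keep only the selected variables; drop = FALSE preserves the matrix
# structure even if a single variable is selected.
x_sel <- x[, res$vkptInd, drop = FALSE]
print(colnames(x_sel))
```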

Author(s)

A. Pedro Duarte Silva

References

Benjamini, Y. and Hochberg, Y. (1995) “Controlling the false discovery rate: A practical and powerful approach to multiple testing”, Journal of the Royal Statistical Society B, 57, 289-300.

Donoho, D. and Jin, J. (2004) “Higher criticism for detecting sparse heterogeneous mixtures”, Annals of Statistics, 32, 962-994.

Donoho, D. and Jin, J. (2008) “Higher criticism thresholding: Optimal feature selection when useful features are rare and weak”, Proceedings of the National Academy of Sciences, USA, 105, 14790-14795.

Efron, B. (2004) “Large-scale simultaneous hypothesis testing: the choice of a null hypothesis”, Journal of the American Statistical Association, 99, 96-104.

Fan, J. and Fan, Y. (2008) “High-dimensional classification using features annealed independence rules”, Annals of Statistics, 36 (6), 2605-2637.

Pedro Duarte Silva, A. (2011) “Two Group Classification with High-Dimensional Correlated Data: A Factor Model Approach”, Computational Statistics and Data Analysis, 55 (11), 2975-2990.

Examples


## Not run: 

# Compare the number of variables selected by the four data-driven
# methods (ExpHC, HC, Fdr and Fair) on Alon's Colon Cancer Data set
# after a logarithmic transformation.

log10genes <- log10(AlonDS[,-1])

Res <- array(dim=4)
names(Res) <- c("ExpHC","HC","Fdr","Fair")
Res[1] <- SelectV(log10genes,AlonDS[,1])$nvkpt
Res[2] <- SelectV(log10genes,AlonDS[,1],Selmethod="HC")$nvkpt
Res[3] <- SelectV(log10genes,AlonDS[,1],Selmethod="Fdr")$nvkpt
Res[4] <- SelectV(log10genes,AlonDS[,1],Selmethod="Fair")$nvkpt

print(Res)

## End(Not run)

[Package HiDimDA version 0.2-6 Index]