sda.ranking {sda} | R Documentation |
Shrinkage Discriminant Analysis 1: Predictor Ranking
Description
sda.ranking determines a ranking of predictors by computing CAT scores (correlation-adjusted t-scores) between the group centroids and the pooled mean.
plot.sda.ranking provides a graphical visualization of the top-ranking features.
Usage
sda.ranking(Xtrain, L, lambda, lambda.var, lambda.freqs,
ranking.score=c("entropy", "avg", "max"),
diagonal=FALSE, fdr=TRUE, plot.fdr=FALSE, verbose=TRUE)
## S3 method for class 'sda.ranking'
plot(x, top=40, arrow.col="blue", zeroaxis.col="red",
ylab="Features", main, ...)
Arguments
Xtrain |
A matrix containing the training data set. Note that the rows correspond to observations and the columns to variables. |
L |
A factor with the class labels of the training samples. |
lambda |
Shrinkage intensity for the correlation matrix. If not specified it is
estimated from the data. |
lambda.var |
Shrinkage intensity for the variances. If not specified it is
estimated from the data. |
lambda.freqs |
Shrinkage intensity for the frequencies. If not specified it is
estimated from the data. |
diagonal |
Chooses between LDA (default, diagonal=FALSE) and DDA (diagonal=TRUE). |
ranking.score |
how to compute the summary score for each variable from the CAT scores of all classes - see Details. |
fdr |
compute FDR values and HC scores for each feature. |
plot.fdr |
Show plot with estimated FDR values. |
verbose |
Print out some info while computing. |
x |
An "sda.ranking" object – this is produced by the sda.ranking() function. |
top |
The number of top-ranking features shown in the plot (default: 40). |
arrow.col |
Color of the arrows in the plot (default is "blue"). |
zeroaxis.col |
Color for the center zero axis (default is "red"). |
ylab |
Label written next to feature list (default is "Features"). |
main |
Main title (if missing, |
... |
Other options passed on to generic plot(). |
Details
For each predictor variable and each centroid, a shrinkage CAT score of the group mean versus the pooled mean is computed. If there are only two classes, the CAT score versus the pooled mean reduces to the CAT score between the two group means. Moreover, in the diagonal case (DDA) the (shrinkage) CAT score reduces to the (shrinkage) t-score.
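The reduction to t-scores in the diagonal case (diagonal=TRUE) can be illustrated with a small base R sketch. It assumes the definition from Zuber and Strimmer (2009), where the vector of CAT scores is the vector of t-scores multiplied by the inverse square root of the correlation matrix. The numbers and the helper inv.sqrt are made up for illustration. This is not the code used inside the sda package, which works with shrinkage estimates.

```r
# Toy inputs, made up for illustration: t-scores of three features
# and an assumed correlation matrix P among these features.
t.scores = c(2.0, 1.0, -0.5)
P = matrix(c(1.0, 0.6, 0.0,
             0.6, 1.0, 0.0,
             0.0, 0.0, 1.0), nrow=3, byrow=TRUE)

# hypothetical helper: inverse matrix square root of a symmetric
# positive definite matrix via its eigendecomposition
inv.sqrt = function(M)
{
  e = eigen(M, symmetric=TRUE)
  e$vectors %*% diag(1/sqrt(e$values)) %*% t(e$vectors)
}

# CAT scores: t-scores decorrelated by P^(-1/2)
cat.full = as.vector( inv.sqrt(P) %*% t.scores )

# diagonal case: P is replaced by the identity matrix,
# so the CAT scores are just the t-scores
cat.diag = as.vector( inv.sqrt(diag(3)) %*% t.scores )

cat.full
cat.diag
```

The third feature is uncorrelated with the other two, so its CAT score equals its t-score even with the full correlation matrix.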
The overall ranking of a feature is determined by computing a summary score from the CAT scores. This is controlled by the option ranking.score. The default setting (ranking.score="entropy") ranks by the mutual information between the response and the respective predictor, which is equivalent to a weighted sum of squared CAT scores across the classes. Another possibility is to use the average of the squared CAT scores (as suggested in Ahdesmäki and Strimmer 2010) by setting ranking.score="avg". A third option is to use the maximum of the squared CAT scores across groups (similar to the PAM algorithm) by setting ranking.score="max".
Note that in the case of two classes all three options are equivalent and lead to identical scores. Thus, the choice of ranking.score is important only in the multi-class setting. In the two-class case the features are simply ranked according to the (shrinkage) squared CAT scores (or t-scores, if there is no correlation among predictors).
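The "avg" and "max" summaries can be sketched in base R on a toy matrix of CAT scores. The values below are made up and only serve to show that the two summaries can rank features differently when there are more than two classes. The entropy-based default is not reproduced here because its class weights are not spelled out in this section.

```r
# Toy matrix of CAT scores: 4 features (rows) x 3 classes (columns).
# The values are made up for illustration only.
cat.scores = matrix(c( 3.0, -1.5, -1.5,
                       0.5,  0.5, -1.0,
                      -4.0,  0.2,  0.1,
                       2.5, -2.5,  2.5),
                    nrow=4, byrow=TRUE,
                    dimnames=list(paste0("feature", 1:4),
                                  paste0("class", 1:3)))

# summary score per feature
score.avg = rowMeans(cat.scores^2)        # average of squared CAT scores
score.max = apply(cat.scores^2, 1, max)   # maximum of squared CAT scores

# feature names ordered from best to worst under each summary
rank.avg = names(sort(score.avg, decreasing=TRUE))
rank.max = names(sort(score.max, decreasing=TRUE))

rank.avg
rank.max
```

In this toy case feature4 has moderate scores in all classes and ranks first under the average, while feature3 has one very large score and ranks first under the maximum.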
The current default approach is to rank by mutual information (i.e. relative entropy between the full model and the model without the predictor) and to use shrinkage estimators of the frequencies. To reproduce exactly the ranking computed by previous versions (1.1.0 to 1.3.0) of the sda package, set the options ranking.score="avg" and lambda.freqs=0.
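A call with these two options might look as follows. The sketch reuses the singh2002 data from the Examples section and requires the sda package to be installed. The object name ranking.old is chosen here for illustration. Because singh2002 has only two classes, the choice of ranking.score does not matter for this particular data set.

```r
library("sda")

# prostate cancer data set shipped with the sda package
data(singh2002)

# ranking with the settings of sda versions 1.1.0 to 1.3.0:
# average of squared CAT scores, no shrinkage of class frequencies
ranking.old = sda.ranking(singh2002$x, singh2002$y,
                          ranking.score="avg", lambda.freqs=0,
                          verbose=FALSE)
ranking.old[1:10,]
```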
Calling sda.ranking is step 1 in a classification analysis with the sda package. Steps 2 and 3 are sda and predict.sda.
See Zuber and Strimmer (2009) for CAT scores in general, and Ahdesmäki and Strimmer (2010) for details on multi-class CAT scores. For shrinkage t-scores see Opgen-Rhein and Strimmer (2007).
Value
sda.ranking
returns a matrix with the following columns:
idx |
original feature number |
score |
sum of the squared CAT scores across groups - this determines the overall ranking of a feature |
cat |
for each group and feature, the CAT score of the centroid versus the pooled mean |
If fdr=TRUE then additionally local false discovery rate (FDR) values as well as higher criticism (HC) scores are computed for each feature (using fdrtool).
Author(s)
Miika Ahdesmäki, Verena Zuber, Sebastian Gibb, and Korbinian Strimmer (https://strimmerlab.github.io).
References
Ahdesmäki, M., and K. Strimmer. 2010. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann. Appl. Stat. 4: 503-519. <DOI:10.1214/09-AOAS277>
Opgen-Rhein, R., and K. Strimmer. 2007. Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Statist. Appl. Genet. Mol. Biol. 6:9. <DOI:10.2202/1544-6115.1252>
Zuber, V., and K. Strimmer. 2009. Gene ranking and biomarker discovery under correlation. Bioinformatics 25: 2700-2707. <DOI:10.1093/bioinformatics/btp460>
See Also
Examples
# load sda library
library("sda")
#################
# training data #
#################
# prostate cancer set
data(singh2002)
# training data
Xtrain = singh2002$x
Ytrain = singh2002$y
#########################################
# feature ranking (diagonal covariance) #
#########################################
# ranking using t-scores (DDA)
ranking.DDA = sda.ranking(Xtrain, Ytrain, diagonal=TRUE)
ranking.DDA[1:10,]
# plot t-scores for the top 40 genes
plot(ranking.DDA, top=40)
# number of features with local FDR < 0.8
# (i.e. features useful for prediction)
sum(ranking.DDA[,"lfdr"] < 0.8)
# number of features with local FDR < 0.2
# (i.e. significant non-null features)
sum(ranking.DDA[,"lfdr"] < 0.2)
# optimal feature set according to HC score
plot(ranking.DDA[,"HC"], type="l")
which.max( ranking.DDA[1:1000,"HC"] )
#####################################
# feature ranking (full covariance) #
#####################################
# ranking using CAT-scores (LDA)
ranking.LDA = sda.ranking(Xtrain, Ytrain, diagonal=FALSE)
ranking.LDA[1:10,]
# plot CAT-scores for the top 40 genes
plot(ranking.LDA, top=40)
# number of features with local FDR < 0.8
# (i.e. features useful for prediction)
sum(ranking.LDA[,"lfdr"] < 0.8)
# number of features with local FDR < 0.2
# (i.e. significant non-null features)
sum(ranking.LDA[,"lfdr"] < 0.2)
# optimal feature set according to HC score
plot(ranking.LDA[,"HC"], type="l")
which.max( ranking.LDA[1:1000,"HC"] )