diffexprm {bapred}    R Documentation

Measure for performance of differential expression analysis (after batch effect adjustment)


This metric is similar to the idea presented in Lazar et al. (2012), which consists of comparing the list of the most differentially expressed genes obtained from a batch effect adjusted dataset with the list obtained from an independent dataset. For each batch, diffexprm does the following: 1) the batch is left out and batch effect adjustment is performed on the remaining batches; 2) differential expression analysis is performed twice, once on the left-out batch and once on the remaining batch effect adjusted data; 3) the overlap between the two resulting lists of differentially expressed genes is measured. See below for further details.


diffexprm(x, batch, y, method = c("fabatch", "combat", "sva", 
  "meancenter", "standardize", "ratioa", "ratiog", "none"))



x: matrix. The covariate matrix. Observations in rows, variables in columns.


batch: factor. Batch variable. Currently has to have levels '1', '2', '3' and so on.


y: factor. Binary target variable. Currently has to have levels '1' and '2'.


method: character. Method for batch effect adjustment. The following are supported: fabatch, combat, sva, meancenter, standardize, ratioa, ratiog and none.
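The level requirements on batch and y can be met by recoding arbitrary labels before the call. The following is a minimal sketch in base R; the raw label vectors batch_raw and y_raw are made up for illustration and are not part of bapred.

```r
# Hypothetical raw labels (illustration only, not bapred data).
batch_raw <- c("labA", "labB", "labA", "labC", "labB", "labC")
y_raw     <- c("control", "case", "case", "control", "case", "control")

# factor() sorts the distinct labels; as.integer() maps them to 1, 2, 3, ...
# Wrapping the integer codes in factor() again yields levels "1", "2", ...
batch <- factor(as.integer(factor(batch_raw)))
y     <- factor(as.integer(factor(y_raw)))

levels(batch)  # "1" "2" "3"
levels(y)      # "1" "2"

# The target variable has to be binary.
stopifnot(nlevels(y) == 2)
```

Note that the numeric codes follow the alphabetical order of the original labels, so here "case" becomes '1' and "control" becomes '2'.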


The following procedure is performed: 1) For each batch j, leave this batch out and perform batch effect adjustment on the rest of the dataset. Derive two lists, each containing the 5 percent of variables that are most differentially expressed (see next paragraph): one from the batch effect adjusted dataset without batch j, and one from the data of batch j. Count the variables appearing in both lists and divide this number by the common length of the lists. 2) Calculate a weighted average of the values obtained in 1), with weights proportional to the number of observations in the corresponding left-out batches.
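The overlap and weighting steps of this procedure can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by diffexprm: the helpers top_fraction and overlap_fraction are hypothetical names, the p-values are random placeholders standing in for real test results, and rounding the list length with round() is an assumption.

```r
# Indices of the `frac` share of variables with the smallest p-values.
# (Hypothetical helper; the rounding rule is an assumption.)
top_fraction <- function(pvals, frac = 0.05) {
  k <- max(1, round(frac * length(pvals)))
  order(pvals)[seq_len(k)]
}

# Number of variables in both lists, divided by the common list length.
overlap_fraction <- function(p_adjusted, p_leftout, frac = 0.05) {
  top_adj  <- top_fraction(p_adjusted, frac)
  top_left <- top_fraction(p_leftout, frac)
  length(intersect(top_adj, top_left)) / length(top_adj)
}

# Toy setting: 3 batches, 40 variables, placeholder p-values.
set.seed(1)
batch_sizes <- c(20, 30, 50)
overlaps <- sapply(seq_along(batch_sizes), function(j) {
  overlap_fraction(runif(40), runif(40))
})

# Step 2: weights proportional to the size of each left-out batch.
metric <- weighted.mean(overlaps, w = batch_sizes)
metric
```

Because each overlap lies between 0 and 1, the weighted average does as well.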

Differential expression is measured as follows. For each variable, a randomized p-value from the Wilcoxon rank sum (Mann-Whitney) test is drawn; see Geyer and Meeden (2005) for details. The 5 percent of variables with the smallest p-values are then considered differentially expressed.
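To illustrate what a randomized p-value looks like, here is a minimal base R sketch. It is a simplified, one-sided version that assumes no ties and is not the implementation used by diffexprm; rand_wilcox_p is a hypothetical helper name and the data are simulated. The idea is to draw the p-value uniformly between P(W > w) and P(W >= w), which removes the discreteness of the exact test.

```r
# One-sided randomized p-value of the Wilcoxon rank sum test (no ties assumed).
# Hypothetical helper for illustration; tests whether group "1" tends to be larger.
rand_wilcox_p <- function(v, g, u = runif(1)) {
  x1 <- v[g == "1"]
  x2 <- v[g == "2"]
  # Mann-Whitney statistic: number of pairs with x1 > x2.
  w <- sum(outer(x1, x2, ">"))
  m <- length(x1)
  n <- length(x2)
  p_greater <- pwilcox(w, m, n, lower.tail = FALSE)  # P(W > w)
  p_equal   <- dwilcox(w, m, n)                      # P(W = w)
  # Uniform draw between P(W > w) and P(W >= w).
  p_greater + u * p_equal
}

# Simulated data: 20 observations, 40 variables, two shifted variables.
set.seed(2)
y <- factor(rep(c("1", "2"), each = 10))
X <- matrix(rnorm(20 * 40), nrow = 20)
X[y == "1", 1:2] <- X[y == "1", 1:2] + 3

# One randomized p-value per variable, then keep the 5 percent smallest.
pvals <- apply(X, 2, rand_wilcox_p, g = y)
k <- ceiling(0.05 * ncol(X))
selected <- order(pvals)[seq_len(k)]
selected
```

Because the p-values are continuous, ties among them are practically excluded, so the 5 percent cutoff yields a list of well-defined length.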


Value of the metric


The larger the value of this metric, the better.


Roman Hornung


Hornung, R., Boulesteix, A.-L., Causeur, D. (2016) Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment. BMC Bioinformatics 17:27.

Lazar, C., Meganck, S., Taminau, J., Steenhoff, D., Coletta, A., Molter, C., Weiss-Solís, D. Y., Duque, R., Bersini, H., Nowé, A. (2012) Batch effect removal methods for microarray gene expression data integration: a survey. Briefings in Bioinformatics, 14(4), 469-490.

Geyer, C. J., Meeden, G. D. (2005) Fuzzy and randomized confidence intervals and p-values (with discussion). Statistical Science, 20(4), 358-387.



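# Load the example data. Assumption: X, batch and y come from the
# 'autism' dataset shipped with bapred.
library(bapred)
data(autism)

# Random subset of 150 variables:
Xsub <- X[,sample(1:ncol(X), size=150)]

# In cases of batches with more than 20 observations
# select 20 observations at random:
subinds <- unlist(sapply(1:length(levels(batch)), function(x) {
  indbatch <- which(batch==x)
  if(length(indbatch) > 20)
    indbatch <- sort(sample(indbatch, size=20))
  # Return the (possibly subsampled) row indices of this batch:
  indbatch
}))
Xsub <- Xsub[subinds,]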
batchsub <- batch[subinds]
ysub <- y[subinds]

diffexprm(x=Xsub, batch=batchsub, y=ysub, method = "ratiog")
diffexprm(x=Xsub, batch=batchsub, y=ysub, method = "none")

[Package bapred version 1.0 Index]