linda {MicrobiomeStat}R Documentation

Linear (Lin) Model for Differential Abundance (DA) Analysis of High-dimensional Compositional Data

Description

The function implements a simple, robust and highly scalable approach to tackle the compositional effects in differential abundance analysis of high-dimensional compositional data. It fits linear regression models on the centered log2-ratio transformed data, identifies a bias term due to the transformation and compositional effect, and corrects the bias using the mode of the regression coefficients. It could fit mixed-effect models for analysis of correlated data.

Usage

linda(
  feature.dat,
  meta.dat,
  formula,
  feature.dat.type = c('count', 'proportion'),
  prev.filter = 0,
  mean.abund.filter = 0, 
  max.abund.filter = 0,
  is.winsor = TRUE,
  outlier.pct = 0.03,
  adaptive = TRUE,
  zero.handling = c('pseudo-count', 'imputation'),
  pseudo.cnt = 0.5,
  corr.cut = 0.1,
  p.adj.method = "BH",
  alpha = 0.05,
  n.cores = 1, 
  verbose = TRUE
)

Arguments

feature.dat

a matrix of counts/proportions, row - features (OTUs, genes, etc) , column - samples.

meta.dat

a data frame containing the sample meta data. If there are NAs, the corresponding samples will be removed in the analysis.

formula

a character string for the formula. The formula should conform to that used by lm (independent data) or lmer (correlated data). For example: formula = '~x1*x2+x3+(1|id)'. At least one fixed effect is required.

feature.dat.type

the type of the feature data. It could be "count" or "proportion".

prev.filter

the prevalence (percentage of non-zeros) cutoff, under which the features will be filtered. The default is 0.

mean.abund.filter

the mean relative abundance cutoff, under which the features will be filtered. The default is 0.

max.abund.filter

the max relative abundance cutoff, under which the features will be filtered. The default is 0.

is.winsor

a logical value indicating whether winsorization should be performed to replace outliers (high values). The default is TRUE.

outlier.pct

the expected percentage of outliers. These outliers will be winsorized. The default is 0.03.

adaptive

a logical value indicating whether the approach to handle zeros (pseudo-count or imputation) will be determined based on the correlations between the log(sequencing depth) and the explanatory variables in formula when feature.dat is 'count'. If TRUE and the correlation p-value for any explanatory variable is smaller than or equal to corr.cut, the imputation approach will be used; otherwise, the pseudo-count approach will be used.

zero.handling

a character string of 'pseudo-count' or 'imputation' indicating the zero handling method used when feature.dat is 'count'. If 'pseudo-count', apseudo.cnt will be added to each value in feature.dat. If 'imputation', then we use the imputation approach using the formula in the referenced paper. Basically, zeros are imputed with values proportional to the sequencing depth. When feature.dat is 'proportion', this parameter will be ignored and zeros will be imputed by half of the minimum for each feature.

pseudo.cnt

a positive numeric value for the pseudo-count to be added if zero.handling is 'pseudo-count'. Default is 0.5.

corr.cut

a numerical value between 0 and 1, indicating the significance level used for determining the zero-handling approach when adaptive is TRUE. Default is 0.1.

p.adj.method

a character string indicating the p-value adjustment approach for addressing multiple testing. See R function p.adjust. Default is 'BH'.

alpha

a numerical value between 0 and 1 indicating the significance level for declaring differential features. Default is 0.05.

n.cores

a positive integer. If n.cores > 1 and formula is in a form of mixed-effect model, n.cores parallels will be conducted. Default is 1.

verbose

a logical value indicating whether the trace information should be printed out.

Value

A list with the elements

variables

A vector of variable names of all fixed effects in formula. For example: formula = '~x1*x2+x3+(1|id)'. Suppose x1 and x2 are numerical, and x3 is a categorical variable of three levels: a, b and c. Then the elements of variables would be ('x1', 'x2', 'x3b', 'x3c', 'x1:x2').

bias

numeric vector; each element corresponds to one variable in variables; the estimated bias of the regression coefficients due to the compositional effect.

output

a list of data frames with columns 'baseMean', 'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj', 'reject', 'df'; names(output) is equal to variables; the rows of the data frame corresponds to taxa. Note: if there are taxa being excluded due to prev.cut, the number of the rows of the output data frame will be not equal to the number of the rows of otu.tab. Taxa are identified by the rownames. If the rownames of otu.tab are NULL, then 1 : nrow(otu.tab) is set as the rownames of otu.tab.

baseMean:

2 to the power of the intercept coefficients (normalized by one million)

log2FoldChange:

bias-corrected coefficients

lfcSE:

standard errors of the coefficients

stat:

log2FoldChange / lfcSE

pvalue:

2 * pt(-abs(stat), df)

padj:

p.adjust(pvalue, method = p.adj.method)

reject:

padj <= alpha

df:

degrees of freedom. The number of samples minus the number of explanatory variables (intercept included) for fixed-effect models; estimates from R package lmerTest with Satterthwaite method of approximation for mixed-effect models.

covariance

a list of data frames; the data frame records the covariances between a regression coefficient with other coefficients; names(covariance) is equal to variables; the rows of the data frame corresponds to taxa. If the length of variables is equal to 1, then the covariance is NULL.

otu.tab.use

the OTU table used in the abundance analysis (the otu.tab after the preprocessing: samples that have NAs in the variables in formula or have less than lib.cut read counts are removed; taxa with prevalence less than prev.cut are removed and data is winsorized if !is.null(winsor.quan); and zeros are treated, i.e., imputed or pseudo-count added).

meta.use

the meta data used in the abundance analysis (only variables in formula are stored; samples that have NAs or have less than lib.cut read counts are removed; numerical variables are scaled).

wald

a list for use in Wald test. If the fitting model is a linear model, then it includes

beta:

a matrix of the biased regression coefficients including intercept and all fixed effects; the culumns correspond to taxa

sig:

the standard errors; the elements corresponding to taxa

X:

the design matrix of the fitting model

bias:

the estimated biases of the regression coefficients including intercept and all fixed effects

If the fitting model is a linear mixed-effect model, then it includes

beta:

a matrix of the biased regression coefficients including intercept and all fixed effects; the culumns correspond to taxa

beta.cov:

a list of covariance matrices of beta; the elements corresponding to taxa

rand.cov:

a list with covariance matrices of variance parameters of random effects; the elements corresponding to taxa; see more details in the paper of 'lmerTest'

Joc.beta.cov.rand:

a list of a list of Jacobian matrices of beta.cov with respect to the variance parameters; the elements corresponding to taxa

bias:

the estimated biases of the regression coefficients including intercept and all fixed effects

Author(s)

Huijuan Zhou, Jun Chen, Xianyang Zhang

References

Zhou, H., He, K., Chen, J., Zhang, X. (2022). LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome biology, 23(1), 95.

Examples


data(smokers)

ind <- smokers$meta$AIRWAYSITE == 'Throat'
otu.tab <- as.data.frame(smokers$otu[, ind])
depth <- colSums(otu.tab)
meta <- cbind.data.frame(Smoke = factor(smokers$meta$SMOKER[ind]),
                         Sex = factor(smokers$meta$SEX[ind]),
                         Site = factor(smokers$meta$SIDEOFBODY[ind]),
                         SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind]))

# Differential abundance analysis using the left throat data                       
ind1 <- meta$Site == 'Left' & depth >= 1000
linda.obj  <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex', 
           feature.dat.type = 'count', 
           prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
           p.adj.method = "BH", alpha = 0.1)
           
           


linda.plot(linda.obj, c('Smokey', 'Sexmale'),
           titles = c('Smoke: n v.s. y', 'Sex: female v.s. male'), 
           alpha = 0.1, lfc.cut = 1, legend = TRUE, directory = NULL,
            width = 11, height = 8)
            
rownames(linda.obj $output[[1]])[which(linda.obj $output[[1]]$reject)]


# Differential abundance analysis pooling both the left and right throat data 
# Mixed effects model is used 

ind  <- depth >= 1000
linda.obj <- linda(otu.tab[, ind], meta[ind, ], formula = '~Smoke+Sex+(1|SubjectID)',
           feature.dat.type = 'count', 
           prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
           p.adj.method = "BH", alpha = 0.1)

       
    
# For proportion data   
otu.tab.p <- t(t(otu.tab) / colSums(otu.tab))
ind1 <- meta$Site == 'Left' & depth >= 1000
lind.obj  <- linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex', 
           feature.dat.type = 'proportion', 
           prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03,
           p.adj.method = "BH", alpha = 0.1)


[Package MicrobiomeStat version 1.2 Index]