R: A linear Model-based Permutation Test for Differential...

ZicoSeq {GUniFrac}

R Documentation

A linear Model-based Permutation Test for Differential Abundance Analysis of Microbiome Data and Other Omics Data

Description

ZicoSeq is a permutation test (Smith permutation) for differential abundance analysis of microbiome sequencing data. The input can be a count or a proportion matrix. When a count matrix is provided, it provides an option to draw posterior samples of the underlying proportions to account for the sampling variability during the sequencing process. The test results are aggregated over these posterior samples. For both count and proportion data, a reference-based ratio approach is used to account for compositional effects. As a general methodology, ZicoSeq can also be applied to differential analysis of other omics data. In this case, they are not treated as compositional data.

Usage

ZicoSeq(
  meta.dat,
  feature.dat,
  grp.name,
  adj.name = NULL,
  feature.dat.type = c('count', 'proportion', 'other'),
  prev.filter = 0,
  mean.abund.filter = 0,
  max.abund.filter = 0,
  min.prop = 0,
  is.winsor = TRUE,
  outlier.pct = 0.03,
  winsor.end = c('top', 'bottom', 'both'),
  is.post.sample = TRUE,
  post.sample.no = 25,
  link.func = list(function(x) sign(x) * (abs(x))^0.5),
  stats.combine.func = max,
  perm.no = 99,
  strata = NULL,
  ref.pct = 0.5,
  stage.no = 6,
  excl.pct = 0.2,
  p.max = 500,
  is.fwer = FALSE,
  verbose = TRUE,
  return.feature.dat = TRUE
)

Arguments

`meta.dat`	a data frame containing the sample meta data.
`feature.dat`	a matrix of feature data, row - features (OTUs, genes, etc) , column - samples.
`grp.name`	the name for the variable of interest. It could be numeric or categorical; should be in `meta.dat`.
`adj.name`	the name(s) for the variable(s) to be adjusted. Multiple variables are allowed. They could be numeric or categorical; should be in `meta.dat`.
`feature.dat.type`	the type of the feature data. It could be "count", "proportion" or "other". For "proportion" data type, posterior sampling will not be performed, but the reference-based ratio approach will still be used to address compositional effects. For "other" data type, neither posterior sampling or reference-base ratio approach will be used.
`prev.filter`	the prevalence (percentage of nonzeros) cutoff, under which the features will be filtered. The default is 0.
`mean.abund.filter`	the mean relative abundance cutoff, under which the features will be filtered. The default is 0.
`max.abund.filter`	the max relative abundance cutoff, under which the features will be filtered. The default is 0.
`min.prop`	proportions less than this value will be replaced with this value. Only relevant when log transformation is used. Default is 0.
`is.winsor`	a logical value indicating whether winsorization should be performed to replace outliers. The default is TRUE.
`outlier.pct`	the expected percentage of outliers. These outliers will be winsorized. The default is 0.03. For count/proportion data, `outlier.pct` should be less than `prev.filter`.
`winsor.end`	a character indicating whether the outliers at the "top", "bottom" or "both" will be winsorized. The default is "top". If the `feature.dat.type` is "other", "both" may be considered.
`is.post.sample`	a logical value indicating whether to perform posterior sampling of the underlying proportions. Only relevant when the feature data are counts.
`post.sample.no`	the number of posterior samples if posterior sampling is used. The default is 25.
`link.func`	a list of transformation functions for the feature data or the ratios. Based on our experience, square-root transformation is a robust choice for many datasets.
`perm.no`	the number of permutations. If the raw p values are of the major interest, set `perm.no` to at least 999.
`strata`	a factor such as subject IDs indicating the permutation strata or characters indicating the strata variable in `meta.dat`. Permutation will be confined to each stratum. This can be used for paired or some longitudinal designs.
`stats.combine.func`	function to combine the F-statistic for the omnibus test. The default is `max`.
`ref.pct`	percentage of reference taxa. The default is 0.5.
`p.max`	the maximum number of (most abundant) taxa to be considered in reference taxa selection; only relevant when the nubmer of taxa is huge. The default is 500, i.e., when the nubmer of taxa is larger than 500, only the 500 most abundant taxa will be used for reference selection. This is to reduce the computational time.
`stage.no`	the number of stages if multiple-stage normalization is used. The default is 6.
`excl.pct`	the maximum percentage of significant features (nominal p-value < 0.05) in the reference set that should be removed. Only relevant when multiple-stage normalization is used.
`is.fwer`	a logical value indicating whether the family-wise error rate control (West-Young) should be performed.
`verbose`	a logical value indicating whether the trace information should be printed out.
`return.feature.dat`	a logical value indicating whether the winsorized, filtered "feature.dat" matrix should be returned.

Details

ZicoSeq is a linear model-based permutation test developed for differential abundance analysis of zero-inflated compositional data. Although its development is motivated by zero-inflated microbiome sequence count data, it can be applied to proportion (composition) data and more generally to other types of omics data. Currently, it has the following components: 1. Winsorization to decrease the influence of outliers; 2. Posterior sampling based on a beta mixture prior to address sampling variability and zero inflation; 3. Reference-based multiple-stage normalization to address compositional effects; 4. An omnibus test to address diverse feature-covariate relationships; 5. Permutation-based false discovery rate control / family-wise error rate control for multiple testing correction, which takes into account the correlation structure in the feature data.

Value

A list with the elements

`call`	the call
`feature.dat`	the winsorized, filtered `feature.dat` matrix.
`meta.dat`	`meta.dat` used.
`grp.name`	the name of the variable of interest.
`filter.features`	a vector of the names of the features that are filtered.
`ref.features`	a vector of the names of the reference features. Only relevant when reference approach is used.
`R2`	a matrix of percent explained variance (number of features by number of transformation functions).
`F0`	a matrix of F-statistics (number of features by number of transformation functions).
`RSS`	a matrix of residual sum squares (number of features by number of transformation functions).
`df.model`, `df.residual`	degrees of freedom for the model and residual space.
`coef.list`	a list of the linear regression coefficients under the specified transformations.
`p.raw`	the raw p-values based on permutations (not accurate if `perm.no` is small).
`p.adj.fdr`	permutation-based FDR-adjusted p-values.
`p.adj.fwer`	permutation-based FWER-adjusted (West-Young) p-values.

Author(s)

Jun Chen

References

Yang, L. & Chen, J. 2022. A comprehensive evaluation of differential abundance analysis methods: current status and potential solutions. Microbiome, 10(1), 1-23.

Examples


data(throat.otu.tab)
data(throat.tree)
data(throat.meta)

comm <- t(throat.otu.tab)
meta.dat <- throat.meta

set.seed(123)
# For count data
zico.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm, 
		grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "count",
		# Filter to remove rare taxa
		prev.filter = 0.2, mean.abund.filter = 0,  max.abund.filter = 0.002, min.prop = 0, 
		# Winsorization to replace outliers
		is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
		# Posterior sampling to impute zeros
		is.post.sample = TRUE, post.sample.no = 25, 
		# Multiple link functions to capture diverse taxon-covariate relation
		link.func = list(function (x) x^0.25, function (x) x^0.5, function (x) x^0.75), 
		stats.combine.func = max,
		# Permutation-based multiple testing correction
		perm.no = 99,  strata = NULL, 
		# Reference-based multiple stage normalization
		ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
		# Family-wise error rate control
		is.fwer = FALSE,
		verbose = TRUE, return.feature.dat = FALSE)

which(zico.obj$p.adj.fdr <= 0.05)

# For proportion data
comm.p <- t(t(comm) / colSums(comm))
zico.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.p, 
		grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "proportion",
		# Filter to remove rare taxa
		prev.filter = 0.2, mean.abund.filter = 0,  max.abund.filter = 0.002, min.prop = 0, 
		# Winsorization to replace outliers
		is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
		# Posterior sampling will be automatically disabled
		is.post.sample = FALSE, post.sample.no = 25, 
		# Use the square-root transformation
		link.func = list(function (x) x^0.5), stats.combine.func = max,
		# Permutation-based multiple testing correction
		perm.no = 99,  strata = NULL, 
		# Reference-based multiple stage normalization
		ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
		# Family-wise error rate control
		is.fwer = FALSE,
		verbose = TRUE, return.feature.dat = FALSE)

which(zico.obj$p.adj.fdr <= 0.05)


# For other type of data. The user should be responsible for the filtering.
comm.o <- comm[rowMeans(comm != 0) >= 0.2, ] + 1
comm.o <- log(t(t(comm.o) / colSums(comm.o)))
zico.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.o, 
		grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "other",
		# Filter will not be applied
		prev.filter = 0, mean.abund.filter = 0,  max.abund.filter = 0, min.prop = 0, 
		# Winsorization to both ends of the distribution 
		is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'both',
		# Posterior sampling will be automatically disabled
		is.post.sample = FALSE, post.sample.no = 25, 
		# Identity function is used
		link.func = list(function (x) x), stats.combine.func = max,
		# Permutation-based multiple testing correction
		perm.no = 99,  strata = NULL, 
		# Reference-based multiple-stage normalization will not be performed
		ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
		# Family-wise error rate control
		is.fwer = TRUE,
		verbose = TRUE, return.feature.dat = FALSE)

which(zico.obj$p.adj.fdr <= 0.05)

[Package GUniFrac version 1.8 Index]