R: Main function to run POMS pipeline

POMS_pipeline {POMS}

R Documentation

Main function to run POMS pipeline

Description

See details below.

Usage

POMS_pipeline(
  abun,
  func,
  tree,
  group1_samples = NULL,
  group2_samples = NULL,
  ncores = 1,
  pseudocount = 1,
  manual_BSNs = NULL,
  manual_balances = NULL,
  manual_BSN_dir = NULL,
  min_num_tips = 10,
  min_func_instances = 10,
  min_func_prop = 0.001,
  multinomial_min_FSNs = 5,
  derep_nodes = FALSE,
  jaccard_cutoff = 0.75,
  BSN_p_cutoff = 0.05,
  BSN_correction = "none",
  FSN_p_cutoff = 0.05,
  FSN_correction = "none",
  func_descrip_infile = NULL,
  multinomial_correction = "BH",
  detailed_output = FALSE,
  verbose = FALSE
)
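
The three required inputs have to agree on taxon and sample labels. The following is a minimal sketch of checks one might run before calling the pipeline, not part of POMS itself. The toy tables, the `tip_labels` vector (standing in for `tree$tip.label` of a phylo object) and the helper `check_poms_inputs` are hypothetical. The requirement that the two sample groups do not overlap is an assumption.

```r
# Hypothetical toy inputs; all labels below are illustrative only.
tip_labels <- paste0("taxon", 1:4)  # stands in for tree$tip.label

# Taxa are the rows and samples are the columns.
abun <- data.frame(
  sampleA = c(10, 0, 3, 7),
  sampleB = c(2, 5, 0, 9),
  sampleC = c(0, 1, 8, 4),
  sampleD = c(6, 6, 2, 0),
  row.names = tip_labels
)

# Taxa are the rows and functions are the columns (copy numbers).
func <- data.frame(
  funcX = c(1, 0, 2, 0),
  funcY = c(0, 0, 1, 1),
  row.names = tip_labels
)

group1_samples <- c("sampleA", "sampleB")
group2_samples <- c("sampleC", "sampleD")

# Hypothetical helper: stops with an error if the inputs are inconsistent.
check_poms_inputs <- function(abun, func, tip_labels,
                              group1_samples, group2_samples) {
  stopifnot(
    is.data.frame(abun),
    is.data.frame(func),
    # Every taxon in the abundance table should be a tip in the tree.
    all(rownames(abun) %in% tip_labels),
    # The function table must share taxa with the abundance table.
    length(intersect(rownames(abun), rownames(func))) > 0,
    # Sample groups must be column names of the abundance table.
    all(c(group1_samples, group2_samples) %in% colnames(abun)),
    # Assumption: the two groups should not share samples.
    length(intersect(group1_samples, group2_samples)) == 0
  )
  invisible(TRUE)
}

inputs_ok <- check_poms_inputs(abun, func, tip_labels,
                               group1_samples, group2_samples)

# With POMS installed and `tree` a phylo object, the call would then be:
# out <- POMS_pipeline(abun = abun, func = func, tree = tree,
#                      group1_samples = group1_samples,
#                      group2_samples = group2_samples)
```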

Arguments

`abun`	dataframe of taxa abundances that are at the tips of the input tree. These taxa are usually individual genomes. The taxa need to be the rows and the samples the columns.
`func`	dataframe of the number of copies of each function that are encoded by each input taxon. This pipeline only considers the presence/absence of functions across taxa. Taxa (with row names intersecting with the "abun" table) should be the rows and the functions should be the columns.
`tree`	phylo object with tip labels that match the row names of the "abun" and "func" tables. This object is usually based on a newick-formatted tree that has been read into R with the ape R package.
`group1_samples`	character vector of column names of "abun" table that correspond to the first sample group. This grouping is used for testing for significant sample balances at each node. Required unless the "manual_BSN_dir" argument is set (i.e., if the binary directions of BSNs are specified manually).
`group2_samples`	same as "group1_samples", but corresponding to the second sample group.
`ncores`	integer specifying how many cores to use for parallelized sections of pipeline.
`pseudocount`	number added to all cells of the "abun" table to avoid 0 values. Set this to 0 to disable it. Note that the balance tree approach will run into issues if any 0's are present.
`manual_BSNs`	optional vector of node names that match node labels of input tree. These nodes will be considered the set of balance-significant nodes, and the Wilcoxon tests will not be run. The group means of the balances at each node will still be used to determine which group has higher values. Note this requires that the "manual_balances" argument is also specified.
`manual_balances`	optional list of balance values at all tested nodes, i.e., the nodes from which the manual_BSNs vector was derived. This list must include balances for every node in the manual_BSNs vector as well as for all non-significant tested nodes. These node labels must all be present in the input tree. The required list format is that of the "balances" object in the output of compute_node_balances. Note, however, that any approach for computing balances could be used, as long as the balances are in this list format.
`manual_BSN_dir`	optional character vector specifying "group1" or "group2", depending on the direction of the BSN difference. This must be a named vector, with all names matching the set of nodes specified by the manual_BSNs argument. Although this requires that the exact labels "group1" or "group2" are specified, these categories could represent different binary divisions rather than strict sample groups. For instance, "group1" could be used to represent nodes where sample balances are positively associated with a continuous variable (rather than a discrete grouping), whereas "group2" could represent nodes where sample balances are negatively associated.
`min_num_tips`	minimum number of tips required on each side of a node for it to be retained in the analysis. This argument is ignored if significant nodes are specified manually.
`min_func_instances`	minimum number of tips that must encode the function for it to be retained for the analysis.
`min_func_prop`	minimum proportion of tips that must encode the function for it to be retained for the analysis.
`multinomial_min_FSNs`	The minimum number of FSNs required to run a multinomial test for a given function.
`derep_nodes`	boolean value specifying whether nodes should be dereplicated based on similar sets of underlying tips (EXPERIMENTAL setting). More specifically, whether nodes should be clustered based on how similar their underlying tips are (given a Jaccard index cut-off, specified separately with jaccard_cutoff), with only the node with the fewest underlying tips retained per cluster.
`jaccard_cutoff`	Numeric vector of length 1. Must be between 0 and 1 (inclusive). Corresponds to the Jaccard cut-off used for clustering nodes based on similar sets of underlying tips (when derep_nodes = TRUE).
`BSN_p_cutoff`	significance cut-off for identifying BSNs.
`BSN_correction`	multiple-test correction to use on Wilcoxon test p-values when identifying BSNs. Must be in p.adjust.methods.
`FSN_p_cutoff`	significance cut-off for identifying FSNs.
`FSN_correction`	multiple-test correction to use on Fisher's exact test p-values when identifying FSNs. Must be in p.adjust.methods.
`func_descrip_infile`	optional path to mapfile of function ids (column 1) to descriptions (column 2). This should be tab-delimited with no header and one function per line. If this option is specified then an additional description column will be added to the output table.
`multinomial_correction`	multiple-test correction to use on raw multinomial test p-values. Must be in p.adjust.methods.
`detailed_output`	boolean flag to indicate that several intermediate objects should be included in the final output. This is useful when troubleshooting issues, but is not expected to be useful for most users. The additional results include: balance_comparisons (summary of Wilcoxon tests on balances), func_enrichments (Fisher's exact test output for all functions at each node), and input_param (a list containing the specified input parameters).
`verbose`	boolean flag to indicate that log information should be written to the console.
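
The derep_nodes and jaccard_cutoff arguments describe clustering nodes by the Jaccard similarity of their underlying tips and keeping the node with the fewest tips per cluster. The following is a minimal base-R sketch of that idea, not the POMS implementation. The node and tip labels are hypothetical, and the use of single-linkage clustering is an assumption, since the clustering method is not stated here.

```r
# Hypothetical tips underlying four nodes.
node_tips <- list(
  n1 = paste0("t", 1:10),
  n2 = paste0("t", 1:9),    # nearly the same tips as n1
  n3 = paste0("t", 11:20),
  n4 = paste0("t", 5:14)
)

# Jaccard index: size of the intersection over size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise Jaccard similarity between all nodes.
node_names <- names(node_tips)
idx <- seq_along(node_tips)
jac <- outer(idx, idx, Vectorize(function(i, j) {
  jaccard(node_tips[[i]], node_tips[[j]])
}))
dimnames(jac) <- list(node_names, node_names)

# Assumption: single-linkage clustering on Jaccard distance (1 - similarity),
# cutting the tree so that merges at similarity >= cut-off stay together.
jaccard_cutoff <- 0.75
clusters <- cutree(
  hclust(as.dist(1 - jac), method = "single"),
  h = 1 - jaccard_cutoff
)

# Per cluster, retain the node with the fewest underlying tips.
n_tips <- lengths(node_tips)
kept_nodes <- unname(sapply(split(node_names, clusters), function(members) {
  members[which.min(n_tips[members])]
}))
```

In this toy case n1 and n2 share 9 of 10 tips (Jaccard index 0.9), so they fall into one cluster and only n2 is retained alongside n3 and n4.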

Details

Identifies significant nodes based on sample balances, using a Wilcoxon test by default. Alternatively, significant nodes can be manually specified. Either way, significant nodes based on sample balances are referred to as Balance-Significant Nodes (BSNs).
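
As an illustration of this step, the sketch below computes a sample balance at a single node and compares it between two groups with a Wilcoxon test. It is not the POMS code. The abundance values are made up, and the balance is assumed to be an isometric log-ratio of the taxa on either side of the node; the exact formula used by POMS is not given here. The helper `node_balance` is hypothetical.

```r
# Hypothetical abundances: 4 taxa (rows) x 8 samples (columns).
# matrix() fills column by column, so each row of numbers below is one sample.
abun <- matrix(
  c(50, 40,  5,  6,
    60, 45,  4,  7,
    55, 52,  6,  3,
    70, 48,  8,  5,
     5,  7, 60, 40,
     3,  9, 52, 47,
     8,  4, 66, 58,
     6,  2, 49, 71),
  nrow = 4,
  dimnames = list(paste0("taxon", 1:4), paste0("s", 1:8))
)

# Add a pseudocount so that no log of zero is taken.
pseudocount <- 1
abun_pc <- abun + pseudocount

# Assumed balance definition (isometric log-ratio): scaled difference in
# mean log abundance between the tips on each side of the node.
node_balance <- function(abun, left_tips, right_tips) {
  r <- length(left_tips)
  s <- length(right_tips)
  log_abun <- log(abun)
  mean_left <- colMeans(log_abun[left_tips, , drop = FALSE])
  mean_right <- colMeans(log_abun[right_tips, , drop = FALSE])
  sqrt(r * s / (r + s)) * (mean_left - mean_right)
}

balances <- node_balance(abun_pc,
                         left_tips = c("taxon1", "taxon2"),
                         right_tips = c("taxon3", "taxon4"))

group1_samples <- paste0("s", 1:4)
group2_samples <- paste0("s", 5:8)

# Wilcoxon test on the balances, then the chosen multiple-test correction
# (a no-op here: one node and method "none").
wilcox_out <- wilcox.test(balances[group1_samples], balances[group2_samples])
p_adj <- p.adjust(wilcox_out$p.value, method = "none")
is_BSN <- p_adj < 0.05

# Direction: which group has the higher mean balance.
BSN_dir <- if (mean(balances[group1_samples]) > mean(balances[group2_samples])) {
  "group1"
} else {
  "group2"
}
```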

To test for functional enrichment, Fisher's exact tests are run at each node in the tree that has a sufficient number of underlying tips on each side. Significant nodes based on this test are referred to as Function-Significant Nodes (FSNs). The set of FSNs is determined independently for each tested function.
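
The enrichment test at one node can be sketched as follows. This is an illustration under stated assumptions, not the POMS implementation. The function IDs, tip labels and the helper `enrichment_test` are hypothetical, and it is assumed that a function must pass both the min_func_instances and min_func_prop filters to be retained.

```r
# Hypothetical copy-number table: 20 taxa (rows) x 3 functions (columns).
tips <- paste0("t", 1:20)
func <- data.frame(
  K00001 = c(rep(2, 9), 0, 1, rep(0, 9)),                   # 9 of t1-t10, 1 of t11-t20
  K00002 = c(1, rep(0, 19)),                                # encoded by a single tip
  K00003 = c(rep(1, 5), rep(0, 5), rep(1, 5), rep(0, 5)),   # evenly spread
  row.names = tips
)

# Only presence/absence is considered, not copy number.
func_present <- func > 0

# Filter rare functions (assumption: both thresholds must be met).
min_func_instances <- 10
min_func_prop <- 0.001
n_instances <- colSums(func_present)
keep <- n_instances >= min_func_instances &
  (n_instances / nrow(func_present)) >= min_func_prop
func_kept <- func_present[, keep, drop = FALSE]

# Tips on each side of one hypothetical node.
left_tips <- paste0("t", 1:10)
right_tips <- paste0("t", 11:20)

# 2 x 2 table of node side by presence/absence, tested with fisher.test().
enrichment_test <- function(present, left_tips, right_tips) {
  counts <- matrix(
    c(sum(present[left_tips]), sum(!present[left_tips]),
      sum(present[right_tips]), sum(!present[right_tips])),
    nrow = 2, byrow = TRUE,
    dimnames = list(side = c("left", "right"),
                    func = c("present", "absent"))
  )
  fisher.test(counts)
}

fisher_p <- sapply(colnames(func_kept), function(f) {
  enrichment_test(func_kept[, f], left_tips, right_tips)$p.value
})

# Apply the chosen correction and significance cut-off.
fisher_p_adj <- p.adjust(fisher_p, method = "none")
is_FSN <- fisher_p_adj < 0.05
```

Here K00002 is dropped by the filter, K00001 is significantly enriched on the left side of the node, and K00003 is not enriched on either side.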

The key output is the tally of the intersecting nodes based on the sets of BSNs and FSNs.

Each FSN can be categorized in one of three ways:

It does not intersect with any BSN.
It intersects with a BSN and the functional enrichment is within the taxa that are relatively more abundant in group 1 samples.
It intersects with a BSN and the functional enrichment is within the taxa that are relatively more abundant in group 2 samples.
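
The three categories above can be sketched for a single function as follows. This is illustrative only. The node labels, the "numerator"/"denominator" side labels, the "nonBSN" category label and the helper `categorize_FSN` are hypothetical. It is also assumed that a BSN direction of "group1" means the taxa on the numerator side of the balance are relatively more abundant in group 1 samples.

```r
# Hypothetical BSNs: names are node labels, values give the group with
# the higher sample balances.
BSNs <- c(n3 = "group1", n7 = "group2", n8 = "group1")

# Hypothetical FSNs for one function, with the side of each node on which
# the function is enriched.
FSN_side <- c(n3 = "numerator", n5 = "denominator", n7 = "numerator",
              n8 = "denominator", n9 = "numerator")

categorize_FSN <- function(node, enriched_side, BSNs) {
  # Category 1: the FSN does not intersect with any BSN.
  if (!node %in% names(BSNs)) {
    return("nonBSN")
  }
  bsn_dir <- BSNs[[node]]
  if (enriched_side == "numerator") {
    # Enriched taxa are those relatively more abundant in the BSN's group.
    bsn_dir
  } else {
    # Enriched taxa are relatively more abundant in the other group.
    setdiff(c("group1", "group2"), bsn_dir)
  }
}

categories <- mapply(categorize_FSN, names(FSN_side), FSN_side,
                     MoreArgs = list(BSNs = BSNs))

# Tally of FSNs per category, in a fixed order.
category_counts <- table(factor(categories,
                                levels = c("nonBSN", "group1", "group2")))
```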

A multinomial test is run to see if the number of FSNs of each type is significantly different from the random expectation.
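
For intuition, an exact multinomial test on three categories can be written in base R by enumerating every possible outcome. The sketch below sums the probabilities of all outcomes that are no more likely than the observed one. This is one common definition of the test and is an assumption here, since the test statistic used by POMS is not specified in this page. The counts, the expected proportions and the helper `exact_multinomial_p` are hypothetical.

```r
# Exact multinomial test for three categories (probability-ordered).
exact_multinomial_p <- function(observed, expected_prop) {
  stopifnot(length(observed) == 3, length(expected_prop) == 3)
  n <- sum(observed)
  obs_prob <- dmultinom(observed, size = n, prob = expected_prop)
  p_value <- 0
  # Enumerate all (a, b, n - a - b) outcomes with the same total.
  for (a in 0:n) {
    for (b in 0:(n - a)) {
      outcome_prob <- dmultinom(c(a, b, n - a - b), size = n,
                                prob = expected_prop)
      # Small relative tolerance guards against floating-point ties.
      if (outcome_prob <= obs_prob * (1 + 1e-7)) {
        p_value <- p_value + outcome_prob
      }
    }
  }
  p_value
}

# Hypothetical expected proportions of the three FSN categories:
# no BSN intersection, group 1 direction, group 2 direction.
expected_prop <- c(0.6, 0.2, 0.2)

# Hypothetical FSN tallies for two functions.
p_skewed <- exact_multinomial_p(c(1, 8, 1), expected_prop)   # mostly group 1
p_typical <- exact_multinomial_p(c(6, 2, 2), expected_prop)  # matches expectation

# Raw p-values across functions would then be corrected, e.g. with "BH".
p_corrected <- p.adjust(c(p_skewed, p_typical), method = "BH")
```

Full enumeration is only practical for small FSN counts, which is one reason a minimum number of FSNs per function and a dedicated test implementation are sensible in practice.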

Value

list containing (at minimum) these elements:

results: dataframe with each tested function as a row and the numbers of FSNs of each type as columns, as well as the multinomial test output.
balance_info: list containing the tips underlying each node (on which the balances are based), the balances themselves at each tested node, and the set of nodes that were determined to be negligible due to having too few underlying tips. Note that the balances and underlying tips are provided for all non-negligible (i.e., tested) nodes, not just those identified as BSNs. Additional information on the dereplication and Jaccard similarity of nodes is also returned when derep_nodes = TRUE.
BSNs: character vector with BSNs as names and values of "group1" and "group2" to indicate for which sample group (or other binary division) the sample balances were higher.
FSNs_summary: list containing each tested function as a separate element. The labels for nodes in each FSN category of the multinomial test are listed per function (or are empty if there were no such FSNs).
tree: the prepped tree used by the pipeline, including the added node labels if a tree lacking labels was provided. This tree will also have been subset to only those tips found in the abundance table, and midpoint rooted (if it was not already rooted).
multinomial_exp_prop: expected proportions of the three FSN categories used for multinomial test.

[Package POMS version 1.0.1 Index]