POMS_pipeline {POMS}    R Documentation

Main function to run POMS pipeline

Description

See details below.

Usage

POMS_pipeline(
  abun,
  func,
  tree,
  group1_samples = NULL,
  group2_samples = NULL,
  ncores = 1,
  pseudocount = 1,
  manual_BSNs = NULL,
  manual_balances = NULL,
  manual_BSN_dir = NULL,
  min_num_tips = 10,
  min_func_instances = 10,
  min_func_prop = 0.001,
  multinomial_min_FSNs = 5,
  derep_nodes = FALSE,
  jaccard_cutoff = 0.75,
  BSN_p_cutoff = 0.05,
  BSN_correction = "none",
  FSN_p_cutoff = 0.05,
  FSN_correction = "none",
  func_descrip_infile = NULL,
  multinomial_correction = "BH",
  detailed_output = FALSE,
  verbose = FALSE
)

Arguments

abun

dataframe of taxa abundances that are at the tips of the input tree. These taxa are usually individual genomes. The taxa need to be the rows and the samples the columns.

func

dataframe of the number of copies of each function that are encoded by each input taxon. This pipeline only considers the presence/absence of functions across taxa. Taxa (with row names intersecting with the "abun" table) should be the rows and the functions should be the columns.

tree

phylo object with tip labels that match the row names of the "abun" and "func" tables. This object is usually based on a newick-formatted tree that has been read into R with the ape R package.

group1_samples

character vector of column names of "abun" table that correspond to the first sample group. This grouping is used for testing for significant sample balances at each node. Required unless the "manual_BSN_dir" argument is set (i.e., if the binary directions of BSNs are specified manually).

group2_samples

same as "group1_samples", but corresponding to the second sample group.

ncores

integer specifying how many cores to use for parallelized sections of pipeline.

pseudocount

number added to all cells of "abun" table to avoid 0 values. Set this to 0 to disable it. Note that the balance tree approach will run into problems if any 0's are present.
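
The reason zeros cause trouble can be illustrated with a toy log-ratio. The sketch below is not the package's implementation: it assumes a simplified balance defined as the log of the ratio of geometric means on the two sides of a node, and the table and the helpers geo_mean and log_ratio are hypothetical.

```r
# Toy abundance table: taxa as rows, samples as columns (hypothetical data).
abun <- data.frame(
  s1 = c(10, 0, 5, 20),
  s2 = c(3, 8, 0, 12),
  row.names = c("taxonA", "taxonB", "taxonC", "taxonD")
)

# Geometric mean of a numeric vector.
geo_mean <- function(x) exp(mean(log(x)))

# Simplified log-ratio between two sets of tips in one sample.
# Assumption: a plain log-ratio of geometric means, without any
# scaling constant that a full balance calculation may apply.
log_ratio <- function(tab, lhs, rhs, sample) {
  log(geo_mean(tab[lhs, sample]) / geo_mean(tab[rhs, sample]))
}

lhs <- c("taxonA", "taxonB")
rhs <- c("taxonC", "taxonD")

# Without a pseudocount, the zero for taxonB in s1 makes the value non-finite.
raw_value <- log_ratio(abun, lhs, rhs, "s1")

# With a pseudocount of 1 (the default), every cell is positive.
adj_value <- log_ratio(abun + 1, lhs, rhs, "s1")
```

Here raw_value is non-finite because one side contains a zero, whereas adj_value is an ordinary number.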

manual_BSNs

optional vector of node names that match node labels of input tree. These nodes will be considered the set of balance-significant nodes, and the Wilcoxon tests will not be run. The group means of the balances at each node will still be used to determine which group has higher values. Note this requires that the "manual_balances" argument is also specified.

manual_balances

optional list of balance values which represent the balances at all tested nodes that resulted in the input to the manual_BSNs vector. This list must include balances for all nodes in the manual_BSNs vector as well as for all non-significant tested nodes. These node labels must all be present in the input tree. The required list format is the "balances" object in the output of compute_node_balances. Note, however, that any approach for computing balances could be used, as long as they are in this list format.

manual_BSN_dir

optional character vector specifying "group1" or "group2", depending on the direction of the BSN difference. This must be a named vector, with all names matching the set of nodes specified by the manual_BSNs argument. Although this requires that the exact labels "group1" or "group2" are specified, these categories could represent different binary divisions rather than strict sample groups. For instance, "group1" could be used to represent nodes where sample balances are positively associated with a continuous variable (rather than a discrete grouping), whereas "group2" could represent nodes where sample balances are negatively associated.
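
As a sketch of the continuous-variable case described above, the named vector could be built from the sign of an association statistic. The node labels and correlation values here are hypothetical, and the correlations are assumed to have been computed beforehand.

```r
# Hypothetical correlations between node balances and a continuous variable,
# one value per manually chosen BSN (names are made-up node labels).
node_rho <- c(n12 = 0.62, n47 = -0.55, n103 = 0.41)

manual_BSNs <- names(node_rho)

# Positive association -> "group1", negative association -> "group2".
manual_BSN_dir <- setNames(
  ifelse(node_rho > 0, "group1", "group2"),
  names(node_rho)
)

# Checks mirroring the requirements stated for this argument.
stopifnot(setequal(names(manual_BSN_dir), manual_BSNs))
stopifnot(all(manual_BSN_dir %in% c("group1", "group2")))
```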

min_num_tips

minimum number of tips on each side of the nodes that is required for them to be retained in the analysis. This argument is ignored if significant nodes are specified manually.

min_func_instances

minimum number of tips that must encode the function for it to be retained for the analysis.

min_func_prop

minimum proportion of tips that must encode the function for it to be retained for the analysis.
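
The two function filters can be sketched in base R as follows. This is an illustration only: it assumes that a function must pass both thresholds and that the thresholds are inclusive, neither of which is stated above. The table and cut-off values are hypothetical.

```r
# Toy function table: taxa as rows, functions as columns (copy numbers).
func <- data.frame(
  K00001 = c(1, 0, 2, 1, 0),
  K00002 = c(0, 0, 0, 1, 0),
  K00003 = c(3, 1, 1, 0, 1),
  row.names = paste0("taxon", 1:5)
)

# Cut-offs chosen for this small example, not the package defaults.
min_func_instances <- 2
min_func_prop <- 0.5

# Only presence/absence matters, so count the tips with at least one copy.
n_encoding <- colSums(func > 0)
prop_encoding <- n_encoding / nrow(func)

# Assumption: both criteria must hold for a function to be retained.
keep <- n_encoding >= min_func_instances & prop_encoding >= min_func_prop
func_filt <- func[, keep, drop = FALSE]
```

In this toy table, K00002 is encoded by a single tip and is dropped, while the other two functions are retained.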

multinomial_min_FSNs

The minimum number of FSNs required to run a multinomial test for a given function.

derep_nodes

boolean value specifying whether nodes should be dereplicated based on similar sets of underlying tips (EXPERIMENTAL setting). More specifically, whether nodes should be clustered based on how similar their underlying tips are (given a Jaccard index cut-off, specified separately), with only the node with the fewest underlying tips retained per cluster.

jaccard_cutoff

Numeric vector of length 1. Must be between 0 and 1 (inclusive). Corresponds to the Jaccard cut-off used for clustering nodes based on similar sets of underlying tips (when derep_nodes = TRUE).
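
For reference, the Jaccard index between two sets of tips is the size of their intersection divided by the size of their union. The sketch below uses hypothetical node and tip labels and does not reproduce the package's clustering step. Whether the cut-off comparison is inclusive is not stated above.

```r
# Jaccard index between two character vectors of tip labels.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical sets of tips underlying three nodes.
node_tips <- list(
  nodeA = c("t1", "t2", "t3", "t4"),
  nodeB = c("t1", "t2", "t3"),
  nodeC = c("t7", "t8")
)

j_ab <- jaccard(node_tips$nodeA, node_tips$nodeB)  # 3 shared tips out of 4
j_ac <- jaccard(node_tips$nodeA, node_tips$nodeC)  # no shared tips
```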

BSN_p_cutoff

significance cut-off for identifying BSNs.

BSN_correction

multiple-test correction to use on Wilcoxon test p-values when identifying BSNs. Must be in p.adjust.methods.

FSN_p_cutoff

significance cut-off for identifying FSNs.

FSN_correction

multiple-test correction to use on Fisher's exact test p-values when identifying FSNs. Must be in p.adjust.methods.
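
The accepted correction names are those listed in p.adjust.methods from the stats package. The sketch below, using made-up p-values, shows how the default "none" compares with "BH" at a 0.05 cut-off.

```r
# Made-up raw p-values, e.g. one per tested node.
p_raw <- c(0.001, 0.012, 0.031, 0.045, 0.2)

# Valid values for the *_correction arguments.
valid_methods <- p.adjust.methods

p_none <- p.adjust(p_raw, method = "none")  # leaves the p-values unchanged
p_bh <- p.adjust(p_raw, method = "BH")      # Benjamini-Hochberg adjustment

n_sig_none <- sum(p_none < 0.05)
n_sig_bh <- sum(p_bh < 0.05)
```

With these values, fewer tests pass the cut-off after the BH adjustment than with no correction.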

func_descrip_infile

optional path to a mapping file of function IDs (column 1) to descriptions (column 2). This should be tab-delimited with no header and one function per line. If this option is specified, an additional description column will be added to the output table.
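
A file in the expected layout can be created and inspected as sketched below. The function IDs and descriptions are placeholders, and the read.table call only illustrates the format; it is not necessarily how the pipeline parses the file.

```r
# Write a small two-column, tab-delimited file with no header.
map_path <- tempfile(fileext = ".tsv")
writeLines(
  c("FUNC1\tExample description one",
    "FUNC2\tExample description two"),
  map_path
)

# Read it back to confirm the layout: column 1 = ID, column 2 = description.
func_descrip <- read.table(
  map_path,
  sep = "\t",
  header = FALSE,
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE,
  col.names = c("func", "description")
)
```

The resulting path (map_path here) is what would be passed as func_descrip_infile.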

multinomial_correction

multiple-test correction to use on raw multinomial test p-values. Must be in p.adjust.methods.

detailed_output

boolean flag to indicate that several intermediate objects should be included in the final output. This is useful when troubleshooting issues, but is not expected to be useful for most users.
The additional results include:

  • balance_comparisons (summary of Wilcoxon tests on balances)

  • func_enrichments (Fisher's exact test output for all functions at each node)

  • input_param (a list containing the specified input parameters)

verbose

boolean flag to indicate that log information should be written to the console.

Details

Identifies significant nodes based on sample balances, using a Wilcoxon test by default. Alternatively, significant nodes can be manually specified. Either way, significant nodes based on sample balances are referred to as Balance-Significant Nodes (BSNs).
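
The per-node balance test can be sketched in base R as shown below. This is an illustration rather than the package's code: it assumes a two-sided Wilcoxon rank-sum test and uses the group means to assign the direction. The balance values and sample names are hypothetical.

```r
# Hypothetical balances at one node, one value per sample.
balances <- c(s1 = 1.8, s2 = 2.1, s3 = 1.5, s4 = 2.4,
              s5 = -0.3, s6 = 0.2, s7 = -0.9, s8 = 0.4)

group1_samples <- c("s1", "s2", "s3", "s4")
group2_samples <- c("s5", "s6", "s7", "s8")

g1 <- balances[group1_samples]
g2 <- balances[group2_samples]

# Two-sided Wilcoxon rank-sum test comparing the two sample groups.
wilcox_out <- wilcox.test(g1, g2)

# Direction of the difference, based on the group means.
direction <- if (mean(g1) > mean(g2)) "group1" else "group2"

# The node counts as a BSN if the p-value is below the cut-off.
is_BSN <- wilcox_out$p.value < 0.05
```

In practice, the p-values from all tested nodes would first be adjusted with the method given by BSN_correction before applying BSN_p_cutoff.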

Fisher's exact tests are run at each node in the tree with sufficient numbers of underlying tips on each side to test for functional enrichment. Significant nodes based on this test are referred to as Function-Significant Nodes (FSNs). The set of FSNs is determined independently for each tested function.
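
The enrichment test at a single node can be sketched as follows. The tip data are hypothetical, "lhs" and "rhs" are illustrative labels for the two sides of the node, and the package's own implementation may differ in detail.

```r
# Hypothetical node with 12 tips on each side; 'encodes' marks whether
# each tip encodes the function being tested.
tips <- data.frame(
  side = c(rep("lhs", 12), rep("rhs", 12)),
  encodes = c(rep(TRUE, 10), rep(FALSE, 2), rep(TRUE, 2), rep(FALSE, 10))
)

# 2 x 2 contingency table: side of the node by presence of the function.
tab <- table(side = tips$side, encodes = tips$encodes)

fisher_out <- fisher.test(tab)

# Which side has the higher proportion of tips encoding the function.
prop_lhs <- mean(tips$encodes[tips$side == "lhs"])
prop_rhs <- mean(tips$encodes[tips$side == "rhs"])
enriched_side <- if (prop_lhs > prop_rhs) "lhs" else "rhs"

# The node counts as an FSN for this function if p is below the cut-off.
is_FSN <- fisher_out$p.value < 0.05
```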

The key output is the tally of the intersecting nodes based on the sets of BSNs and FSNs.

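Each FSN can be categorized in one of three ways:

  • It does not intersect with a BSN.

  • It intersects with a BSN, and the function is enriched in the taxa that are relatively more abundant in group 1 samples.

  • It intersects with a BSN, and the function is enriched in the taxa that are relatively more abundant in group 2 samples.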

A multinomial test is run to see if the number of FSNs of each type is significantly different from the random expectation.
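
The idea behind an exact multinomial test on three FSN categories can be sketched in base R as shown below. This is not the package's implementation: the helper exact_multinomial_p, the category names, the observed counts, and the expected proportions are all hypothetical, and how the pipeline derives the random expectation is not shown here.

```r
# Exact multinomial test for three categories: sum the probabilities of all
# outcomes that are no more likely than the observed one under the
# expected proportions.
exact_multinomial_p <- function(obs, expected_prob) {
  n <- sum(obs)
  # Enumerate every way to split n FSNs across three categories.
  grid <- expand.grid(a = 0:n, b = 0:n)
  grid <- grid[grid$a + grid$b <= n, ]
  grid$c <- n - grid$a - grid$b
  probs <- apply(grid, 1, function(x) dmultinom(x, prob = expected_prob))
  p_obs <- dmultinom(obs, prob = expected_prob)
  # Small relative tolerance guards against floating-point ties.
  sum(probs[probs <= p_obs * (1 + 1e-7)])
}

# Hypothetical tally of FSNs for one function.
obs_counts <- c(no_BSN = 1, BSN_group1 = 7, BSN_group2 = 0)

# Hypothetical expected proportions under the random expectation.
expected_prob <- c(0.8, 0.1, 0.1)

multinomial_p <- exact_multinomial_p(obs_counts, expected_prob)
```

Full enumeration is practical here because the number of FSNs per function is typically small. The raw p-values across functions would then be adjusted with the method given by multinomial_correction.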

Value

list containing (at minimum) these elements:


[Package POMS version 1.0.1 Index]