R: Load a SqueezeMeta project into R

loadSQM {SQMtools}

R Documentation

Load a SqueezeMeta project into R

Description

This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.

Usage

loadSQM(
  project_path,
  tax_mode = "prokfilter",
  trusted_functions_only = FALSE,
  engine = "data.table"
)

Arguments

`project_path`	character, project directory generated by SqueezeMeta, or zip file generated by `sqm2zip.py`.
`tax_mode`	character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use `allfilter` for applying the minimum identity threshold to all taxa, `prokfilter` for applying the threshold to Bacteria and Archaea, but not to Eukaryotes, and `nofilter` for applying no thresholds at all (default `prokfilter`).
`trusted_functions_only`	logical. If `TRUE`, only highly trusted functional annotations (best hit + best average) will be considered when generating aggregated function tables. If `FALSE`, best hit annotations will be used (default `FALSE`). This only has an effect if the `project_dir/results/tables` directory is not already present.
`engine`	character. Engine used to load the ORFs and contigs tables. Either `data.frame` or `data.table` (significantly faster if your project is large). Default `data.table`.
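
As a minimal sketch of how these arguments combine, the call below requests the strictest taxonomy filter, trusted functional annotations only, and the `data.frame` engine. The path `project_dir` is a placeholder (an assumption, not a real project), and the call is guarded so that nothing is loaded when SQMtools or the path is unavailable.

```r
# Placeholder path (assumption): replace with your SqueezeMeta output
# directory or a zip file produced by sqm2zip.py.
project_path <- "project_dir"

# The three documented values for tax_mode.
tax_modes <- c("allfilter", "prokfilter", "nofilter")
tax_mode <- match.arg("allfilter", tax_modes)

# Only attempt the load when SQMtools is installed and the path exists.
if (requireNamespace("SQMtools", quietly = TRUE) && file.exists(project_path)) {
  SQM <- SQMtools::loadSQM(
    project_path,
    tax_mode = tax_mode,
    trusted_functions_only = TRUE,
    engine = "data.frame"
  )
}
```

Passing the arguments by name keeps the call readable and makes it obvious which defaults are being overridden.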

Value

SQM object containing the parsed project.

Prerequisites

Run SqueezeMeta! An example call for running it would be:

/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir

The SQM object structure

The SQM object is a nested list which contains the following information:

lvl1	lvl2	lvl3	type	rows/names	columns	data
$orfs	$table		dataframe	orfs	misc. data	misc. data
	$abund		numeric matrix	orfs	samples	abundances (reads)
	$bases		numeric matrix	orfs	samples	abundances (bases)
	$cov		numeric matrix	orfs	samples	coverages
	$cpm		numeric matrix	orfs	samples	covs. / 10^6 reads
	$tpm		numeric matrix	orfs	samples	tpm
	$seqs		character vector	orfs	(n/a)	sequences
	$tax		character matrix	orfs	tax. ranks	taxonomy
$contigs	$table		dataframe	contigs	misc. data	misc. data
	$abund		numeric matrix	contigs	samples	abundances (reads)
	$bases		numeric matrix	contigs	samples	abundances (bases)
	$cov		numeric matrix	contigs	samples	coverages
	$cpm		numeric matrix	contigs	samples	covs. / 10^6 reads
	$tpm		numeric matrix	contigs	samples	tpm
	$seqs		character vector	contigs	(n/a)	sequences
	$tax		character matrix	contigs	tax. ranks	taxonomy
	$bins		character matrix	contigs	bin. methods	bins
$bins	$table		dataframe	bins	misc. data	misc. data
	$length		numeric vector	bins	(n/a)	length
	$abund		numeric matrix	bins	samples	abundances (reads)
	$percent		numeric matrix	bins	samples	percentages
	$bases		numeric matrix	bins	samples	abundances (bases)
	$cov		numeric matrix	bins	samples	coverages
	$cpm		numeric matrix	bins	samples	covs. / 10^6 reads
	$tax		character matrix	bins	tax. ranks	taxonomy
$taxa	$superkingdom	$abund	numeric matrix	superkingdoms	samples	abundances (reads)
		$percent	numeric matrix	superkingdoms	samples	percentages
	$phylum	$abund	numeric matrix	phyla	samples	abundances (reads)
		$percent	numeric matrix	phyla	samples	percentages
	$class	$abund	numeric matrix	classes	samples	abundances (reads)
		$percent	numeric matrix	classes	samples	percentages
	$order	$abund	numeric matrix	orders	samples	abundances (reads)
		$percent	numeric matrix	orders	samples	percentages
	$family	$abund	numeric matrix	families	samples	abundances (reads)
		$percent	numeric matrix	families	samples	percentages
	$genus	$abund	numeric matrix	genera	samples	abundances (reads)
		$percent	numeric matrix	genera	samples	percentages
	$species	$abund	numeric matrix	species	samples	abundances (reads)
		$percent	numeric matrix	species	samples	percentages
$functions	$KEGG	$abund	numeric matrix	KEGG ids	samples	abundances (reads)
		$bases	numeric matrix	KEGG ids	samples	abundances (bases)
		$cov	numeric matrix	KEGG ids	samples	coverages
		$cpm	numeric matrix	KEGG ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	KEGG ids	samples	tpm
		$copy_number	numeric matrix	KEGG ids	samples	avg. copies
	$COG	$abund	numeric matrix	COG ids	samples	abundances (reads)
		$bases	numeric matrix	COG ids	samples	abundances (bases)
		$cov	numeric matrix	COG ids	samples	coverages
		$cpm	numeric matrix	COG ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	COG ids	samples	tpm
		$copy_number	numeric matrix	COG ids	samples	avg. copies
	$PFAM	$abund	numeric matrix	PFAM ids	samples	abundances (reads)
		$bases	numeric matrix	PFAM ids	samples	abundances (bases)
		$cov	numeric matrix	PFAM ids	samples	coverages
		$cpm	numeric matrix	PFAM ids	samples	covs. / 10^6 reads
		$tpm	numeric matrix	PFAM ids	samples	tpm
		$copy_number	numeric matrix	PFAM ids	samples	avg. copies
$total_reads			numeric vector	samples	(n/a)	total reads
$misc	$project_name		character vector	(empty)	(n/a)	project name
	$samples		character vector	(empty)	(n/a)	samples
	$tax_names_long	$superkingdom	character vector	short names	(n/a)	full names
		$phylum	character vector	short names	(n/a)	full names
		$class	character vector	short names	(n/a)	full names
		$order	character vector	short names	(n/a)	full names
		$family	character vector	short names	(n/a)	full names
		$genus	character vector	short names	(n/a)	full names
		$species	character vector	short names	(n/a)	full names
	$tax_names_short		character vector	full names	(n/a)	short names
	$KEGG_names		character vector	KEGG ids	(n/a)	KEGG names
	$KEGG_paths		character vector	KEGG ids	(n/a)	KEGG hierarchy
	$COG_names		character vector	COG ids	(n/a)	COG names
	$COG_paths		character vector	COG ids	(n/a)	COG hierarchy
	$ext_annot_sources		character vector	(empty)	(n/a)	external databases
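
To make the nesting in the table concrete, the sketch below builds a small toy list with the same lvl1/lvl2/lvl3 shape for a few entries and then reads values back out of it. Every name and number in it is made up for illustration. It is not output from loadSQM, and the percentages are computed column-wise only to fill the slot, not as a statement of how SQMtools derives them.

```r
# Toy object mirroring part of the SQM nesting.
# All values are invented; this is NOT produced by loadSQM().
samples <- c("S1", "S2")
class_abund <- matrix(
  c(60, 40,
    30, 70),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Gammaproteobacteria", "Clostridia"), samples)
)
toy <- list(
  taxa = list(
    class = list(
      abund = class_abund,
      # Column-wise percentages, computed here only for illustration.
      percent = 100 * sweep(class_abund, 2, colSums(class_abund), "/")
    )
  ),
  total_reads = c(S1 = 90, S2 = 110),
  misc = list(project_name = "toy", samples = samples)
)

# lvl1 -> lvl2 -> lvl3 access, then index by row name and sample.
toy$taxa$class$abund["Clostridia", "S2"]
toy$taxa$class$percent["Gammaproteobacteria", ]

# Overview of the top two levels of the nested list.
str(toy, max.level = 2)
```

The same `$`-chained access followed by `[row, column]` indexing applies to the matrices of a real SQM object, with rows and columns as listed in the table.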

If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
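
As a hedged sketch of how one might discover which external databases ended up in an object, the snippet below uses a toy stand-in list rather than a real project. The `CAZy` entries and their contents are illustrative assumptions that follow the naming pattern described above.

```r
# Toy stand-in for an SQM object from a project run with a CAZy -extdb database.
# Names and values are illustrative only.
toy <- list(
  functions = list(KEGG = list(), COG = list(), PFAM = list(), CAZy = list()),
  misc = list(CAZy_names = c(GH13 = "Glycoside hydrolase family 13"))
)

# Classifications beyond the three built-in ones come from external databases.
builtin <- c("KEGG", "COG", "PFAM")
external <- setdiff(names(toy$functions), builtin)
external

# Look up the extended-name vector for each external database in $misc.
ext_names <- toy$misc[paste0(external, "_names")]
ext_names
```

On a real object, the same `setdiff` over `names(SQM$functions)` should list any external classifications that were loaded.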

Examples

## Not run: 
## (outside R)
## Run SqueezeMeta on the test data.
 /path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.

## End(Not run)

data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
# (Take the top eleven, then drop "Unclassified".)
topKEGG = names(sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE))[1:11]
topKEGG = topKEGG[topKEGG!="Unclassified"]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Gammaproteobacteria class across samples?
Hadza$taxa$class$percent["Gammaproteobacteria",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
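
The boxplot above treats every contig equally. As a rough sketch of what weighting by contig length would look like, the snippet below compares a plain mean with a length-weighted mean on a toy table. The values are made up, and the `Length` column name is an assumption that should be checked against `colnames(Hadza$contigs$table)` before applying this to real data.

```r
# Toy contigs table; values are invented. "GC perc" is the column used in the
# example above, while the "Length" column name is an assumption.
contigs <- data.frame(
  "GC perc" = c(40, 50, 60),
  "Length"  = c(1000, 3000, 6000),
  check.names = FALSE
)

# Unweighted mean: every contig counts the same.
gc_unweighted <- mean(contigs[, "GC perc"])

# Length-weighted mean: longer contigs contribute proportionally more.
gc_weighted <- weighted.mean(contigs[, "GC perc"], w = contigs[, "Length"])

c(unweighted = gc_unweighted, weighted = gc_weighted)
```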

[Package SQMtools version 1.6.3 Index]