loadSQM {SQMtools} | R Documentation |
Load a SqueezeMeta project into R
Description
This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.
Usage
loadSQM(
project_path,
tax_mode = "prokfilter",
trusted_functions_only = FALSE,
engine = "data.table"
)
Arguments
project_path |
character, project directory generated by SqueezeMeta, or zip file generated by |
tax_mode |
character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use |
trusted_functions_only |
logical. If |
engine |
character. Engine used to load the ORFs and contigs tables. Either |
Value
SQM object containing the parsed project.
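As a minimal sketch of calling the function defensively, the wrapper below spells out the defaults listed under Usage and returns NULL when the package or the path is missing. The helper name load_project_if_available is hypothetical and not part of SQMtools.

```r
# Hypothetical helper (not part of SQMtools): load a project only when both
# the package and the path are available, otherwise return NULL.
load_project_if_available <- function(project_path) {
  if (!requireNamespace("SQMtools", quietly = TRUE)) {
    message("SQMtools is not installed; skipping load.")
    return(NULL)
  }
  if (!file.exists(project_path)) {
    message("Project path not found: ", project_path)
    return(NULL)
  }
  # Arguments written out with the default values shown under Usage.
  SQMtools::loadSQM(
    project_path,
    tax_mode = "prokfilter",
    trusted_functions_only = FALSE,
    engine = "data.table"
  )
}

# A path that does not exist yields NULL instead of an error.
project <- load_project_if_available("path/that/does/not/exist")
is.null(project)
```

Returning NULL keeps scripts that loop over several candidate project paths from stopping at the first missing one; whether that is preferable to an error depends on the workflow.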
Prerequisites
Run SqueezeMeta! An example call for running it would be:
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir
The SQM object structure
The SQM object is a nested list which contains the following information:
lvl1 | lvl2 | lvl3 | type | rows/names | columns | data |
$orfs | $table | | dataframe | orfs | misc. data | misc. data |
| $abund | | numeric matrix | orfs | samples | abundances (reads) |
| $bases | | numeric matrix | orfs | samples | abundances (bases) |
| $cov | | numeric matrix | orfs | samples | coverages |
| $cpm | | numeric matrix | orfs | samples | covs. / 10^6 reads |
| $tpm | | numeric matrix | orfs | samples | tpm |
| $seqs | | character vector | orfs | (n/a) | sequences |
| $tax | | character matrix | orfs | tax. ranks | taxonomy |
$contigs | $table | | dataframe | contigs | misc. data | misc. data |
| $abund | | numeric matrix | contigs | samples | abundances (reads) |
| $bases | | numeric matrix | contigs | samples | abundances (bases) |
| $cov | | numeric matrix | contigs | samples | coverages |
| $cpm | | numeric matrix | contigs | samples | covs. / 10^6 reads |
| $tpm | | numeric matrix | contigs | samples | tpm |
| $seqs | | character vector | contigs | (n/a) | sequences |
| $tax | | character matrix | contigs | tax. ranks | taxonomies |
| $bins | | character matrix | contigs | bin. methods | bins |
$bins | $table | | dataframe | bins | misc. data | misc. data |
| $length | | numeric vector | bins | (n/a) | length |
| $abund | | numeric matrix | bins | samples | abundances (reads) |
| $percent | | numeric matrix | bins | samples | abundances (reads) |
| $bases | | numeric matrix | bins | samples | abundances (bases) |
| $cov | | numeric matrix | bins | samples | coverages |
| $cpm | | numeric matrix | bins | samples | covs. / 10^6 reads |
| $tax | | character matrix | bins | tax. ranks | taxonomy |
$taxa | $superkingdom | $abund | numeric matrix | superkingdoms | samples | abundances (reads) |
| | $percent | numeric matrix | superkingdoms | samples | percentages |
| $phylum | $abund | numeric matrix | phyla | samples | abundances (reads) |
| | $percent | numeric matrix | phyla | samples | percentages |
| $class | $abund | numeric matrix | classes | samples | abundances (reads) |
| | $percent | numeric matrix | classes | samples | percentages |
| $order | $abund | numeric matrix | orders | samples | abundances (reads) |
| | $percent | numeric matrix | orders | samples | percentages |
| $family | $abund | numeric matrix | families | samples | abundances (reads) |
| | $percent | numeric matrix | families | samples | percentages |
| $genus | $abund | numeric matrix | genera | samples | abundances (reads) |
| | $percent | numeric matrix | genera | samples | percentages |
| $species | $abund | numeric matrix | species | samples | abundances (reads) |
| | $percent | numeric matrix | species | samples | percentages |
$functions | $KEGG | $abund | numeric matrix | KEGG ids | samples | abundances (reads) |
| | $bases | numeric matrix | KEGG ids | samples | abundances (bases) |
| | $cov | numeric matrix | KEGG ids | samples | coverages |
| | $cpm | numeric matrix | KEGG ids | samples | covs. / 10^6 reads |
| | $tpm | numeric matrix | KEGG ids | samples | tpm |
| | $copy_number | numeric matrix | KEGG ids | samples | avg. copies |
| $COG | $abund | numeric matrix | COG ids | samples | abundances (reads) |
| | $bases | numeric matrix | COG ids | samples | abundances (bases) |
| | $cov | numeric matrix | COG ids | samples | coverages |
| | $cpm | numeric matrix | COG ids | samples | covs. / 10^6 reads |
| | $tpm | numeric matrix | COG ids | samples | tpm |
| | $copy_number | numeric matrix | COG ids | samples | avg. copies |
| $PFAM | $abund | numeric matrix | PFAM ids | samples | abundances (reads) |
| | $bases | numeric matrix | PFAM ids | samples | abundances (bases) |
| | $cov | numeric matrix | PFAM ids | samples | coverages |
| | $cpm | numeric matrix | PFAM ids | samples | covs. / 10^6 reads |
| | $tpm | numeric matrix | PFAM ids | samples | tpm |
| | $copy_number | numeric matrix | PFAM ids | samples | avg. copies |
$total_reads | | | numeric vector | samples | (n/a) | total reads |
$misc | $project_name | | character vector | (empty) | (n/a) | project name |
| $samples | | character vector | (empty) | (n/a) | samples |
| $tax_names_long | $superkingdom | character vector | short names | (n/a) | full names |
| | $phylum | character vector | short names | (n/a) | full names |
| | $class | character vector | short names | (n/a) | full names |
| | $order | character vector | short names | (n/a) | full names |
| | $family | character vector | short names | (n/a) | full names |
| | $genus | character vector | short names | (n/a) | full names |
| | $species | character vector | short names | (n/a) | full names |
| $tax_names_short | | character vector | full names | (n/a) | short names |
| $KEGG_names | | character vector | KEGG ids | (n/a) | KEGG names |
| $KEGG_paths | | character vector | KEGG ids | (n/a) | KEGG hierarchy |
| $COG_names | | character vector | COG ids | (n/a) | COG names |
| $COG_paths | | character vector | COG ids | (n/a) | COG hierarchy |
| $ext_annot_sources | | character vector | COG ids | (n/a) | external databases |
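To illustrate how the lvl1, lvl2 and lvl3 columns map onto chained `$` access, here is a minimal sketch on a toy list that mimics a small part of the structure. The object toy, its sample names, feature ids and all numbers are invented for illustration; a real SQM object holds many more elements.

```r
# Toy stand-in for a small part of an SQM object; all values are invented.
samples <- c("S1", "S2")
toy <- list(
  functions = list(
    KEGG = list(
      # rows are KEGG ids, columns are samples
      abund = matrix(c(10, 0, 5, 20, 4, 1), nrow = 3,
                     dimnames = list(c("K_a", "K_b", "Unclassified"), samples))
    )
  ),
  taxa = list(
    class = list(
      # rows are classes, columns are samples
      percent = matrix(c(60, 40, 25, 75), nrow = 2,
                       dimnames = list(c("Bacilli", "Clostridia"), samples))
    )
  ),
  total_reads = c(S1 = 1000, S2 = 2000),
  misc = list(
    project_name = "toy",
    samples = samples,
    KEGG_names = c(K_a = "name A", K_b = "name B")
  )
)

# lvl1 -> lvl2 -> lvl3 is reached by chaining `$`.
kegg_abund <- toy$functions$KEGG$abund
dim(kegg_abund)

# A single row of a matrix is selected by its row name.
toy$taxa$class$percent["Clostridia", ]

# Elements without a lvl2 entry sit directly under the object.
toy$total_reads[toy$misc$samples]
```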
If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
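The naming pattern described here can be sketched on a toy object as follows. The object, the feature id famA and all values are invented, and treating every entry other than KEGG, COG and PFAM as an external database is an assumption based on the structure table above.

```r
# Toy object: "CAZy" stands for a database supplied via -extdb; values are invented.
toy <- list(
  functions = list(
    KEGG = list(abund = matrix(1, 1, 1)),
    COG  = list(abund = matrix(1, 1, 1)),
    PFAM = list(abund = matrix(1, 1, 1)),
    CAZy = list(abund = matrix(c(3, 7), nrow = 1,
                               dimnames = list("famA", c("S1", "S2"))))
  ),
  misc = list(CAZy_names = c(famA = "toy description"))
)

# Assumption: entries other than the three listed classifications are external.
listed <- c("KEGG", "COG", "PFAM")
external <- setdiff(names(toy$functions), listed)
external

# Extended names follow the "<database>_names" pattern under $misc.
for (db in external) {
  print(toy$misc[[paste0(db, "_names")]])
}
```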
Examples
## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.
## End(Not run)
data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
keggTPM = sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE)
topKEGG = head(setdiff(names(keggTPM), "Unclassified"), 10) # Drop "Unclassified" first, then keep ten
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Gammaproteobacteria class across samples?
Hadza$taxa$class$percent["Gammaproteobacteria",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
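As its comment notes, the boxplot above is not weighted. A minimal sketch of a length-weighted mean GC content on a toy table follows. The "GC perc" column is the one used in the example above, while the "Length" column name and all values are assumptions for illustration; colnames(Hadza$contigs$table) shows the columns actually available.

```r
# Toy contigs table; "Length" is an assumed column name and all values are invented.
contigs <- data.frame(
  "GC perc" = c(40, 60, 50),
  "Length"  = c(1000, 3000, 1000),
  check.names = FALSE,  # keep the space in "GC perc"
  row.names = c("contig_1", "contig_2", "contig_3")
)

# Plain mean: every contig counts equally.
unweighted_gc <- mean(contigs[, "GC perc"])

# Weighted mean: each contig contributes in proportion to its length.
weighted_gc <- weighted.mean(contigs[, "GC perc"], w = contigs[, "Length"])

unweighted_gc
weighted_gc
```

In this toy table the longest contig has the highest GC content, so the weighted mean is pulled above the plain mean.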