| loadSQM {SQMtools} | R Documentation |
Load a SqueezeMeta project into R
Description
This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.
Usage
loadSQM(
project_path,
tax_mode = "prokfilter",
trusted_functions_only = FALSE,
engine = "data.table"
)
Arguments
project_path |
character, project directory generated by SqueezeMeta, or zip file generated by sqm2zip.py. |
tax_mode |
character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use |
trusted_functions_only |
logical. If |
engine |
character. Engine used to load the ORFs and contigs tables. Either |
Value
SQM object containing the parsed project.
Prerequisites
Run SqueezeMeta! An example call for running it would be:
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir
The SQM object structure
The SQM object is a nested list which contains the following information:
| lvl1 | lvl2 | lvl3 | type | rows/names | columns | data |
| $orfs | $table | | dataframe | orfs | misc. data | misc. data |
| | $abund | | numeric matrix | orfs | samples | abundances (reads) |
| | $bases | | numeric matrix | orfs | samples | abundances (bases) |
| | $cov | | numeric matrix | orfs | samples | coverages |
| | $cpm | | numeric matrix | orfs | samples | covs. / 10^6 reads |
| | $tpm | | numeric matrix | orfs | samples | tpm |
| | $seqs | | character vector | orfs | (n/a) | sequences |
| | $tax | | character matrix | orfs | tax. ranks | taxonomy |
| $contigs | $table | | dataframe | contigs | misc. data | misc. data |
| | $abund | | numeric matrix | contigs | samples | abundances (reads) |
| | $bases | | numeric matrix | contigs | samples | abundances (bases) |
| | $cov | | numeric matrix | contigs | samples | coverages |
| | $cpm | | numeric matrix | contigs | samples | covs. / 10^6 reads |
| | $tpm | | numeric matrix | contigs | samples | tpm |
| | $seqs | | character vector | contigs | (n/a) | sequences |
| | $tax | | character matrix | contigs | tax. ranks | taxonomies |
| | $bins | | character matrix | contigs | bin. methods | bins |
| $bins | $table | | dataframe | bins | misc. data | misc. data |
| | $length | | numeric vector | bins | (n/a) | length |
| | $abund | | numeric matrix | bins | samples | abundances (reads) |
| | $percent | | numeric matrix | bins | samples | abundances (reads) |
| | $bases | | numeric matrix | bins | samples | abundances (bases) |
| | $cov | | numeric matrix | bins | samples | coverages |
| | $cpm | | numeric matrix | bins | samples | covs. / 10^6 reads |
| | $tax | | character matrix | bins | tax. ranks | taxonomy |
| $taxa | $superkingdom | $abund | numeric matrix | superkingdoms | samples | abundances (reads) |
| | | $percent | numeric matrix | superkingdoms | samples | percentages |
| | $phylum | $abund | numeric matrix | phyla | samples | abundances (reads) |
| | | $percent | numeric matrix | phyla | samples | percentages |
| | $class | $abund | numeric matrix | classes | samples | abundances (reads) |
| | | $percent | numeric matrix | classes | samples | percentages |
| | $order | $abund | numeric matrix | orders | samples | abundances (reads) |
| | | $percent | numeric matrix | orders | samples | percentages |
| | $family | $abund | numeric matrix | families | samples | abundances (reads) |
| | | $percent | numeric matrix | families | samples | percentages |
| | $genus | $abund | numeric matrix | genera | samples | abundances (reads) |
| | | $percent | numeric matrix | genera | samples | percentages |
| | $species | $abund | numeric matrix | species | samples | abundances (reads) |
| | | $percent | numeric matrix | species | samples | percentages |
| $functions | $KEGG | $abund | numeric matrix | KEGG ids | samples | abundances (reads) |
| | | $bases | numeric matrix | KEGG ids | samples | abundances (bases) |
| | | $cov | numeric matrix | KEGG ids | samples | coverages |
| | | $cpm | numeric matrix | KEGG ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | KEGG ids | samples | tpm |
| | | $copy_number | numeric matrix | KEGG ids | samples | avg. copies |
| | $COG | $abund | numeric matrix | COG ids | samples | abundances (reads) |
| | | $bases | numeric matrix | COG ids | samples | abundances (bases) |
| | | $cov | numeric matrix | COG ids | samples | coverages |
| | | $cpm | numeric matrix | COG ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | COG ids | samples | tpm |
| | | $copy_number | numeric matrix | COG ids | samples | avg. copies |
| | $PFAM | $abund | numeric matrix | PFAM ids | samples | abundances (reads) |
| | | $bases | numeric matrix | PFAM ids | samples | abundances (bases) |
| | | $cov | numeric matrix | PFAM ids | samples | coverages |
| | | $cpm | numeric matrix | PFAM ids | samples | covs. / 10^6 reads |
| | | $tpm | numeric matrix | PFAM ids | samples | tpm |
| | | $copy_number | numeric matrix | PFAM ids | samples | avg. copies |
| $total_reads | | | numeric vector | samples | (n/a) | total reads |
| $misc | $project_name | | character vector | (empty) | (n/a) | project name |
| | $samples | | character vector | (empty) | (n/a) | samples |
| | $tax_names_long | $superkingdom | character vector | short names | (n/a) | full names |
| | | $phylum | character vector | short names | (n/a) | full names |
| | | $class | character vector | short names | (n/a) | full names |
| | | $order | character vector | short names | (n/a) | full names |
| | | $family | character vector | short names | (n/a) | full names |
| | | $genus | character vector | short names | (n/a) | full names |
| | | $species | character vector | short names | (n/a) | full names |
| | $tax_names_short | | character vector | full names | (n/a) | short names |
| | $KEGG_names | | character vector | KEGG ids | (n/a) | KEGG names |
| | $KEGG_paths | | character vector | KEGG ids | (n/a) | KEGG hierarchy |
| | $COG_names | | character vector | COG ids | (n/a) | COG names |
| | $COG_paths | | character vector | COG ids | (n/a) | COG hierarchy |
| | $ext_annot_sources | | character vector | COG ids | (n/a) | external databases |
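As a minimal sketch of how this nested list is navigated, the following builds a small mock object with the same layout and extracts elements with `$`. The sample names, taxa and counts are invented for illustration, the mock is a plain list rather than a real SQM object, and the per-sample percentage calculation is an assumption about how `$percent` relates to `$abund`.

```r
# Toy stand-in for an SQM object: same nesting, made-up values.
samples <- c("S1", "S2")
abund <- matrix(c(80, 20, 30, 70), nrow = 2,
                dimnames = list(c("Bacteroidetes", "Firmicutes"), samples))
mock <- list(
  taxa = list(
    phylum = list(
      abund   = abund,
      # Assumption: percentages are computed per sample (column).
      percent = 100 * sweep(abund, 2, colSums(abund), "/")
    )
  ),
  total_reads = c(S1 = 100, S2 = 100),
  misc = list(project_name = "mock", samples = samples)
)

# lvl1 -> lvl2 -> lvl3, followed by ordinary matrix indexing.
mock$taxa$phylum$percent["Firmicutes", ]

# List what is stored at a given level.
names(mock)
names(mock$taxa$phylum)
```

The same pattern applies to the other branches of the table: each `$` descends one level, and the final element is indexed like any matrix, vector or data frame.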
If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
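A minimal sketch of how one might find the available functional classifications and look up extended names for an external database, assuming the `<database>_names` pattern described above. The mock object, the feature IDs (`featA`, `featB`) and their descriptions are placeholders, not real SqueezeMeta output.

```r
# Toy stand-in with one external database ("CAZy", as in the text above).
mock <- list(
  functions = list(
    KEGG = list(abund = matrix(5, 1, 1, dimnames = list("K00001", "S1"))),
    CAZy = list(abund = matrix(c(12, 3), 2, 1,
                               dimnames = list(c("featA", "featB"), "S1")))
  ),
  misc = list(CAZy_names = c(featA = "placeholder name A",
                             featB = "placeholder name B"))
)

# Which functional classifications are available?
names(mock$functions)

# Most abundant feature in a given database, and its extended name.
db  <- "CAZy"
ab  <- mock$functions[[db]]$abund
top <- names(which.max(rowSums(ab)))
mock$misc[[paste0(db, "_names")]][top]
```

Using `[[db]]` with a character variable instead of `$CAZy` keeps the lookup generic, so the same lines work for any database name present in `$functions`.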
Examples
## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.
## End(Not run)
data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
# Take the top 11 so that ten remain if "Unclassified" is among them and gets dropped.
topKEGG = names(sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE))[1:11]
topKEGG = topKEGG[topKEGG!="Unclassified"]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Gammaproteobacteria class across samples?
Hadza$taxa$class$percent["Gammaproteobacteria",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
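As the comment on the boxplot notes, that summary treats every contig equally. A minimal sketch of a length-weighted alternative follows, using a toy data frame with invented values. The `Length` column name is an assumption here; check `colnames(Hadza$contigs$table)` for the actual name in your project.

```r
# Toy contigs table; values are invented for illustration.
contigs <- data.frame(
  "GC perc" = c(40, 60, 50),
  "Length"  = c(1000, 3000, 1000),  # assumed column name
  check.names = FALSE               # keep the space in "GC perc"
)

# Unweighted mean: every contig counts the same.
mean(contigs[, "GC perc"])

# Length-weighted mean: longer contigs contribute proportionally more.
weighted.mean(contigs[, "GC perc"], w = contigs[, "Length"])
```

In this toy table the long 60% GC contig pulls the weighted mean above the unweighted one, which is the kind of shift the unweighted boxplot would hide.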