importContigs {poolABC} | R Documentation |
Import multiple files containing data in PoPoolation2 format
Description
Imports multiple files containing data in PoPoolation2 format and organizes that information into separate entries for each contig.
Usage
importContigs(
path,
pops,
files = NA,
header = NA,
remove = NA,
min.minor = NA,
filter = FALSE,
threshold = NA
)
Arguments
path |
is a character string indicating the path to the folder where the data you wish to import is located. |
pops |
is a vector with the index of the populations that should be imported. This function works for two or four populations and so this vector must have either length 2 or 4. |
files |
is an integer or a numeric vector with the index of the files you wish to import. If NA (default), all files in the folder are imported. |
header |
is a character vector containing the names for the columns. If set to NA (default), no column names will be added to the output. |
remove |
is a character vector where each entry is the name of a contig to be removed from the imported dataset. If NA (default), all contigs are kept in the output. |
min.minor |
is the minimum allowed number of minor-allele reads, summed across all populations. Sites where this threshold is not met are removed from the data. |
filter |
is a logical switch. If TRUE, the data is filtered by the frequency of the minor allele; if FALSE (default), that filter is not applied. |
threshold |
is the minimum allowed frequency for the minor allele. Sites where the allelic frequency is below this threshold are removed from the data. |
Details
The data from two or four populations is split so that the number of major-allele reads, the number of minor-allele reads, the total depth of coverage and the remaining relevant information are kept in separate list entries. Sites where the sum of the major-allele and minor-allele reads does not match the total coverage are removed, as are sites where any population has an "N" as the reference character of its major allele. This function also ensures that the major allele is the same, and is the most frequent allele, across all populations. Finally, all non-biallelic sites and all sites where the sum of deletions across populations is not zero are removed from the dataset.
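The coverage consistency check described above can be illustrated with a minimal sketch. This is not the package's internal code: the matrices and the names `keep` and `rMajor_ok` are hypothetical toy values, chosen only to show which sites would fail the check.

```r
# Toy read counts for one contig (made-up values).
# Rows are sites, columns are two populations; matrix() fills column-wise.
rMajor   <- matrix(c(18, 20, 15,
                     17, 22, 14), ncol = 2)
rMinor   <- matrix(c(2, 0, 5,
                     3, 1, 6), ncol = 2)
coverage <- matrix(c(20, 20, 21,
                     20, 23, 20), ncol = 2)

# A site is kept only if major + minor reads equal the coverage
# in every population (assumed reading of the rule in the text).
keep <- rowSums(rMajor + rMinor != coverage) == 0

# Drop the inconsistent sites; drop = FALSE keeps the matrix shape.
rMajor_ok <- rMajor[keep, , drop = FALSE]
```

Here the third site is dropped because its reads in the first population sum to 20 while the reported coverage is 21.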
If the min.minor input is supplied, sites where the total number of minor-allele reads is below the specified number are removed from the dataset. Alternatively, if the filter input is set to TRUE, the data is filtered by the frequency of the minor allele. If a threshold is supplied, the computed frequency is compared to that threshold and sites where the frequency is below the threshold are removed from the dataset. If no threshold is supplied, the threshold is assumed to be 1/total coverage, meaning that a site must have at least one minor-allele read.
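As a rough illustration of the two filters, the sketch below applies them to toy counts. It is not the package's implementation. In particular, it assumes that the minor-allele frequency is computed from reads pooled across populations, which the text does not state explicitly; all variable names are hypothetical.

```r
# Toy counts (made-up values): rows are sites, columns are two populations.
rMinor   <- matrix(c(0, 1, 4,
                     0, 2, 6), ncol = 2)
coverage <- matrix(c(20, 25, 20,
                     20, 25, 20), ncol = 2)

total_minor    <- rowSums(rMinor)    # minor-allele reads across populations
total_coverage <- rowSums(coverage)  # depth across populations

# min.minor-style filter: require at least 2 minor-allele reads per site.
keep_min_minor <- total_minor >= 2

# Frequency filter with an explicit threshold of 0.1
# (assumption: frequency is pooled across populations).
minor_freq     <- total_minor / total_coverage
keep_threshold <- minor_freq >= 0.1

# Frequency filter with the default threshold of 1 / total coverage,
# which amounts to requiring at least one minor-allele read.
keep_default <- minor_freq >= 1 / total_coverage
```

With these values, the first site has no minor-allele reads and fails every filter, the second site passes the read-count filter and the default threshold but not the 0.1 threshold, and the third site passes all three.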
Finally, the name of each contig is used to organize the information on a per-contig basis, so every element of the output is organized by contig. For example, the list with the number of minor-allele reads contains several entries, and each of those entries corresponds to a different contig.
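One way to picture the per-contig organization is with base R's split(). The sketch below is only an illustration under assumed inputs: the contig names, positions and counts are hypothetical, and the package may organize its data differently internally.

```r
# Hypothetical per-site data: contig label, SNP position, minor-allele reads.
contig    <- c("ContigA", "ContigA", "ContigB", "ContigB", "ContigB")
positions <- c(120, 340, 15, 78, 402)
rMinor    <- matrix(c(2, 0, 5, 1, 3,
                      3, 1, 6, 0, 2), ncol = 2)

# One vector of positions per contig.
positions_by_contig <- split(positions, contig)

# One matrix of minor-allele reads per contig: split the row indices by
# contig, then subset the matrix (drop = FALSE keeps single-site contigs
# as matrices).
rMinor_by_contig <- lapply(split(seq_along(contig), contig),
                           function(idx) rMinor[idx, , drop = FALSE])
```

Each resulting list has one named entry per contig, so `rMinor_by_contig$ContigB` is a matrix with three rows (sites) and two columns (populations).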
Value
a list with six named entries:
freqs |
a list with the allele frequencies, computed by dividing the number of minor-allele reads by the total coverage. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population. |
positions |
a list with the positions of each SNP. Each entry of this list is a vector corresponding to a different contig. |
range |
a list with the minimum and maximum SNP position of each contig. Each entry of this list is a vector corresponding to a different contig. |
rMajor |
a list with the number of major-allele reads. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population. |
rMinor |
a list with the number of minor-allele reads. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population. |
coverage |
a list with the total coverage. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population. |
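To make the shape of the returned list concrete, the sketch below builds a toy object with the same six entry names and shows how the entries relate to each other. The values and the name `toy` are hypothetical; this is not output produced by importContigs.

```r
# Toy data for a single hypothetical contig with two sites and two populations.
rMinor    <- list(ContigA = matrix(c(2, 0, 3, 1), ncol = 2))
coverage  <- list(ContigA = matrix(c(20, 20, 30, 25), ncol = 2))
positions <- list(ContigA = c(120, 340))

# Major-allele reads are whatever remains of the coverage at biallelic sites.
rMajor <- Map(function(depth, minor) depth - minor, coverage, rMinor)

# Allele frequencies: minor-allele reads divided by total coverage.
freqs <- Map(function(minor, depth) minor / depth, rMinor, coverage)

# Minimum and maximum SNP position of each contig.
snp_range <- lapply(positions, range)

# A list with the same six entry names as described above.
toy <- list(freqs = freqs, positions = positions, range = snp_range,
            rMajor = rMajor, rMinor = rMinor, coverage = coverage)

# Frequencies for one contig: rows are sites, columns are populations.
toy$freqs$ContigA
```

Indexing first by entry name and then by contig name, as in `toy$freqs$ContigA`, returns the per-contig matrix or vector.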
See Also
For more details see the poolABC vignette:
vignette("poolABC", package = "poolABC")
Examples
# this function should be used to import your data
# you should include the path to the folder where your PoPoolation2 data is located
# this creates a variable with the path for the toy example data
mypath <- system.file('extdata', package = 'poolABC')
# an example of how to import data for two populations from all files
importContigs(path = mypath, pops = c(8, 10))
# to remove contigs from the data
importContigs(path = mypath, pops = c(8, 10), remove = "Contig1708")