cleanData {poolABC} | R Documentation |
Import and clean a single file containing data in popoolation2 format
Description
Imports data for two or four populations from a single file containing data in the _rc format. The data is then split so that the number of major-allele reads, the number of minor-allele reads, the total depth of coverage and the remaining relevant information are kept in separate matrices.
Usage
cleanData(file, pops, header = NA, remove = NA, min.minor = NA)
Arguments
file |
is a character string indicating the path to the file you wish to import. |
pops |
is a vector with the index of the populations that should be imported. This function works for two or four populations and so this vector must have either length 2 or 4. |
header |
is a character vector containing the names for the columns. If set to NA (default), no column names will be added to the output. |
remove |
is a character vector where each entry is the name of a contig to be removed from the imported dataset. If NA (default), all contigs will be kept in the output. |
min.minor |
is the minimum allowed number of reads with the minor allele across all populations. Sites where this threshold is not met are removed from the data. The default (NA) means that no sites will be removed because of their number of minor-allele reads. |
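The effect of the remove and min.minor arguments can be pictured with a minimal base R sketch. This is not the package's internal code: the toy objects info and rMinor, the contig names and the read counts are made up for illustration, and summing minor-allele reads over populations is one reading of "across all populations".

```r
# Toy site table and minor-allele read counts (made-up values).
# Rows are sites, columns of rMinor are populations.
info <- data.frame(contig = c("c1", "c1", "c2", "c3"),
                   position = c(10, 25, 7, 42))
rMinor <- matrix(c(3, 0, 2, 1,
                   1, 1, 4, 0), ncol = 2)

# Hypothetical argument values.
remove <- c("c2")
min.minor <- 2

# Drop sites on unwanted contigs.
keep_contig <- !(info$contig %in% remove)

# Drop sites whose total minor-allele read count is below the threshold
# (assumption: the total is the sum over populations).
keep_minor <- rowSums(rMinor) >= min.minor

keep <- keep_contig & keep_minor
info[keep, ]
```

In this toy data only the first site survives both filters: the second and fourth sites have a single minor-allele read in total, and the third sits on the removed contig.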
Details
The information in the _rc format is stored in an x/y format, where x
represents the observed reads and y is the coverage. The initial step of
this function splits this string to separate the number of reads from the
total coverage. Then, the number of major plus minor allele reads is compared
to the total coverage, and sites where the two values are not equal are removed
from the dataset. Additionally, sites where any of the populations has an "N"
as the reference character of its major allele are removed from the data.
This function also ensures that the major allele is the same and the most
frequent across all populations. Finally, if the min.minor input is
supplied, sites where the total number of minor-allele reads is below the
specified number will be removed from the dataset.
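The first two steps described above (splitting the x/y strings, then checking that major plus minor reads equal the coverage) can be sketched in base R as follows. This is only an illustration under stated assumptions, not the function's actual implementation: split_xy is a hypothetical helper, the x/y values are made up, and the coverage is taken from the y part of the major-allele strings.

```r
# Toy x/y strings for three sites (rows) and two populations (columns).
major_xy <- matrix(c("8/10", "5/9", "7/7",
                     "6/10", "4/8", "9/9"), ncol = 2)
minor_xy <- matrix(c("2/10", "3/9", "0/7",
                     "4/10", "4/8", "0/9"), ncol = 2)

# Hypothetical helper: extract x (part = 1) or y (part = 2) from "x/y",
# returning a numeric matrix with the same shape as the input.
split_xy <- function(m, part) {
  out <- vapply(strsplit(m, "/", fixed = TRUE),
                function(s) as.numeric(s[part]), numeric(1))
  matrix(out, nrow = nrow(m), ncol = ncol(m))
}

rMajor <- split_xy(major_xy, 1)
rMinor <- split_xy(minor_xy, 1)
coverage <- split_xy(major_xy, 2)

# Keep only sites where major + minor reads match the coverage
# in every population.
keep <- rowSums(rMajor + rMinor != coverage) == 0

rMajor <- rMajor[keep, , drop = FALSE]
rMinor <- rMinor[keep, , drop = FALSE]
coverage <- coverage[keep, , drop = FALSE]
```

Here the second site is dropped because, in the first population, 5 major plus 3 minor reads do not add up to a coverage of 9.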
Note also that all non-biallelic sites and sites where the sum of deletions
in all populations is not zero will be removed from the dataset. Although
this function can only import 2 or 4 populations at a time, it is possible
to define which two or four populations to import. For instance, if we define
the first population as the first column for which we have data in the x/y
format, then you could wish to import the data for the 5th and 6th
populations, defined as the populations in the 6th and 7th columns. To do so,
you should define the pops input as pops = c(5, 6).
Value
a list with the following elements:
rMajor |
a matrix with the number of major-allele reads. Each row of this matrix is a different site and each column a different population. |
rMinor |
a matrix with the number of minor-allele reads. Each row of this matrix is a different site and each column a different population. |
coverage |
a matrix with the total coverage. Each row of this matrix is a different site and each column a different population. |
info |
a data frame with 5 columns containing: the contig name, the SNP position, the reference character of the SNP, and the reference characters of the major and minor alleles for each of the populations. Each row of this data frame corresponds to a different site. |
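As a rough illustration of how the returned matrices line up, the sketch below builds a stand-in list with the same element names and derives a per-site, per-population minor-allele frequency. The list res and its values are invented for the example; a real result would come from cleanData and would also carry the info data frame, which is left out here.

```r
# Stand-in for the output of cleanData (made-up counts):
# two sites (rows) and two populations (columns).
res <- list(
  rMajor = matrix(c(8, 7, 6, 9), ncol = 2),
  rMinor = matrix(c(2, 0, 4, 0), ncol = 2),
  coverage = matrix(c(10, 7, 10, 9), ncol = 2)
)

# The three matrices share the same dimensions, so element-wise
# arithmetic gives one value per site and population.
freq <- res$rMinor / res$coverage
freq
```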
Examples
# load the data from one rc file
data(rc1)
# clean and organize the data in this single file
cleanData(file = rc1, pops = 7:10)