cleanData {poolABC}

R Documentation

Import and clean a single file containing data in `popoolation2` format

Description

Imports data for two or four populations from a single file in the _rc format. The data are then split so that the number of major-allele reads, the number of minor-allele reads, the total depth of coverage and the remaining relevant information are kept in separate objects.

Usage

cleanData(file, pops, header = NA, remove = NA, min.minor = NA)

Arguments

`file`	is a character string indicating the path to the file you wish to import.
`pops`	is a vector with the index of the populations that should be imported. This function works for two or four populations and so this vector must have either length 2 or 4.
`header`	is a character vector containing the names for the columns. If set to NA (default), no column names will be added to the output.
`remove`	is a character vector where each entry is the name of a contig to remove from the imported dataset. If NA (default), all contigs are kept in the output.
`min.minor`	is the minimum allowed number of minor-allele reads, summed across all populations. Sites that do not meet this threshold are removed from the data. If NA (default), no sites are removed because of their number of minor-allele reads.

Details

The information in the _rc format is stored in an x/y format, where x is the number of observed reads and y is the coverage. The function first splits each string to separate the number of reads from the total coverage. It then compares the number of major-allele plus minor-allele reads with the total coverage and removes sites where the two values differ. Sites where any population has "N" as the reference character of its major allele are also removed. The function also ensures that the major allele is the same, and the most frequent, across all populations. Finally, if min.minor is supplied, sites where the total number of minor-allele reads is below that value are removed from the dataset.
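The splitting and filtering steps described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the toy matrices, the `split_xy` helper and the chosen threshold are assumptions made up for illustration, and only the read-count check and the min.minor filter are shown.

```r
# Toy input (made up for illustration): rows are sites, columns are populations.
# Each entry uses the x/y form: reads / coverage.
major <- cbind(pop1 = c("12/20", "7/15", "30/30"),
               pop2 = c("9/10", "14/14", "25/25"))
minor <- cbind(pop1 = c("8/20", "6/15", "0/30"),
               pop2 = c("1/10", "0/14", "0/25"))

# Hypothetical helper: extract one side of each "x/y" string as an integer matrix.
split_xy <- function(m, side = c("x", "y")) {
  side <- match.arg(side)
  # Drop everything after the slash to keep x, or everything before it to keep y.
  pattern <- if (side == "x") "/.*$" else "^.*/"
  matrix(as.integer(sub(pattern, "", m)), nrow = nrow(m), dimnames = dimnames(m))
}

rMajor <- split_xy(major, "x")
rMinor <- split_xy(minor, "x")
coverage <- split_xy(major, "y")

# Keep sites where major + minor reads equal the coverage in every population.
consistent <- apply(rMajor + rMinor == coverage, 1, all)

# Keep sites whose minor-allele reads, summed across populations, reach the threshold.
min.minor <- 1
enough_minor <- rowSums(rMinor) >= min.minor

keep <- consistent & enough_minor
# keep is TRUE, FALSE, FALSE: site 2 fails the coverage check (7 + 6 != 15)
# and site 3 has no minor-allele reads.
```

Subsetting `rMajor`, `rMinor` and `coverage` with `keep` would then leave only the sites that pass both filters.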

Note also that all non-biallelic sites, and sites where the sum of deletions across all populations is not zero, are removed from the dataset. Although this function can only import two or four populations at a time, you can choose which two or four populations to import. For instance, if the first population is defined as the first column with data in the x/y format, you might wish to import the data for the 5th and 6th populations, defined as the populations in the 6th and 7th columns. To do so, set the pops input to pops = c(5, 6).

Value

a list with the following elements:

`rMajor`	a matrix with the number of major-allele reads. Each row of this matrix is a different site and each column a different population.
`rMinor`	a matrix with the number of minor-allele reads. Each row of this matrix is a different site and each column a different population.
`coverage`	a matrix with the total coverage. Each row of this matrix is a different site and each column a different population.
`info`	a data frame with 5 columns containing: the contig name, the SNP position, the reference character of the SNP, and the reference characters of the major and minor alleles for each of the populations. Each row of this data frame corresponds to a different site.

Examples

# load the data from one rc file
data(rc1)
# clean and organize the data in this single file
cleanData(file = rc1, pops = 7:10)

[Package poolABC version 1.0.0 Index]