R: Organize information by contig - for multiple data files

prepareData {poolABC}

R Documentation

Organize information by contig - for multiple data files

Description

Organize the information of multiple _rc files into different entries for each contig.

Usage

prepareData(data, nPops, filter = FALSE, threshold = NA)

Arguments

`data`	is a list with four named entries: "rMajor", "rMinor", "coverage" and "info". The `rMajor` entry should be a matrix containing the number of observed major-allele reads. The `rMinor` entry should be a matrix containing the number of observed minor-allele reads. The `coverage` entry should be a matrix containing the total depth of coverage. The `info` entry should be a matrix or a data frame containing the remaining relevant information, such as the contig name and the position of each SNP. In each of these matrices, every row should be a different site and every column a different population.
`nPops`	is an integer indicating the total number of different populations in the dataset.
`filter`	is a logical switch, either TRUE or FALSE. If TRUE, the data is filtered by the frequency of the minor allele; if FALSE, that filter is not applied.
`threshold`	is the minimum allowed frequency for the minor allele. Sites where the minor-allele frequency is below this threshold are removed from the data. It is only used when `filter` is TRUE; if left as NA, the threshold is assumed to be 1/total coverage.

Details

This function removes all monomorphic sites from the dataset. Monomorphic sites are those where the allele frequency is either 0 in every population or 1 in every population. Then, the name of each contig is used to organize the information on a per-contig basis, so each output is organized by contig. For example, the list with the number of minor-allele reads will contain several entries, and each of those entries corresponds to a different contig.
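
The two steps described above can be sketched in base R. This is only an illustration on made-up toy data, not the package's actual implementation: the object names (`contig`, `position`, `freqsByContig`, and so on) are hypothetical, and the sketch assumes that "monomorphic" means a frequency of 0 in every population or 1 in every population.

```r
# Toy data (invented for illustration): 5 sites, 2 populations.
# matrix() fills column by column, so each column is one population.
rMinor <- matrix(c(0, 2, 0, 5, 1,
                   0, 3, 0, 4, 0), ncol = 2)
coverage <- matrix(20, nrow = 5, ncol = 2)
contig <- c("c1", "c1", "c2", "c2", "c2")
position <- c(10, 25, 5, 40, 90)

# Minor-allele frequency for each site (row) and population (column).
freqs <- rMinor / coverage

# Assumption: a site is monomorphic when all populations are at 0 or all at 1.
mono <- apply(freqs, 1, function(f) all(f == 0) || all(f == 1))

# Drop monomorphic sites from every object, keeping matrices as matrices.
keep <- !mono
freqs <- freqs[keep, , drop = FALSE]
contig <- contig[keep]
position <- position[keep]

# Group row indices by contig name, then build one entry per contig.
idx <- split(seq_len(nrow(freqs)), contig)
freqsByContig <- lapply(idx, function(i) freqs[i, , drop = FALSE])
positionsByContig <- split(position, contig)
rangeByContig <- lapply(positionsByContig, range)
```

Using `drop = FALSE` keeps a contig with a single remaining SNP as a one-row matrix rather than collapsing it to a vector, so every entry has the same row-is-site, column-is-population layout.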

If the `filter` input is set to TRUE, this function also filters the data by the frequency of the minor allele. If a threshold is supplied, the computed frequency is compared to that threshold and sites where the frequency is below the threshold are removed from the dataset. If no threshold is supplied, the threshold is assumed to be 1/total coverage, meaning that a site should have at least one minor-allele read.
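
A minimal sketch of this filtering rule is shown below. The helper `filterByMaf` is hypothetical and not part of poolABC. The text does not say how the frequency is combined across populations, so the sketch assumes it is pooled: minor-allele reads summed over populations, divided by coverage summed over populations.

```r
# Toy data (invented for illustration): 3 sites, 2 populations.
rMinor <- matrix(c(1, 0, 6,
                   0, 0, 4), ncol = 2)
coverage <- matrix(c(20, 30, 25,
                     20, 30, 25), ncol = 2)

# Hypothetical helper: returns TRUE for the sites that pass the filter.
filterByMaf <- function(rMinor, coverage, threshold = NA) {
  totalMinor <- rowSums(rMinor)    # minor-allele reads pooled over populations
  totalCov <- rowSums(coverage)    # depth of coverage pooled over populations
  maf <- totalMinor / totalCov     # assumed pooled minor-allele frequency
  if (is.na(threshold)) {
    # No threshold supplied: require at least one minor-allele read per site.
    cutoff <- 1 / totalCov
  } else {
    cutoff <- rep(threshold, length(maf))
  }
  maf >= cutoff
}

keepDefault <- filterByMaf(rMinor, coverage)                   # 1/total coverage
keepStrict <- filterByMaf(rMinor, coverage, threshold = 0.05)  # fixed threshold
```

With the default cutoff, the first site (one minor-allele read in 40) is kept and the second (no minor-allele reads) is dropped; with a threshold of 0.05 only the third site (10 reads in 50) passes.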

Value

a list with six named entries:

`freqs`	a list with the allele frequencies, computed by dividing the number of minor-allele reads by the total coverage. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population.
`positions`	a list with the positions of each SNP. Each entry of this list is a vector corresponding to a different contig.
`range`	a list with the minimum and maximum SNP position of each contig. Each entry of this list is a vector corresponding to a different contig.
`rMajor`	a list with the number of major-allele reads. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population.
`rMinor`	a list with the number of minor-allele reads. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population.
`coverage`	a list with the total coverage. Each entry of this list corresponds to a different contig. Each entry is a matrix where each row is a different site and each column is a different population.
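
To make the shape of the returned list concrete, the sketch below builds a small mock object by hand and shows how it could be queried. The values and contig names are invented, only three of the six entries are mocked, and the object is not produced by `prepareData` itself.

```r
# Hand-built mock with the layout described above: two contigs, two populations.
out <- list(
  freqs = list(c1 = matrix(c(0.10, 0.15), nrow = 1),
               c2 = matrix(c(0.25, 0.05, 0.20, 0.00), nrow = 2)),
  positions = list(c1 = 25, c2 = c(40, 90)),
  range = list(c1 = c(25, 25), c2 = c(40, 90))
)

# Number of SNPs per contig: one row per site in each frequency matrix.
snpsPerContig <- sapply(out$freqs, nrow)

# Each contig should have exactly one position per site.
stopifnot(all(snpsPerContig == lengths(out$positions)))

# Distance between the first and last SNP of each contig.
contigSpan <- sapply(out$range, diff)
```

Because every entry is a list indexed by contig, `lapply` or `sapply` over `out$freqs`, `out$positions` or `out$range` gives per-contig summaries without any explicit loop.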

Examples

# load the data from two rc files
data(rc1, rc2)
# combine both files into a single list
mydata <- list(rc1, rc2)

# clean and organize the data for both files
mydata <- lapply(mydata, function(i) cleanData(file = i, pops = 7:10))

# organize the information by contigs
prepareData(data = mydata, nPops = 4)

[Package poolABC version 1.0.0 Index]