R: Read in, check and preprocess the allele dataset

inputData {PolyPatEx}

R Documentation

Read in, check and preprocess the allele dataset

Description

Read in an allele dataset from file, and return a checked and preprocessed data frame.

Usage

inputData(file, numLoci, ploidy, dataType, dioecious, selfCompatible = NULL,
  mothersOnly = NULL, lociMin = 1, matMismatches = 0, skip = 0)

Arguments

`file`	character: the name of the allele data file.
`numLoci`	integer: the number of loci in the allele dataset.
`ploidy`	integer: the species' ploidy, one of `2`, `4`, `6`, or `8`.
`dataType`	character: either `"genotype"` or `"phenotype"`.
`dioecious`	logical: is the species dioecious (`TRUE`) or monoecious (`FALSE`)?
`selfCompatible`	logical: in monoecious species (`dioecious=FALSE`), can individuals self-fertilise? When `dioecious=TRUE`, this argument may be left at its default value of `NULL` - it will be set to `FALSE` by `preprocessData`.
`mothersOnly`	logical: in dioecious species, should females without progeny present be removed from the dataset? If `dioecious=TRUE`, then `mothersOnly` must be set to either `TRUE` or `FALSE`. If `dioecious=FALSE`, argument `mothersOnly` should be left at its default value of `NULL`.
`lociMin`	integer: the minimum number of loci in an individual that must have alleles present for the individual (and its progeny, if the individual is a mother) to be retained in the dataset. See the help for `preprocessData` for more on this parameter.
`matMismatches`	integer: a value between 0 and `numLoci`-1, being the maximum number of mismatching loci between mother and offspring that are allowed before the offspring is removed from the dataset. The default value is 0. If an offspring has no more than `matMismatches` loci that mismatch with its mother, the offending loci are set to contain no alleles.
`skip`	integer: the number of lines in the CSV to skip before the header row of the table.

Details

inputData reads in an allele dataset from the specified file, then calls preprocessData to perform a series of data format checks and preprocessing steps before returning the checked and preprocessed dataset as an R data frame. The reference information for preprocessData contains further information on the checks and preprocessing - it is strongly recommended you read that information in addition to the information below.

Using inputData is optional: you may create or load the allele dataset into R by other means. In that case, you must call preprocessData on the data frame before using any other analysis functions in this package. Similarly, if you change or manipulate the data frame contents within R, you should call preprocessData again on the data frame before using any of the PolyPatEx analysis functions. See the help for preprocessData for further details.

Note that inputData strips leading or trailing spaces (whitespace) from each entry in the allele dataset as it is read in. If you load your data by a means other than inputData, you should ensure that you perform this step yourself, as preprocessData will not carry out this necessary step.
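If you load the data yourself, the whitespace stripping described above can be sketched in base R as follows. This is a minimal illustration only, not the code that inputData uses internally; the file contents, the column names and the variable `adata` are made up for the example.

```r
## Hypothetical CSV content with stray spaces around some entries.
csvLines <- c("id,mother,Locus1_a,Locus1_b",
              "M1 ,,101, 103",
              " P1,M1 , 101,105 ")
tmpFile <- tempfile(fileext = ".csv")
writeLines(csvLines, tmpFile)

## Read every column as character, so that allele codes are not
## converted to numbers.
adata <- read.csv(tmpFile, colClasses = "character",
                  strip.white = FALSE)

## Strip leading and trailing whitespace from every entry.
adata[] <- lapply(adata, trimws)

print(adata)

## adata would next be passed to preprocessData (see its help page).
```

Alternatively, read.csv's own `strip.white = TRUE` argument removes the whitespace from unquoted fields as the file is read.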

Note also that you should not use spaces in any of your allele codes. PolyPatEx functions use spaces to separate allele codes as they process the data, so if allele codes already contain spaces, errors will occur in this processing. If you need a separator, I recommend using either ‘.’ (a period) or ‘_’ (an underscore) rather than a space.
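As a rough sketch of how embedded spaces might be removed from allele codes before the data reaches PolyPatEx, the base R snippet below replaces them with underscores. The allele codes shown are invented for illustration.

```r
## Hypothetical allele codes, some of which contain embedded spaces.
alleleCodes <- c("A 1", "B2", " C  3", "D_4")

## Trim the ends first, then replace each run of spaces inside a
## code with a single underscore.
cleanCodes <- gsub(" +", "_", trimws(alleleCodes))

print(cleanCodes)

## Confirm that no code contains a space any more.
stopifnot(!any(grepl(" ", cleanCodes, fixed = TRUE)))
```

Check afterwards that the renamed codes are still distinct from one another, since two different originals could in principle collapse to the same cleaned code.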

Neither inputData nor preprocessData will alter the CSV file from which the data is loaded - they merely return a checked and preprocessed version of your allele dataset (in the form of an R data frame) within the R environment, ready for use by other PolyPatEx functions.

To load the allele dataset into R, inputData calls R's read.csv function with certain arguments specified. These arguments make read.csv more stringent about the precise format of the input datafile, requiring in particular that each row of the CSV-formatted data file contain the correct number of commas. This is not always guaranteed when the CSV file has been exported from spreadsheet software. Should you get ‘Error in scan’ messages complaining about the number of elements in a line of the input file, consider calling fixCSV on the data file, before calling inputData again. fixCSV attempts to find and correct such errors in a CSV file - see the help for this function. Note that if you specify the skip parameter in a call to fixCSV, you should use the same value for this parameter in inputData to avoid an error.
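Before resorting to fixCSV, it can help to see which lines of the file have an unexpected number of fields. The sketch below uses base R's count.fields for this; it is a diagnostic aid under the assumption of a plain comma-separated file, and the example file contents are hypothetical.

```r
## Hypothetical CSV in which the last row is missing a trailing comma.
csvLines <- c("id,mother,Locus1_a,Locus1_b",
              "M1,,101,103",
              "P1,M1,101")
tmpFile <- tempfile(fileext = ".csv")
writeLines(csvLines, tmpFile)

## Count the comma-separated fields on each line of the file.
nFields <- count.fields(tmpFile, sep = ",", quote = "\"",
                        blank.lines.skip = FALSE)

## Lines whose field count differs from that of the header row.
badLines <- which(nFields != nFields[1])
print(badLines)
```

If your file has extra lines before the header row, pass the same `skip` value to count.fields that you intend to use in inputData, and remember that the reported line numbers are then relative to the first line read.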

The various PolyPatEx functions need to know the characteristics of the dataset being analysed - these are specified in the inputData or preprocessData calls and are invisibly attached to the allele data frame that is returned, for use by other PolyPatEx functions. The required characteristics are:

numLoci: the number of loci in the dataset.
ploidy: the ploidy p of the species (currently allowed to be 4, 6, or 8; ploidy can also be 2, provided dataType="genotype").
dataType: whether the data is genotypic (all p alleles at each locus are observed) or phenotypic (only the distinct allele states at a locus are observed - alleles that appear more than once in the genotype of a locus only appear once in the phenotype).
dioecious: whether the species is dioecious or monoecious.
selfCompatible: whether a monoecious species is self-compatible (i.e., whether an individual can fertilise itself).
mothersOnly: whether a dioecious dataset should retain only adult females that are mothers of progeny in the dataset. If dioecious=TRUE, then mothersOnly must be set to either TRUE or FALSE.
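The constraints among these characteristics can be summarised in a small check. The function `checkDatasetArgs` below is a hypothetical helper written for this illustration and is not part of PolyPatEx; it only mirrors the rules stated on this page, and the checks that preprocessData actually performs may differ.

```r
## Hypothetical helper, NOT part of PolyPatEx: returns a character
## vector describing any violated constraint (empty if none).
checkDatasetArgs <- function(ploidy, dataType, dioecious,
                             mothersOnly = NULL) {
  problems <- character(0)
  if (!ploidy %in% c(2, 4, 6, 8)) {
    problems <- c(problems, "ploidy must be one of 2, 4, 6 or 8")
  }
  if (!dataType %in% c("genotype", "phenotype")) {
    problems <- c(problems,
                  "dataType must be 'genotype' or 'phenotype'")
  }
  ## Diploid data is only accepted in genotype form.
  if (ploidy == 2 && dataType != "genotype") {
    problems <- c(problems, "ploidy 2 requires dataType='genotype'")
  }
  ## Dioecious datasets need an explicit TRUE/FALSE for mothersOnly.
  mothersOnlySet <- is.logical(mothersOnly) &&
    length(mothersOnly) == 1 && !is.na(mothersOnly)
  if (dioecious && !mothersOnlySet) {
    problems <- c(problems,
                  "dioecious=TRUE requires mothersOnly TRUE or FALSE")
  }
  problems
}

## A consistent combination: no problems reported.
print(checkDatasetArgs(ploidy = 4, dataType = "genotype",
                       dioecious = TRUE, mothersOnly = TRUE))

## An inconsistent combination: diploid phenotype data, and
## mothersOnly left at NULL for a dioecious species.
print(checkDatasetArgs(ploidy = 2, dataType = "phenotype",
                       dioecious = TRUE))
```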

Value

A data frame, containing the checked and pre-processed allele data, ready for further analysis by other PolyPatEx functions. All columns in the output data frame will be of mode character.

Author(s)

Alexander Zwart (alec.zwart at csiro.au)

Examples

## Not run: 

## Obtain path to the example genotype data file
## 'FR_Genotype.csv'
gDataFile <- system.file("extdata/FR_Genotype.csv",
                         package="PolyPatEx")
print(gDataFile)

gData <- inputData(gDataFile,
                   numLoci=7,
                   ploidy=4,
                   dataType="genotype",
                   dioecious=TRUE,
                   mothersOnly=TRUE)

## ...or use 'mothersOnly=FALSE' if you wish to retain
## non-maternal females in the dataset.

## gData now contains the checked and preprocessed allele dataset,
## ready to be passed to other PolyPatEx analysis functions.

## In your own workflow, you would typically specify the path to
## your allele dataset directly - e.g. if the dataset
## myAlleleData.csv is in the Data subdirectory of the current R
## working directory (see R function setwd()), then:
##
## gData <- inputData("Data/myAlleleData.csv",
##                    numLoci= etc etc etc...,


## End(Not run)

[Package PolyPatEx version 0.9.2 Index]