genDataPreprocess {Haplin} | R Documentation |
Pre-processing of the genetic data
Description
This function prepares the data for use in a Haplin analysis.
Usage
genDataPreprocess(
  data.in = stop("You have to give the object to preprocess!"),
  map.file,
  map.header = FALSE,
  design = "triad",
  file.out = "data_preprocessed",
  dir.out = ".",
  ncpu = 1,
  overwrite = NULL
)
Arguments
data.in |
Input data, as loaded by genDataRead or genDataLoad. |
map.file |
Filename (with path if the file is not in the current directory) of the .map file holding the SNP names, if available. |
map.header |
Logical: does the map.file contain a header in the first row? Default: FALSE. |
design |
The design used in the study, given as a character string. Default: "triad". |
file.out |
The core name of the files that will contain the preprocessed data (character string); these files can be loaded later with the genDataLoad function. Default: "data_preprocessed". |
dir.out |
The directory that will contain the saved data; defaults to current working directory. |
ncpu |
The number of CPU cores to use; using more cores speeds up the process significantly for large datasets. Default is 1 core; the maximum is 1 less than the total number of cores available on the current machine (even if the number given by the user is larger than that). |
overwrite |
Whether to overwrite the output files: if NULL (default), the user is prompted for an answer; if TRUE, any existing files are overwritten automatically; if FALSE, the function stops if the output files exist. |
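The cap on ncpu described above can be illustrated with a small base R sketch. The helper cap_ncpu and its total.cores argument are hypothetical names introduced here for illustration; this is not Haplin's internal code, and the floor of one core on single-core machines is an assumption.

```r
# Hypothetical helper illustrating the cap described for 'ncpu';
# this is not Haplin's own implementation.
cap_ncpu <- function(ncpu, total.cores = parallel::detectCores()) {
  # detectCores() may return NA on some platforms; fall back to one core.
  if (is.na(total.cores)) return(1L)
  # Leave one core free, but never go below one core (assumption).
  max.allowed <- max(1L, total.cores - 1L)
  as.integer(min(ncpu, max.allowed))
}

cap_ncpu(2, total.cores = 8)   # request fits under the cap
cap_ncpu(16, total.cores = 8)  # request is reduced to 7
cap_ncpu(4, total.cores = 1)   # stays at one core
```

Passing total.cores explicitly, as in the calls above, makes the rule easy to check without depending on the machine the code runs on.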
Value
A list object with three elements:
- cov.data - a data.frame with covariate data (if available in the input file);
- gen.data - a list with chunks of the genetic data; the data is divided column-wise, using 10,000 columns per chunk; each element of this list is an ff matrix;
- aux - a list with meta-data and important parameters:
  - variables - tabulated information of the covariate data;
  - variables.nas - number of NA values in each column of the covariate data;
  - alleles - all the possible alleles in each marker;
  - alleles.nas - number of NA values in each marker;
  - nrows.with.missing - number of rows that contain any missing allele information;
  - which.rows.with.missing - vector of indices of rows with missing data (if any).
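To make the column-wise chunking of gen.data concrete, here is a minimal base R sketch of how column indices might be grouped into blocks of 10,000. The helper chunk_columns is a hypothetical name, and the sketch only mirrors the chunk size stated above; it does not use ff and is not how Haplin necessarily does it internally.

```r
# Hypothetical helper: group column indices into consecutive chunks.
# Only the chunk size (10,000) is taken from the documentation above.
chunk_columns <- function(n.cols, chunk.size = 10000) {
  cols <- seq_len(n.cols)
  # ceiling(cols / chunk.size) labels each column with its chunk number.
  unname(split(cols, ceiling(cols / chunk.size)))
}

chunks <- chunk_columns(25000)
length(chunks)   # 3
lengths(chunks)  # 10000 10000 5000
```

Under this scheme a dataset with 25,000 genotype columns would occupy three list elements, with the last one holding the remaining 5,000 columns.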
Details
The .map file should contain at least two columns, where the second one contains the SNP names. Any additional columns should be separated by a whitespace character, but will be ignored. Whether the first row is treated as a header is controlled by the map.header argument (default: FALSE).
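The expected .map layout can be sketched in base R as follows. The helper read_snp_names and the toy file contents are hypothetical and serve only to illustrate the format (whitespace-separated columns, SNP names in the second column); genDataPreprocess reads the file itself when map.file is supplied.

```r
# Hypothetical helper, not part of Haplin: extract SNP names from a .map file.
read_snp_names <- function(map.file, map.header = FALSE) {
  # read.table() splits on any whitespace by default.
  map <- read.table(map.file, header = map.header, stringsAsFactors = FALSE)
  if (ncol(map) < 2) stop("The .map file needs at least two columns.")
  # Only the second column is used; any further columns are ignored.
  as.character(map[[2]])
}

# Toy .map file without a header row (made-up values).
toy.map <- tempfile(fileext = ".map")
writeLines(c("1 rs0001 0 1001", "1 rs0002 0 2002", "2 rs0003 0 3003"), toy.map)

snp.names <- read_snp_names(toy.map, map.header = FALSE)
snp.names
```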
Examples
# Note: the argument 'overwrite' is set to TRUE in all calls below,
# so existing output files are replaced without prompting.
library( Haplin )

# First, read the data:
examples.dir <- system.file( "extdata", package = "Haplin" )
example.file <- file.path( examples.dir, "exmpl_data.ped" )
ped.data.read <- genDataRead( example.file, file.out = "exmpl_ped_data",
  dir.out = tempdir( check = TRUE ), format = "ped", overwrite = TRUE )
ped.data.read

# Take only part of the data (if needed):
ped.data.part <- genDataGetPart( ped.data.read, design = "triad", markers = 10:12,
  dir.out = tempdir( check = TRUE ), file.out = "exmpl_ped_data_part", overwrite = TRUE )

# Preprocess as "triad" data:
ped.data.preproc <- genDataPreprocess( ped.data.part, design = "triad",
  dir.out = tempdir( check = TRUE ), file.out = "exmpl_data_preproc", overwrite = TRUE )
ped.data.preproc