assign.kfold {assignPOP}    R Documentation

Population assignment test using K-fold cross-validation

Description

This function performs population assignment tests using K-fold cross-validation. The results help estimate the membership probability of every individual. It accepts genetic-only data [object returned from read.genpop() or reduce.allele()], integrated data [object returned from compile.data()], or non-genetic data [R data frame with header] as input, and writes the results to text files. Several built-in options are provided; see the arguments below for details.

Usage

assign.kfold(
  x,
  k.fold = c(3, 4, 5),
  train.loci = c(0.1, 0.25, 0.5, 1),
  loci.sample = "fst",
  dir = NULL,
  scaled = FALSE,
  pca.method = "mixed",
  pca.PCs = "kaiser-guttman",
  pca.loadings = F,
  model = "svm",
  svm.kernel = "linear",
  svm.cost = 1,
  ntree = 50,
  processors = 999,
  multiprocess = TRUE,
  skipQ = FALSE,
  ...
)

Arguments

x

An input object, which should be the list returned from the function read.genpop(), reduce.allele(), or compile.data(). It can also be a data frame (with column names) returned from read.csv() or read.table() if you're analyzing non-genetic data, such as morphometric or chemistry data. The non-genetic data frame should have sample IDs in the first column and population labels in the last column.

k.fold

The number of groups (folds) into which the individuals of each population are divided. Use a numeric vector to specify multiple values of K.
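
The idea of dividing each population into K groups can be illustrated with a minimal base-R sketch. The helper make_folds() and the population labels are hypothetical and only show the concept; the package's internal splitting code may differ.

```r
# Hypothetical helper: give every individual a fold number (1..k),
# assigned separately within each population.
make_folds <- function(pop, k) {
  fold <- integer(length(pop))
  for (p in unique(pop)) {
    idx <- which(pop == p)
    # Shuffle the individuals of this population
    idx <- idx[sample.int(length(idx))]
    # Deal them into k groups as evenly as possible
    fold[idx] <- rep_len(seq_len(k), length(idx))
  }
  fold
}

set.seed(1)
pop <- rep(c("A", "B", "C"), times = c(9, 12, 6))  # made-up population labels
fold <- make_folds(pop, k = 3)

# Each population contributes individuals to every fold
table(pop, fold)
```

In each round of cross-validation, one fold serves as the test set and the remaining folds as the training set.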

train.loci

The proportion (float between 0 and 1) of loci to be used as training data. Use a numeric vector to specify multiple sets of training loci. This argument will be ignored if you're analyzing non-genetic data.

loci.sample

Locus sampling method, "fst" or "random". If loci.sample="fst" (default) and train.loci=0.1, the 10 percent of loci with the highest Fst values are used as training loci. If loci.sample="random", a random 10 percent of loci are used as training loci instead. This argument will be ignored if you're analyzing non-genetic data.
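
The difference between the two sampling methods can be sketched in base R. The Fst values and locus names below are made up, and the rounding rule for the number of loci is an assumption; the package computes Fst from your data and may round differently.

```r
set.seed(42)
# Made-up per-locus Fst values for 20 hypothetical loci
fst <- setNames(runif(20), paste0("Locus", 1:20))

prop <- 0.1                             # corresponds to train.loci = 0.1
n_train <- round(length(fst) * prop)    # 2 loci here (rounding is an assumption)

# "fst": keep the loci with the highest Fst values
top_loci <- names(sort(fst, decreasing = TRUE))[seq_len(n_train)]

# "random": keep a random subset of the same size
random_loci <- sample(names(fst), n_train)

top_loci
random_loci
```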

dir

A character string to specify the folder name for saving output files. The name must end with a slash (e.g., dir="YourFolderName/"). If the slash is omitted, the files will be saved under your working directory.

scaled

A logical variable (TRUE or FALSE) to specify whether to center (set the mean of each feature to 0) and scale (set the standard deviation of each feature to 1) the entire dataset before performing PCA and cross-validation. Default is FALSE. Because genetic data has already been converted to numeric values between 0 and 1, scaling genetic data should not be critical. However, it is recommended to set scaled=TRUE when integrated data contains features measured on different scales.
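
As an illustration of what centering and scaling do, the following sketch uses base R's scale() on a small made-up table that mixes a measurement in millimetres with a value between 0 and 1. The column names are hypothetical, and this is not necessarily how the package performs the step internally.

```r
# Made-up features on very different scales
features <- data.frame(
  length_mm   = c(120, 135, 150, 165),
  allele_freq = c(0, 0.5, 0.5, 1)
)

# Center each column to mean 0 and scale it to standard deviation 1
scaled_features <- scale(features, center = TRUE, scale = TRUE)

colMeans(scaled_features)        # approximately 0 for every column
apply(scaled_features, 2, sd)    # 1 for every column
```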

pca.method

Either a character string ("mixed", "independent", or "original") or a logical variable (TRUE or FALSE) to specify how to perform PCA on non-genetic data (PCA is always performed on genetic data). The character strings are used when analyzing integrated (genetic plus non-genetic) data. If using "mixed" (default), PCA is performed across the genetic and non-genetic data together, so that each PC summarizes mixed variation of genetic and non-genetic data. If using "independent", PCA is performed separately on the non-genetic data, and the genetic PCs and non-genetic PCs are then used as new features. If using "original", the original non-genetic data and the genetic PCs are used as features. The logical variable is used when analyzing non-genetic data alone. If TRUE, PCA is performed on the training data and the resulting loadings are applied to the test data; the scores of the training and test data are then used as new features.
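
The step of fitting PCA on training data and applying its loadings to test data can be sketched with base R's prcomp() and predict(). The built-in iris measurements stand in for non-genetic data here, and the train/test split is arbitrary; the package's internal code may differ.

```r
set.seed(7)
dat <- iris[, 1:4]                     # stand-in for non-genetic features
test_idx <- sample(nrow(dat), 30)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Fit PCA on the training individuals only
pca <- prcomp(train, center = TRUE, scale. = TRUE)
train_scores <- pca$x

# Project the test individuals using the training centers, scales, and loadings
test_scores <- predict(pca, newdata = test)

# The same projection written out by hand
manual <- scale(test, center = pca$center, scale = pca$scale) %*% pca$rotation
all.equal(unname(test_scores), unname(manual))
```

Fitting the PCA on training data alone keeps information from the test individuals out of the features used to build the model.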

pca.PCs

A criterion ("kaiser-guttman", "broken-stick", or a number) for deciding how many PCs to retain. By default, the Kaiser-Guttman criterion is used, under which any PC with an eigenvalue greater than 1 is retained as a new variable/feature. Alternatively, set an integer to specify the number of PCs to be retained.
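
The Kaiser-Guttman rule can be illustrated in a few lines of base R. This sketch runs PCA on the standardized iris measurements (a stand-in dataset) and keeps the PCs whose eigenvalues exceed 1; it shows the rule itself, not the package's internal implementation.

```r
# PCA on standardized features, so eigenvalues average to 1
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the PCs
eigenvalues <- pca$sdev^2

# Kaiser-Guttman: retain PCs with an eigenvalue greater than 1
n_keep <- sum(eigenvalues > 1)
retained_scores <- pca$x[, seq_len(n_keep), drop = FALSE]

eigenvalues
n_keep
```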

pca.loadings

A logical variable (TRUE or FALSE) to determine whether the loadings of the training data are printed to output text files. Default is FALSE. If set to TRUE, the output files could become large.

model

A character string to specify which classifier to use for creating predictive models. The current options include "lda", "svm", "naiveBayes", "tree", and "randomForest".

svm.kernel

A character string to specify which kernel to use with the "svm" classifier. Default is "linear". Other options include "polynomial", "radial", and "sigmoid". See the R package e1071 for more details about SVM, or the guide at https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

svm.cost

A number to specify the cost parameter when using the "svm" classifier. Default is 1.

ntree

An integer to specify how many trees to build when using the "randomForest" classifier.

processors

The number of processors to be used for parallel running. By default, it uses N-1 processors, where N is the number of processors available on your computer.
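
To see what N-1 means on your machine, the following sketch uses detectCores() from the base parallel package. Whether assign.kfold() determines N in exactly this way is an assumption.

```r
library(parallel)

# Number of cores reported by the system (can be NA on some platforms)
n_cores <- detectCores()

# Leave one core free, but never go below one
n_use <- max(1, n_cores - 1, na.rm = TRUE)
n_use
```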

multiprocess

A logical variable to determine whether to run the analysis on multiple processors. Default is TRUE. If set to FALSE, the program runs on a single core only.

skipQ

A logical variable to determine whether to skip the interactive dialogue that is otherwise shown when analyzing non-genetic data. If set to TRUE, the dialogue is skipped and the default data types and original values of the non-genetic data are used.

...

Other arguments that are passed on to the chosen classifier.

Value

You don't need to specify a name for the returned object when using this function. It automatically outputs results in text files to your designated folder.
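
A possible call on non-genetic data is sketched below, using only the arguments documented above. The iris measurements stand in for morphometric data, the column names ID and pop and the output folder name are made up, and the call is skipped if assignPOP is not installed.

```r
# Non-genetic input: sample ID in the first column, population label in the last
morph <- data.frame(
  ID = paste0("ind", seq_len(nrow(iris))),
  iris[, 1:4],
  pop = iris$Species
)

# Output folder (hypothetical name); note the trailing slash required by 'dir'
out_dir <- paste0(tempdir(), "/kfold_demo/")

if (requireNamespace("assignPOP", quietly = TRUE)) {
  # No assignment to a name is needed; results are written to out_dir
  assignPOP::assign.kfold(
    morph,
    k.fold = c(3, 5),
    dir = out_dir,
    scaled = TRUE,
    pca.method = TRUE,      # logical form, used for non-genetic data alone
    model = "lda",
    multiprocess = FALSE,
    skipQ = TRUE            # skip the interactive dialogue
  )
  list.files(out_dir)
}
```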


[Package assignPOP version 1.3.0 Index]