locus_train {HaploCatcher}    R Documentation

Haplotype Prediction: Training Models for Use in Forward Prediction

Description

This function trains a model for use in forward prediction of lines which have no record.

Usage

locus_train(
  geno_mat,
  gene_file,
  gene_name,
  marker_info,
  chromosome,
  ncor_markers = 50,
  n_neighbors = 50,
  include_hets = FALSE,
  verbose = FALSE,
  set_seed = NULL,
  models_request = "all",
  graph = FALSE
)

Arguments

geno_mat

An imputed, number-coded, genotypic matrix which has n rows of individuals and m columns of markers. Row names of the matrix should be representative of genotypic IDs and column names should be representative of marker IDs. Missing data is not allowed. Numeric coding of genotypes can vary as long as it remains consistent among markers.
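The requirements above can be checked before training. The following is a minimal sketch in base R, assuming a hypothetical toy matrix (toy_geno, with made-up line and marker names and a 0/1/2 coding); it is not part of the package.

```r
# Hypothetical toy matrix: 4 individuals (rows) by 3 markers (columns).
# The names and the 0/1/2 coding are illustrative assumptions.
toy_geno <- matrix(c(0, 2, 2,
                     2, 2, 0,
                     0, 0, 2,
                     1, 2, 0),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("line_", 1:4),
                                   paste0("mrk_", 1:3)))

# Checks mirroring the description: numeric matrix, no missing data,
# row names (genotypic IDs) and column names (marker IDs) present.
geno_ok <- is.matrix(toy_geno) &&
  is.numeric(toy_geno) &&
  !anyNA(toy_geno) &&
  !is.null(rownames(toy_geno)) &&
  !is.null(colnames(toy_geno))
geno_ok
```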

gene_file

A dataframe containing at least three columns labeled as follows: 'Gene', 'FullSampleName', and 'Call'. The 'Gene' column contains the name of the gene to which the observation belongs. The 'FullSampleName' column contains the genotypic ID which corresponds exactly to the row name in the genotypic matrix. The 'Call' column contains the marker call which corresponds to the gene for that genotype. Other information may be present in this dataframe beyond these columns, but the three columns listed above are obligatory.
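As an illustration of the layout described above, here is a minimal base R sketch. The data frame toy_gene_file, the gene name, the call labels, and the vector geno_ids (standing in for the row names of the genotypic matrix) are all hypothetical.

```r
# Hypothetical gene file with the three obligatory columns.
# Gene name and call labels are placeholders, not real package data.
toy_gene_file <- data.frame(
  Gene = rep("gene_A", 3),
  FullSampleName = c("line_1", "line_2", "line_3"),
  Call = c("positive", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Stand-in for rownames(geno_mat)
geno_ids <- c("line_1", "line_2", "line_3", "line_4")

# Obligatory columns that are absent (should be empty)
required_cols <- c("Gene", "FullSampleName", "Call")
missing_cols <- setdiff(required_cols, colnames(toy_gene_file))

# Genotypic IDs with no exact match in the matrix (should be empty)
unmatched_ids <- setdiff(toy_gene_file$FullSampleName, geno_ids)
```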

gene_name

A character string which matches the name of the gene for which you are trying to train a model. This character string must be present in your gene_file 'Gene' column.

marker_info

A dataframe containing the following three columns: 'Marker', 'Chromosome', and 'BP_Position'. The 'Marker' column contains the names of the markers which are present in the genotypic matrix. The 'Chromosome' column contains the corresponding chromosome/linkage group to which the marker belongs. The 'BP_Position' column contains the physical or centimorgan position of the marker. All markers present in the genotypic matrix must be listed in this dataframe. If physical or centimorgan positions are unavailable for the listed markers, a numeric dummy variable ranging from one to n number of markers may be provided instead.
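When no positions are available, the dummy variable described above could be built as in the following base R sketch. The marker names, the chromosome label, and the object names (marker_ids, toy_marker_info) are hypothetical; the column names follow the 'Marker', 'Chromosome', 'BP_Position' listing.

```r
# Stand-in for colnames(geno_mat); names are illustrative only
marker_ids <- paste0("mrk_", 1:5)

# Hypothetical marker information with a dummy position running 1..n
toy_marker_info <- data.frame(
  Marker = marker_ids,
  Chromosome = "3B",
  BP_Position = seq_along(marker_ids),
  stringsAsFactors = FALSE
)

# Every marker in the matrix must be listed in the data frame
all_listed <- all(marker_ids %in% toy_marker_info$Marker)
```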

chromosome

A character string which matches the name of the chromosome upon which the gene resides. This chromosome name must be present in the marker_info file.

ncor_markers

A numeric variable which represents the number of markers the user wants to use in model training. Correlation among markers to the gene call is calculated and the top n markers specified are retained for training. The default setting is 50 markers.
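The idea of retaining the top n correlated markers can be sketched in base R as follows. This is an illustration only: whether the package ranks by absolute or signed correlation, and how it codes the call numerically, are assumptions here, and all object names and values are hypothetical.

```r
# Hypothetical genotypes: 6 individuals by 4 markers, coded 0/2
toy_geno <- cbind(
  mrk_1 = c(0, 0, 2, 2, 0, 2),
  mrk_2 = c(2, 0, 2, 0, 2, 0),
  mrk_3 = c(0, 2, 0, 2, 2, 2),
  mrk_4 = c(2, 2, 0, 0, 2, 2)
)
rownames(toy_geno) <- paste0("line_", 1:6)

# Hypothetical numeric-coded call, built to track mrk_1 exactly
toy_call <- c(0, 0, 1, 1, 0, 1)

# Absolute Pearson correlation of each marker with the call
# (ranking by absolute value is an assumption of this sketch)
marker_cor <- abs(cor(toy_geno, toy_call))[, 1]

# Keep the top_n most correlated markers (analogous to ncor_markers)
top_n <- 2
kept <- names(sort(marker_cor, decreasing = TRUE))[seq_len(top_n)]
kept
```

Here mrk_1 ranks first by construction, since the toy call was copied from it.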

n_neighbors

A numeric variable which represents the number of neighbors to use in KNN. Default is 50.

include_hets

A logical variable which determines if the user wishes to include heterozygous calls or not. The default setting is FALSE.

verbose

A logical variable which determines if the user wants text feedback. Default setting is FALSE.

set_seed

A numeric variable that is used to set a seed for reproducible results if the user is running the function once for use in the "locus_pred" function. If the user wishes to run the function many times with a random seed and decide the outcome by voting, use the function "locus_voting" instead. The default setting is NULL.

models_request

A character string which defines which models are to be run. K-nearest neighbors is abbreviated as "knn" and random forest is "rf". If both models are desired, use the text string "all". Default setting is "all".

graph

A logical variable which determines if the user wants graphs displayed. Default setting is FALSE.

Value

This function returns a list of lists which contains: "seed", "models_request", "models", and "data". The "seed" object is the seed set by the user. If no seed was provided, this will appear as the character string "no_seed_set". The "models_request" object holds the models requested. The "models" object contains the trained models. The "data" object contains the data used to train the models.

Examples


#set seed for reproducible sampling
set.seed(022294)

#read in the genotypic data matrix
data("geno_mat")

#read in the marker information
data("marker_info")

#read in the gene compendium file
data("gene_comp")

#Note: in practice you would have something like a gene file
#that does not contain any lines you are trying to predict.
#However, this is for illustrative purposes on how to run the function

#sample data in the gene_comp file to make a training population
train<-gene_comp[gene_comp$FullSampleName %in%
                   sample(gene_comp$FullSampleName,
                          round(length(gene_comp$FullSampleName)*0.8,0)),]

#pull vector of names, not in the train, for forward prediction
test<-gene_comp[!gene_comp$FullSampleName
                %in% train$FullSampleName,
                "FullSampleName"]

#run the function without hets
fit<-locus_train(geno_mat=geno_mat, #the genotypic matrix
                 gene_file=train, #the gene compendium file
                 gene_name="sst1_solid_stem", #the name of the gene
                 marker_info=marker_info, #the marker information file
                 chromosome="3B", #name of the chromosome
                 ncor_markers=2, #number of markers to retain
                 n_neighbors=3, #number of neighbors
                 include_hets=FALSE, #whether to include hets in the model
                 verbose = FALSE, #allows for text output
                 set_seed = 022294, #sets a seed for reproduction of results
                 models_request = "knn") #sets what models are requested

#predict the lines in the test population
pred<-locus_pred(locus_train_results=fit,
                 geno_mat=geno_mat,
                 genotypes_to_predict=test)

#see predictions
head(pred)


[Package HaploCatcher version 1.0.4 Index]