R: Construction of prototype alleles

ghap.anctrain {GHap}

R Documentation

Construction of prototype alleles

Description

This function builds prototype alleles to be used in ancestry predictions.

Usage

 ghap.anctrain(object, train = NULL,
               method = "unsupervised",
               K = 2, iter.max = 10, nstart = 10,
               nmarkers = 5000, tune = FALSE,
               only.active.samples = TRUE,
               only.active.markers = TRUE,
               batchsize = NULL, ncores = 1,
               verbose = TRUE)

Arguments

The following arguments are used by both the 'supervised' and 'unsupervised' methods:

`object`	A GHap.phase object.
`train`	Character vector of individuals to use as reference samples. All active individuals are used if this vector is not provided.
`method`	Character value indicating which method to use: 'supervised' or 'unsupervised' (default).
`only.active.samples`	A logical value specifying whether only active samples should be included in predictions (default = TRUE).
`only.active.markers`	A logical value specifying whether only active markers should be used for predictions (default = TRUE).
`batchsize`	A numeric value controlling the number of markers to be processed at a time (default = nmarkers/10).
`ncores`	A numeric value specifying the number of cores to be used in parallel computing (default = 1).
`verbose`	A logical value specifying whether log messages should be printed (default = TRUE).

The following arguments are only used by the 'unsupervised' method:

`K`	A numeric value specifying the number of clusters in K-means (default = 2). Proxy for the number of ancestral populations.
`iter.max`	A numeric value specifying the maximum number of iterations of the K-means clustering (default = 10).
`nstart`	A numeric value specifying the number of independent runs of the K-means clustering (default = 10).
`nmarkers`	A numeric value specifying the number of seeding markers to be used by the K-means clustering (default = 5000).
`tune`	A logical value specifying whether a Best K analysis should be performed (default = FALSE).

Details

This function builds prototype alleles (i.e., cluster centroids, representing lineage-specific allele frequencies) through two methods:

The 'unsupervised' method uses the K-means clustering algorithm to group haplotypes into K pseudo-lineages. A random sample of seeding markers (default nmarkers = 5000) is used to group all 2*nsamples haplotypes into a user-specified number of clusters (default K = 2). Then, for each interrogated block, prototype alleles are built for every cluster as the arithmetic mean of the observed haplotypes initially assigned to that cluster. If train = NULL, the function uses all active haplotypes to build prototype alleles. If the data set is severely unbalanced (e.g., one population with a large number of individuals and others with few individuals), it is recommended to provide a vector of individual names via the train argument so that prototype alleles are built from a more balanced subset of the data.
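
The clustering step described above can be sketched with base R alone. This is only an illustration on simulated 0/1 haplotypes, not the internal code of ghap.anctrain: the matrix haps, the two simulated lineages and the number of seeding markers (50) are assumptions made for the example.

```r
# Toy illustration only: simulated data, not GHap internals.
set.seed(1)
nhap <- 40    # number of haplotypes (2 * nsamples)
nmrk <- 200   # number of markers

# Two simulated lineages with contrasting allele frequencies
freqA <- runif(nmrk, 0.05, 0.35)
freqB <- runif(nmrk, 0.65, 0.95)
lineage <- rep(c("A", "B"), each = nhap / 2)

# Rows are haplotypes, columns are markers (alleles coded 0/1)
haps <- rbind(
  matrix(rbinom(nhap / 2 * nmrk, 1, rep(freqA, each = nhap / 2)),
         nrow = nhap / 2),
  matrix(rbinom(nhap / 2 * nmrk, 1, rep(freqB, each = nhap / 2)),
         nrow = nhap / 2)
)

# Cluster haplotypes using a random sample of seeding markers
seed.markers <- sample(nmrk, 50)
km <- kmeans(haps[, seed.markers], centers = 2,
             iter.max = 10, nstart = 10)

# Prototype alleles: arithmetic mean of the haplotypes assigned
# to each cluster, computed at every marker
prototypes <- sapply(1:2, function(k) {
  colMeans(haps[km$cluster == k, , drop = FALSE])
})
dim(prototypes)   # markers x clusters
```

Each column of prototypes holds the allele frequencies of one pseudo-lineage. In this toy case the clusters are balanced; with unbalanced data the centroids tend to be pulled towards the larger group, which is why a balanced train vector is recommended.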

The 'supervised' method works in a similar way, but skips the K-means algorithm and uses population labels present in the GHap.phase object as clusters.
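
A comparable sketch for the 'supervised' case replaces the K-means assignments with known labels. Again this is a toy example under stated assumptions: the label vector pop, the marker names and the column name MARKER are hypothetical, and the haplotypes are simulated.

```r
# Toy illustration only: labels and data are simulated.
set.seed(2)
nmrk <- 100

# Hypothetical population label for each haplotype
pop <- rep(c("Pure1", "Pure2"), times = c(10, 6))

# Rows are haplotypes, columns are markers (alleles coded 0/1)
haps <- matrix(rbinom(length(pop) * nmrk, 1, 0.5),
               nrow = length(pop))

# Prototype alleles: mean haplotype within each labelled population
protomat <- sapply(split(seq_along(pop), pop), function(idx) {
  colMeans(haps[idx, , drop = FALSE])
})

# Data frame with marker names first, then one column per population
prototypes <- data.frame(MARKER = paste0("M", seq_len(nmrk)),
                         protomat)
head(prototypes)
```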

Value

The function returns a data frame with the first column giving marker names and the remaining columns containing prototype alleles for each pseudo-lineage. If method 'unsupervised' is run with tune = TRUE, the function instead returns a list with the following elements:

`ssq`	Within-cluster sum of squares for each value of K.
`chindex`	Calinski-Harabasz index for consecutive values of K.
`pchange`	Percent change in ssq for consecutive values of K.
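
To show how quantities of this kind relate to one another, the sketch below computes them with base R on simulated data. The formulas are the textbook definitions and are an assumption here; the exact calculations and the range of K values used by ghap.anctrain may differ.

```r
# Toy illustration only: textbook definitions on simulated data.
set.seed(3)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 10),
           matrix(rnorm(200, mean = 3), ncol = 10))
n <- nrow(x)
Kmax <- 5
K <- 2:Kmax

# One K-means fit for each K from 2 to Kmax
fits <- lapply(K, function(k) {
  kmeans(x, centers = k, iter.max = 10, nstart = 10)
})

# Within-cluster sum of squares; for K = 1 it equals the total
# sum of squares around the overall mean
ssq <- c(fits[[1]]$totss, sapply(fits, function(f) f$tot.withinss))

# Calinski-Harabasz index (defined for K >= 2):
# (between SS / (K - 1)) / (within SS / (n - K))
bss <- sapply(fits, function(f) f$betweenss)
chindex <- (bss / (K - 1)) / (ssq[K] / (n - K))

# Percent change in ssq between consecutive values of K
pchange <- 100 * diff(ssq) / ssq[-Kmax]
```

A sharp drop in ssq followed by a plateau, or a peak in chindex, is commonly taken as a hint for the number of clusters to retain.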

Author(s)

Yuri Tani Utsunomiya <ytutsunomiya@gmail.com>

References

Y.T. Utsunomiya et al. Unsupervised detection of ancestry tracks with the GHap R package. Methods in Ecology and Evolution. 2020. 11:1448–54.

Examples


# #### DO NOT RUN IF NOT NECESSARY ###
# 
# # Copy phase data in the current working directory
# exfiles <- ghap.makefile(dataset = "example",
#                          format = "phase",
#                          verbose = TRUE)
# file.copy(from = exfiles, to = "./")
# 
# # Load phase data
# 
# phase <- ghap.loadphase("example")
# 
# ### RUN ###
# 
# # Calculate marker density
# mrkdist <- diff(phase$bp)
# mrkdist <- mrkdist[which(mrkdist > 0)]
# density <- mean(mrkdist)
# 
# # Generate blocks for admixture events up to g = 10 generations in the past
# # Assuming mean block size in Morgans of 1/(2*g)
# # Approximating 1 Morgan ~ 100 Mbp
# g <- 10
# window <- (100e+6)/(2*g)
# window <- ceiling(window/density)
# step <- ceiling(window/4)
# blocks <- ghap.blockgen(phase, windowsize = window,
#                         slide = step, unit = "marker")
# 
# # BestK analysis
# bestK <- ghap.anctrain(object = phase, K = 5, tune = TRUE)
# plot(bestK$ssq, type = "b", xlab = "K",
#      ylab = "Within-cluster sum of squares")
# 
# # Unsupervised analysis with best K
# prototypes <- ghap.anctrain(object = phase, K = 2)
# hapadmix <- ghap.anctest(object = phase,
#                          blocks = blocks,
#                          prototypes = prototypes,
#                          test = unique(phase$id))
# anctracks <- ghap.ancsmooth(object = phase, admix = hapadmix)
# ghap.ancplot(ancsmooth = anctracks)
# 
# # Supervised analysis
# train <- unique(phase$id[which(phase$pop != "Cross")])
# prototypes <- ghap.anctrain(object = phase, train = train,
#                             method = "supervised")
# hapadmix <- ghap.anctest(object = phase,
#                          blocks = blocks,
#                          prototypes = prototypes,
#                          test = unique(phase$id))
# anctracks <- ghap.ancsmooth(object = phase, admix = hapadmix)
# ghap.ancplot(ancsmooth = anctracks)

Package GHap version 3.0.0