estploidy {gbs2ploidy} | R Documentation |
Discriminate cytotypes using GBS data
Description
Use principal component analysis (PCA) and discriminant analysis (DA) to assign individuals to different cytotype groups based on GBS data. This function can be run with or without a training set of individuals with known cytotypes.
Usage
estploidy(alphas = NA, het = NA, depth = NA, train = FALSE, pl = NA, set = NA,
nclasses = 2, ids = NA, pcs = 1:2)
Arguments
alphas |
a list generated by the |
het |
a numeric vector with the observed heterozygosity for each individual, that is the proportion of SNPs at which the individual was heterozygous. |
depth |
a numeric vector with the mean number of reads per SNP for each individual (all SNPs or just heterozygous SNPs). |
train |
a boolean specifying whether or not a training set with known ploidy should be used. |
pl |
a vector of known ploidies with one entry per individual (use ‘NA’ for individuals with unknown ploidy); only used if |
set |
indixes for the training set; only used if |
nclasses |
the number of cyotypes expected. |
ids |
names or other IDs for individuals. |
pcs |
a vector giving the PC to use for DA. |
Details
Assignment probabilities to different cytotype groups are obtained using PCA and DA, as described in Gompert & Mock (XXXX). Residual hetorzygosity (from regressing het
on depth
) and point estimates (posterior medians) for allelic proportions are first ordinated using PCA (on the centered covariance matrix). The first pcs
PCs are retained for DA. If a training set is not provided, k-means clustering is used to obtain initial classifications (with nclasses
groups), which are then used for leave-one-out cross-validation with DA. In this case, PC loadings and discriminant scores should be evaluated to associate groups from k-means clustering with specific cytotypes (Gompert & Mock, XXXX). If a training set is provided, it is used to define groups for DA and to estimate assignment probabilities without k-means clustering.
Value
estploidy
returns a list with three components:
pp |
A matrix with assignment probabilities for each individual (rows) to each group (columns); the first column gives the |
pcwghts |
A matrix with the variable loadings (PC weights) from the ordination of residual heterozygosity and allelic proportions. Columns correspond with PCs in ascending order (i.e., the PC with the largest eigenvalue is first). |
pcscrs |
A matrix of PC scores from the ordination of residual heterozygosity and allelic proportions. Columns correspond with PCs in ascending order (i.e., the PC with the largest eigenvalue is first). |
Author(s)
Zachariah Gompert
References
Gompert Z. and Mock K. (XXXX) Detection of individual ploidy levels with genotyping-by-sequencing (GBS) analysis. Molecular Ecology Resources, submitted.
See Also
Examples
## load a simulated data set
data(dat)
## Not run:
## obtain posterior estimates of allelic proportions; short chains are used for
## the example, we recommend increasing this to at least 1000 MCMC steps with a
## 500 step burnin
props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5,
mcmc.thin=1)
## calculate observed heterozygosity and depth of coverage from the allele count
## data
hx<-apply(is.na(dat[[1]]+dat[[2]])==FALSE,1,mean)
dx<-apply(dat[[1]]+dat[[2]],1,mean,na.rm=TRUE)
## run estploidy without using known ploidy data
pl<-estploidy(alphas=props,het=hx,depth=dx,train=FALSE,pl=NA,set=NA,nclasses=2,
ids=dat[[3]],pcs=1:2)
## boxplots to visualize posterior assignment probabilities by true ploidy
## (which is known because these are simulated data)
boxplot(pl$pp[,1] ~ dat[[3]],ylab="assignment probability",xlab="ploidy")
## run estploidy with a training data set with known ploidy; the data set is
## split into 100 individuals with known ploidy and 100 that are used for
## inference
truep<-dat[[3]]
trn<-sort(sample(1:200,100,replace=FALSE))
truep[-trn]<-NA
plt<-estploidy(alphas=props,het=hx,depth=dx,train=TRUE,pl=truep,set=trn,
nclasses=2,ids=dat[[3]],pcs=1:2)
## boxplots to visualize posterior assignment probabilities for individuals that
## were not part of the training set by true ploidy (which is known because
## these are simulated data)
boxplot(plt$pp[,1] ~ dat[[3]][-trn],ylab="assignment probability",xlab="ploidy")
## End(Not run)