R: Discriminate cytotypes using GBS data

estploidy {gbs2ploidy}

R Documentation

Discriminate cytotypes using GBS data

Description

Use principal component analysis (PCA) and discriminant analysis (DA) to assign individuals to different cytotype groups based on GBS data. This function can be run with or without a training set of individuals with known cytotypes.

Usage

estploidy(alphas = NA, het = NA, depth = NA, train = FALSE, pl = NA, set = NA, 
    nclasses = 2, ids = NA, pcs = 1:2)

Arguments

`alphas`	a list generated by the `estprops` function that contains posterior estimates of allelic proportions.
`het`	a numeric vector with the observed heterozygosity for each individual, that is the proportion of SNPs at which the individual was heterozygous.
`depth`	a numeric vector with the mean number of reads per SNP for each individual (all SNPs or just heterozygous SNPs).
`train`	a boolean specifying whether or not a training set with known ploidy should be used.
`pl`	a vector of known ploidies with one entry per individual (use ‘NA’ for individuals with unknown ploidy); only used if `train == TRUE`.
`set`	indixes for the training set; only used if `train == TRUE`.
`nclasses`	the number of cyotypes expected.
`ids`	names or other IDs for individuals.
`pcs`	a vector giving the PC to use for DA.

Details

Assignment probabilities to different cytotype groups are obtained using PCA and DA, as described in Gompert & Mock (XXXX). Residual hetorzygosity (from regressing het on depth) and point estimates (posterior medians) for allelic proportions are first ordinated using PCA (on the centered covariance matrix). The first pcs PCs are retained for DA. If a training set is not provided, k-means clustering is used to obtain initial classifications (with nclasses groups), which are then used for leave-one-out cross-validation with DA. In this case, PC loadings and discriminant scores should be evaluated to associate groups from k-means clustering with specific cytotypes (Gompert & Mock, XXXX). If a training set is provided, it is used to define groups for DA and to estimate assignment probabilities without k-means clustering.

Value

estploidy returns a list with three components:

`pp`	A matrix with assignment probabilities for each individual (rows) to each group (columns); the first column gives the `ids` provided by the user. Only individuals that were not part of the training set are included.
`pcwghts`	A matrix with the variable loadings (PC weights) from the ordination of residual heterozygosity and allelic proportions. Columns correspond with PCs in ascending order (i.e., the PC with the largest eigenvalue is first).
`pcscrs`	A matrix of PC scores from the ordination of residual heterozygosity and allelic proportions. Columns correspond with PCs in ascending order (i.e., the PC with the largest eigenvalue is first).

Author(s)

Zachariah Gompert

References

Gompert Z. and Mock K. (XXXX) Detection of individual ploidy levels with genotyping-by-sequencing (GBS) analysis. Molecular Ecology Resources, submitted.

Examples

## load a simulated data set
data(dat)
## Not run: 
## obtain posterior estimates of allelic proportions; short chains are used for 
## the example, we recommend increasing this to at least 1000 MCMC steps with a
## 500 step burnin
props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5,
    mcmc.thin=1)

## calculate observed heterozygosity and depth of coverage from the allele count
## data
hx<-apply(is.na(dat[[1]]+dat[[2]])==FALSE,1,mean)
dx<-apply(dat[[1]]+dat[[2]],1,mean,na.rm=TRUE)

## run estploidy without using known ploidy data
pl<-estploidy(alphas=props,het=hx,depth=dx,train=FALSE,pl=NA,set=NA,nclasses=2,
    ids=dat[[3]],pcs=1:2)

## boxplots to visualize posterior assignment probabilities by true ploidy 
## (which is known because these are simulated data)
boxplot(pl$pp[,1] ~ dat[[3]],ylab="assignment probability",xlab="ploidy")

## run estploidy with a training data set with known ploidy; the data set is 
## split into 100 individuals with known ploidy and 100 that are used for 
## inference
truep<-dat[[3]]
trn<-sort(sample(1:200,100,replace=FALSE))
truep[-trn]<-NA
plt<-estploidy(alphas=props,het=hx,depth=dx,train=TRUE,pl=truep,set=trn,
    nclasses=2,ids=dat[[3]],pcs=1:2)

## boxplots to visualize posterior assignment probabilities for individuals that
## were not part of the training set by true ploidy (which is known because 
## these are simulated data)
boxplot(plt$pp[,1] ~ dat[[3]][-trn],ylab="assignment probability",xlab="ploidy")

## End(Not run)

[Package gbs2ploidy version 1.0 Index]