amCluster {allelematch}

R Documentation

Clustering of multilocus genotypes

Description

Performs clustering of multilocus genotypes to identify unique consensus and singleton genotypes, and generates analysis output as formatted text, HTML, or CSV. These functions are usually called by amUnique. This interface remains available to give a better understanding of how amUnique operates. See the examples for more information.

There are three steps to this analysis: (1) identify the dissimilarity between pairs of genotypes using a metric which takes missing data into account, (2) cluster this dissimilarity matrix using a standard hierarchical agglomerative clustering approach, and (3) use a dynamic tree cutting approach to identify clusters.
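The three steps can be sketched in base R on a toy dataset. This is an illustration under stated assumptions, not the package's implementation: the helper `pair_dissimilarity` and its half-mismatch treatment of missing alleles are hypothetical stand-ins for the metric in amMatrix, and a fixed-height `cutree` stands in for the dynamic tree cutting of step 3.

```r
# Toy genotypes: rows are samples, columns are alleles; NA marks missing data.
geno <- rbind(
  s1 = c(101, 103, 200, 204, 150, 152),
  s2 = c(101, 103, 200, 204, 150, NA),   # s1 with one allele missing
  s3 = c(101, 103, 200, 204, 150, 154),  # s1 with one allele mismatching
  s4 = c(107, 109, 210, 212, 160, 166)   # a clearly different genotype
)

# Step 1: pairwise dissimilarity as the share of allele columns that mismatch.
# ASSUMPTION: a missing allele counts as half a mismatch (hypothetical rule,
# not the metric implemented in amMatrix).
pair_dissimilarity <- function(a, b) {
  missing <- is.na(a) | is.na(b)
  mismatch <- !missing & a != b
  (sum(mismatch) + 0.5 * sum(missing)) / length(a)
}

n <- nrow(geno)
d <- matrix(0, n, n, dimnames = list(rownames(geno), rownames(geno)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    d[i, j] <- d[j, i] <- pair_dissimilarity(geno[i, ], geno[j, ])
  }
}

# Step 2: standard hierarchical agglomerative clustering (complete linkage).
tree <- hclust(as.dist(d), method = "complete")

# Step 3: cut the tree. ASSUMPTION: a fixed-height cut replaces the dynamic
# (hybrid) tree cut used by the package.
clusters <- cutree(tree, h = 0.3)
print(clusters)
```

In this toy input, s1, s2, and s3 fall into one cluster and s4 is left as a singleton.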

Usage

	amCluster(
		amDatasetFocal,
		runUntilSingletons = TRUE,
		cutHeight = 0.3,
		missingMethod = 2,
		consensusMethod = 1,
		clusterMethod = "complete"
		)

	amHTML.amCluster(
		x,
		htmlFile = NULL,
		htmlCSS = amCSSForHTML()
		)

	amCSV.amCluster(
		x,
		csvFile
		)

	## S3 method for class 'amCluster'
	summary(
		object,
		html = NULL,
		csv = NULL,
		...
		)

Arguments

`amDatasetFocal`	An `amDataset` object containing genotypes to cluster.
`runUntilSingletons`	When `runUntilSingletons = TRUE`, the analysis runs recursively with the unique individuals determined in one analysis feeding into the next until no more clusters are formed; applicable when the goal is to thin a dataset to unique genotypes. For more manual control over the process, use `runUntilSingletons = FALSE`. See details and examples.
`cutHeight`	Sets the tree cutting height using the hybrid method in the `dynamicTreeCut` package. See details and `cutreeHybrid` for more information.
`missingMethod`	The method used to determine the similarity of multilocus genotypes when data are missing. The default (`missingMethod = 2`) is preferable in all cases. See `amMatrix`.
`consensusMethod`	The method (an integer) used to determine the consensus multilocus genotype from a cluster of multilocus genotypes. See details.
`clusterMethod`	The method used by `hclust` for clustering. Only the default `clusterMethod = "complete"` performs acceptably in simulations. This option remains for experimental reasons.
`object`, `x`	An `amCluster` object.
`htmlFile`	HTML filepath to create. If `htmlFile = NULL`, a file is created in the operating system temporary directory and is then opened in the default browser.
`htmlCSS`	String containing a valid cascading style sheet. A default style sheet is provided in `amCSSForHTML`. See `amCSSForHTML` for details of how to tweak this CSS.
`html`	If `html = NULL` or `html = FALSE`, formatted textual output is displayed on the console. If `html = TRUE`, the `summary.amCluster` method produces and loads an HTML file in the default browser. `html` can also contain a path to a file where HTML output will be written.
`csvFile`, `csv`	CSV filepath to create containing only the unique genotypes determined in the clustering.
`...`	Additional arguments to `summary.amCluster`.

Details

Selecting an appropriate cutHeight parameter (also known as the d-hat criterion) is essential. Typically this function is called from amUnique, and the conversion between alleleMismatch (m-hat) and cutHeight (d-hat) will be done automatically. Selecting an appropriate value for alleleMismatch (m-hat) can be done using amUniqueProfile. See the supplementary documentation for an explanation of how these parameters are related.

runUntilSingletons = TRUE provides an efficient and reliable way to determine the unique individuals in a dataset if the dataset meets certain criteria. To understand how the clustering is thinning the dataset, run this recursion manually using runUntilSingletons = FALSE. An example is provided below.
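The idea of feeding the unique individuals from one pass into the next can be sketched in base R. This is a minimal illustration, not the package's code: toy one-dimensional values stand in for genotypes, a fixed-height `cutree` stands in for the dynamic tree cut, and the helper names `thin_once` and `thin_until_singletons` are hypothetical.

```r
# One pass: cluster, then keep one representative (the medoid) per cluster.
# ASSUMPTION: the medoid stands in for the consensus genotype.
thin_once <- function(x, cut_height) {
  if (length(x) < 2) return(x)  # nothing left to cluster
  d <- as.matrix(dist(x))
  membership <- cutree(hclust(as.dist(d), method = "complete"), h = cut_height)
  vapply(split(seq_along(x), membership), function(idx) {
    # Medoid: the member with the smallest total dissimilarity to the others.
    x[idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]]
  }, numeric(1), USE.NAMES = FALSE)
}

# Repeat passes until a pass forms no clusters (every item is a singleton).
thin_until_singletons <- function(x, cut_height) {
  passes <- 0
  repeat {
    thinned <- thin_once(x, cut_height)
    if (length(thinned) == length(x)) break
    x <- thinned
    passes <- passes + 1
  }
  list(unique = x, passes = passes)
}

result <- thin_until_singletons(c(0, 0.2, 0.25, 0.48), cut_height = 0.3)
print(result)
```

In this toy input the first pass leaves two representatives and the second pass merges them, which is the kind of step-by-step behaviour that a manual run with runUntilSingletons = FALSE makes visible.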

In practice, cutHeight (also known as d-hat) gives the amount of dissimilarity (using the metric described in amMatrix) required for two multilocus genotypes to be declared different. The default setting for consensusMethod performs well.

`consensusMethod`
`1`	Genotype with maximum similarity to others in the cluster is the consensus (DEFAULT)
`2`	Genotype with maximum similarity to others in the cluster is the consensus, then missing alleles are interpolated using the modal non-missing allele in each column
`3`	Genotype with minimum missing data is the consensus
`4`	Genotype with minimum missing data is the consensus, then missing alleles are interpolated using the modal non-missing allele in each column
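As an illustration of the rule described for `consensusMethod = 4`, the following base R sketch picks the genotype with the least missing data and fills its gaps with the modal non-missing allele in each column. The toy matrix and the helper `column_mode` are hypothetical, and tie-breaking here (first match wins) is an assumption rather than documented package behaviour.

```r
# Toy cluster: rows are genotypes, columns are alleles; NA marks missing data.
cluster <- rbind(
  a = c(101, 103, NA, 204),
  b = c(101, NA, 200, NA),
  c = c(NA, 103, 200, NA)
)

# Most frequent non-missing value in a column (first one wins on ties).
column_mode <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(NA_real_)
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

# Start from the genotype with the fewest missing alleles.
consensus <- cluster[which.min(rowSums(is.na(cluster))), ]

# Interpolate its remaining gaps from the column modes.
gaps <- is.na(consensus)
if (any(gaps)) {
  consensus[gaps] <- apply(cluster[, gaps, drop = FALSE], 2, column_mode)
}
print(consensus)
```

Dropping the interpolation step gives the behaviour described for `consensusMethod = 3`.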

Value

An amCluster object, or side effects: the analysis summary is written to an HTML file, to the console, or to a CSV file.

Note

There is an additional side effect of html = TRUE (or of htmlFile = NULL). If required, the operating system temporary directory where AlleleMatch temporary HTML files are stored is cleaned up: files that match the pattern am*.html and are older than 24 hours are deleted from this directory.
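A cleanup of this kind can be sketched in base R as follows. The package's actual cleanup code is not shown on this page, so the function `clean_old_html` and its arguments are hypothetical. The demonstration runs in a dedicated subdirectory so that no real files are touched.

```r
# Delete files matching am*.html in `dir` that are older than `max_age_hours`.
# ASSUMPTION: age is judged by the file modification time.
clean_old_html <- function(dir = tempdir(), max_age_hours = 24) {
  candidates <- list.files(dir, pattern = utils::glob2rx("am*.html"),
                           full.names = TRUE)
  age_hours <- as.numeric(difftime(Sys.time(), file.mtime(candidates),
                                   units = "hours"))
  stale <- candidates[age_hours > max_age_hours]
  unlink(stale)
  invisible(stale)  # return the paths that were removed
}

# Demonstration: one file backdated by 48 hours, one fresh file.
demo_dir <- file.path(tempdir(), "am_cleanup_demo")
dir.create(demo_dir, showWarnings = FALSE)
old_file <- file.path(demo_dir, "am_old.html")
new_file <- file.path(demo_dir, "am_new.html")
file.create(old_file, new_file)
Sys.setFileTime(old_file, Sys.time() - 48 * 3600)

removed <- clean_old_html(demo_dir)
print(basename(removed))
```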

Author(s)

Paul Galpern (pgalpern@gmail.com)

References

For a complete vignette, see the Data S1 supplementary documentation and tutorials (PDF) available at <doi:10.1111/j.1755-0998.2012.03137.x>.

Examples

	## Not run: 
	data("amExample5")

	## Produce amDataset object
	myDataset <-
		amDataset(
			amExample5,
			missingCode = "-99",
			indexColumn = 1,
			metaDataColumn = 2,
			ignoreColumn = "gender"
			)

	## Usage
	myCluster <-
		amCluster(
			myDataset,
			cutHeight = 0.2
			)

	## Display analysis as HTML in default browser
	summary.amCluster(
		myCluster,
		html = TRUE
		)

	## Save analysis to HTML file
	summary.amCluster(
		myCluster,
		html = "myCluster.htm"
		)

	## Display analysis as formatted text on the console
	summary.amCluster(myCluster)

	## Save unique genotypes only to a CSV file
	summary.amCluster(
		myCluster,
		csv = "myCluster.csv"
		)

	## Demonstration of how amCluster operates
	## Manual control over the recursion in amCluster()
	summary.amCluster(
		myCluster1 <-
			amCluster(
				myDataset,
				runUntilSingletons = FALSE,
				cutHeight = 0.2
				),
		html = TRUE
		)
	summary.amCluster(
		myCluster2 <-
			amCluster(
				myCluster1$unique,
				runUntilSingletons = FALSE,
				cutHeight = 0.2
				),
		html = TRUE
		)
	summary.amCluster(
		myCluster3 <-
			amCluster(
				myCluster2$unique,
				runUntilSingletons = FALSE,
				cutHeight = 0.2
				),
		html = TRUE
		)
	summary.amCluster(
		myCluster4 <-
			amCluster(
				myCluster3$unique,
				runUntilSingletons = FALSE,
				cutHeight = 0.2
				),
		html = TRUE
		)
	## No more clusters, therefore stop.
	
## End(Not run)

[Package allelematch version 2.5.4 Index]