R: Leader Algorithm for Data Partitioning

leader {probout}

R Documentation

Leader Algorithm for Data Partitioning

Description

Partitions the data according to Hartigan's leader algorithm, and provides ranges, centroids, and variances for the partitions.

Usage

leader(data, radius = NULL, scale = T)

Arguments

`data`	A numeric vector or matrix of observations. If a matrix, rows correspond to observations and columns correspond to variables.
`radius`	A vector of values for the partitioning radius. Wilkinson's default radius is used if `radius` is left unspecified (see function `LWradius`).
`scale`	A logical variable indicating whether or not the data should be mapped to the unit hypercube. The default is to scale the data. Values of the radius will not be scaled; they should be specifed relative to the unit hypercube unless `scale = F`.

Details

Given a partitioning radius r, the leader algorithm makes one pass through the data, designating an observation as a new leader if it is not within r of an existing leader, and otherwise assigning it to the partition associated with the nearest existing leader. The set of leaders typically depends on the order of the data observations.
If radius = 0, then all of the data observations are leaders, and only radius and leaders are returned as output components.
This implementation does a completely new nearest-neighbor search for each observation and for each radius. A more efficient approach would be to maintain, for each radius, a data structure (such as a kd-tree) allowing fast nearest-neighbor search. These data structures could then be updated to account for new observations. Currently, there doesn't seem to be a way to do this in R.

Value

A list with one component for each value of radius, each having the following sub-components:

`radius`	The value of the radius associated with the partitioning.
`partitions`	A list with one component for each partition, giving the indexes (as observations in the data) of the members of the partition. The first index is that of the associated leader (sometimes called exemplar).
`leaders`	The indexes of the leaders for each partition.
`centroids`	The centroids for each partition, as a matrix with rows corresponding to the partitions and columns corresponding to variables if multidimensional. These will be the data if `radius == 0`.
`variances`	The variances for each partition, as a matrix with rows corresponding to the partitions and columns corresponding to variables if multidimensional.
`ranges`	A list with two components: `min` and `max` giving the minimum and maximum values for each variable for each partition. These `range` components are given as a matrix with rows corresponding to the partitions and columns corresponding to variables if multidimensional.
`maxdist`	A vector with one value for each partition, giving the largest distance from each leader to any member of its partition.

References

J. A. Hartigan, Clustering Algorithms, Wiley, 1975.

L. Wilkinson, Visualizing Outliers, Technical Report, University of Illinois at Chicago, 2016. https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf

Examples


 radius.default <- LWradius(nrow(faithful),ncol(faithful))
 lead <- leader(faithful, radius = c(0,radius.default))

# number of partitions for each radius
 sapply(lead, function(x) length(x$partitions))

# plot the leaders for the non-zero radius
 plot( faithful[,1], faithful[,2], 
       main = "blue indicates leaders (default radius)", 
       pch = 16, cex = .5)
 ldrs <- lead[[2]]$leaders
 points( faithful[ldrs,1], faithful[ldrs,2], 
         pch = 8, col = "dodgerblue", cex = .5)

[Package probout version 1.1.2 Index]