leader {probout} | R Documentation |
Leader Algorithm for Data Partitioning
Description
Partitions the data according to Hartigan's leader algorithm, and provides ranges, centroids, and variances for the partitions.
Usage
leader(data, radius = NULL, scale = T)
Arguments
data |
A numeric vector or matrix of observations. If a matrix, rows correspond to observations and columns correspond to variables. |
radius |
A vector of values for the partitioning radius. Wilkinson's default
radius is used if |
scale |
A logical variable indicating whether or not the data should be mapped
to the unit hypercube. The default is to scale the data. Values of the
radius will not be scaled; they should be specifed relative to the unit
hypercube unless |
Details
Given a partitioning radius r
, the leader algorithm makes one pass
through the data, designating an observation as a new leader if it is not
within r
of an existing leader, and otherwise assigning it to the
partition associated with the nearest existing leader. The set of leaders
typically depends on the order of the data observations.
If radius = 0
, then all of the data observations are leaders, and
only radius
and leaders
are returned as output components.
This implementation does a completely new nearest-neighbor search for
each observation and for each radius. A more efficient approach would be to
maintain, for each radius, a data structure (such as a kd-tree) allowing
fast nearest-neighbor search. These data structures could then be updated
to account for new observations. Currently, there doesn't seem to be a way
to do this in R.
Value
A list with one component for each value of radius
, each having the
following sub-components:
radius |
The value of the radius associated with the partitioning. |
partitions |
A list with one component for each partition, giving the indexes (as observations in the data) of the members of the partition. The first index is that of the associated leader (sometimes called exemplar). |
leaders |
The indexes of the leaders for each partition. |
centroids |
The centroids for each partition, as a matrix with rows corresponding to
the partitions and columns corresponding to variables if multidimensional.
These will be the data if |
variances |
The variances for each partition, as a matrix with rows corresponding to the partitions and columns corresponding to variables if multidimensional. |
ranges |
A list with two components: |
maxdist |
A vector with one value for each partition, giving the largest distance from each leader to any member of its partition. |
References
J. A. Hartigan, Clustering Algorithms, Wiley, 1975.
L. Wilkinson, Visualizing Outliers, Technical Report, University of
Illinois at Chicago, 2016.
https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf
See Also
Examples
radius.default <- LWradius(nrow(faithful),ncol(faithful))
lead <- leader(faithful, radius = c(0,radius.default))
# number of partitions for each radius
sapply(lead, function(x) length(x$partitions))
# plot the leaders for the non-zero radius
plot( faithful[,1], faithful[,2],
main = "blue indicates leaders (default radius)",
pch = 16, cex = .5)
ldrs <- lead[[2]]$leaders
points( faithful[ldrs,1], faithful[ldrs,2],
pch = 8, col = "dodgerblue", cex = .5)