R: Find a target number of clusters in the data using convex...

convex_clustering {CCMMR}

R Documentation

Find a target number of clusters in the data using convex clustering

Description

convex_clustering attempts to find the number of clusters specified by the user by means of convex clustering. The algorithm looks for each number of clusters between target_low and target_high. If target_low = target_high, the algorithm searches for a single clustering. It is recommended to specify a range around the desired number of clusters, as not each number of clusters between 1 and nrow(X) may be attainable due to numerical inaccuracies.

Usage

convex_clustering(
  X,
  W,
  target_low,
  target_high = NULL,
  max_iter_phase_1 = 2000,
  max_iter_phase_2 = 20,
  lambda_init = 0.01,
  factor = 0.025,
  tau = 0.001,
  center = TRUE,
  scale = TRUE,
  eps_conv = 1e-06,
  burnin_iter = 25,
  max_iter_conv = 5000,
  save_clusterpath = FALSE,
  verbose = 0
)

Arguments

`X`	An `n` x `p` numeric matrix. This function assumes that each row represents an object with `p` attributes.
`W`	A `sparseweights` object, see sparse_weights.
`target_low`	Lower bound on the number of clusters that should be searched for. If `target_high = NULL`, this is the exact number of clusters that is searched for.
`target_high`	Upper bound on the number of clusters that should be searched for. Default is `NULL`, in that case, it is set equal to `target_low`.
`max_iter_phase_1`	Maximum number of iterations to find an upper and lower bound for the value for lambda for which the desired number of clusters is attained. Default is 2000.
`max_iter_phase_2`	Maximum number of iterations to to refine the upper and lower bounds for lambda. Default is 20.
`lambda_init`	The first value for lambda other than 0 to use for convex clustering. Default is 0.01.
`factor`	The percentage by which to increase lambda in each step. Default is 0.025.
`tau`	Parameter to compute the threshold to fuse clusters. Default is 0.001.
`center`	If `TRUE`, center `X` so that each column has mean zero. Default is `TRUE`.
`scale`	If `TRUE`, scale the loss function to ensure that the cluster solution is invariant to the scale of `X`. Default is `TRUE`. Not recommended to set to `FALSE` unless comparing to algorithms that minimize the unscaled convex clustering loss function.
`eps_conv`	Parameter for determining convergence of the minimization. Default is 1e-6.
`burnin_iter`	Number of updates of the loss function that are done without step doubling. Default is 25.
`max_iter_conv`	Maximum number of iterations for minimizing the loss function. Default is 5000.
`save_clusterpath`	If `TRUE`, store the solution that minimized the loss function for each lambda. Is required for drawing the clusterpath. Default is `FALSE`. To store the clusterpath coordinates, `n` x `p` x `no. lambdas` values have to be stored, this may require too much memory for large data sets.
`verbose`	Verbosity of the information printed during clustering. Default is 0, no output.

Value

A cvxclust object containing the following

`info`	A dataframe containing for each value for lambda: the number of different clusters, and the value of the loss function at the minimum.
`merge`	The merge table containing the order at which the observations in `X` are clustered.
`height`	The value for lambda at which each reduction in the number of clusters occurs.
`order`	The order of the observations in `X` in order to draw a dendrogram without conflicting branches.
`elapsed_time`	The number of seconds that elapsed while running the code. Note that this does not include the time required for input checking and possibly scaling and centering `X`.
`coordinates`	The clusterpath coordinates. Only part of the output in case that `save_clusterpath=TRUE`.
`lambdas`	The values for lambda for which a clustering was found.
`eps_fusions`	The threshold for cluster fusions that was used by the algorithm.
`phase_1_instances`	The number of instances of the loss function that were minimized while finding an upper and lower bound for lambda. The sum `phase_1_iterations + phase_2_iterations` gives the total number of instances solved.
`phase_2_instances`	The number of instances of the loss function that were minimized while refining the value for lambda. The sum `phase_1_iterations + phase_2_iterations` gives the total number of instances solved.
`num_clusters`	The different numbers of clusters that have been found.
`n`	The number of observations in `X`.

Examples

# Load data
data(two_half_moons)
data = as.matrix(two_half_moons)
X = data[, -3]
y = data[, 3]

# Get sparse weights in dictionary of keys format with k = 5 and phi = 8
W = sparse_weights(X, 5, 8.0)

# Perform convex clustering with a target number of clusters
res1 = convex_clustering(X, W, target_low = 2, target_high = 5)

# Plot the clustering for 2 to 5 clusters
oldpar = par(mfrow=c(2, 2))
plot(X, col = clusters(res1, 2), main = "2 clusters", pch = 19)
plot(X, col = clusters(res1, 3), main = "3 clusters", pch = 19)
plot(X, col = clusters(res1, 4), main = "4 clusters", pch = 19)
plot(X, col = clusters(res1, 5), main = "5 clusters", pch = 19)

# A more generalized approach to plotting the results of a range of clusters
res2 = convex_clustering(X, W, target_low = 2, target_high = 7)

# Plot the clusterings
k = length(res2$num_clusters)
par(mfrow=c(ceiling(k / ceiling(sqrt(k))), ceiling(sqrt(k))))

for (i in 1:k) {
    labels = clusters(res2, res2$num_clusters[i])
    c = length(unique(labels))

    plot(X, col = labels, main = paste(c, "clusters"), pch = 19)
}
par(oldpar)