TGL_kmeans_tidy {tglkmeans} | R Documentation |
TGL kmeans with 'tidy' output
Description
TGL kmeans with 'tidy' output
Usage
TGL_kmeans_tidy(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  add_to_data = FALSE,
  hclust_intra_clusters = FALSE,
  seed = NULL,
  parallel = getOption("tglkmeans.parallel"),
  use_cpp_random = FALSE
)
Arguments
df |
a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used. |
k |
number of clusters. Note that in some cases the algorithm might return fewer clusters than k. |
metric |
distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'. |
max_iter |
maximal number of iterations |
min_delta |
minimal change in assignments (fraction out of all observations) to continue iterating |
verbose |
display algorithm messages |
keep_log |
keep algorithm messages in 'log' field |
id_column |
whether the first column of df contains the observation ids (see df) |
reorder_func |
function to reorder the clusters. operates on each center and orders by the result. e.g. |
add_to_data |
also return the original data frame with an extra 'clust' column holding the cluster ids ('id' is the first column) |
hclust_intra_clusters |
run hierarchical clustering within each cluster and return an ordering of the observations. |
seed |
seed for the c++ random number generator |
parallel |
run the hierarchical clustering of each cluster in parallel (if hclust_intra_clusters is TRUE) |
use_cpp_random |
use c++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's. |
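Two of the arguments above describe small computations that can be illustrated in base R. The sketch below is not the package's internal code, and every name in it (prev, curr, centers, score, new_order) is made up for illustration. It assumes that min_delta is compared against the fraction of observations whose assignment changed between two consecutive iterations, and that reorder_func is applied to each center (one row per cluster) with the clusters then sorted by the result.

```r
# Hypothetical assignments of six observations in two consecutive iterations
prev <- c(1, 1, 2, 2, 3, 3)
curr <- c(1, 2, 2, 2, 3, 3)

# Fraction of observations that changed cluster (one out of six here)
delta <- mean(prev != curr)

# Assumed stopping rule: keep iterating only while the change is at least min_delta
min_delta <- 0.0001
keep_iterating <- delta >= min_delta

# Hypothetical cluster centers: one row per cluster, one column per dimension
centers <- matrix(
  c(5.0, 5.2,
    1.0, 0.8,
    3.0, 3.1),
  ncol = 2, byrow = TRUE
)

# Apply a reordering function (here: mean) to each center and order by the result
reorder_func <- mean
score <- apply(centers, 1, reorder_func)
new_order <- order(score)

# Centers sorted by their score, lowest first
centers[new_order, , drop = FALSE]
```

Using mean here is only one possible choice; any function that maps a center to a single number would sort the clusters in the same way.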
Value
list with the following components:
- cluster:
tibble with 'id' column with the observation id ('1:n' if no id column was supplied), and 'clust' column with the cluster assigned to each observation.
- centers:
tibble with 'clust' column and the cluster centers.
- size:
tibble with 'clust' column and 'n' column with the number of points in each cluster.
- data:
tibble with 'clust' column and the original data frame (only if add_to_data = TRUE).
- log:
messages from the algorithm run (only if keep_log = TRUE).
- order:
tibble with 'id' column, 'clust' column, 'order' column with a new ordering of the observations and 'intra_clust_order' column with the order within each cluster (only if hclust_intra_clusters = TRUE).
See Also
Examples
# create 5 clusters normally distributed around 1:5
d <- simulate_data(
  n = 100,
  sd = 0.3,
  nclust = 5,
  dims = 2,
  add_true_clust = FALSE,
  id_column = FALSE
)
head(d)
# cluster
km <- TGL_kmeans_tidy(d, k = 5, metric = "euclid", verbose = TRUE)
km
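# A further sketch (not part of the original examples): inspect the
# components listed under Value. The names d2 and km2 are made up here,
# and the exact contents of each tibble depend on the simulated data.
library(tglkmeans)
d2 <- simulate_data(
  n = 100,
  sd = 0.3,
  nclust = 5,
  dims = 2,
  add_true_clust = FALSE,
  id_column = FALSE
)
# also request the original data and an ordering within each cluster
km2 <- TGL_kmeans_tidy(
  d2,
  k = 5,
  metric = "euclid",
  add_to_data = TRUE,
  hclust_intra_clusters = TRUE
)
# cluster assigned to each observation
head(km2$cluster)
# cluster centers and cluster sizes
km2$centers
km2$size
# original data with the extra 'clust' column (add_to_data = TRUE)
head(km2$data)
# ordering of the observations (hclust_intra_clusters = TRUE)
head(km2$order)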