simple_kmeans_db {modeldb} | R Documentation |
Simple kmeans routine that works in-database
Description
It uses 'tidyeval' and 'dplyr' to run multiple cycles of kmeans calculations, expressed in 'dplyr' formulas, until the optimal centers are found.
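To make the idea of a "cycle" concrete, here is a minimal sketch of one assign-and-update step written only with 'dplyr' verbs. It is not the package's actual implementation: the helper `kmeans_cycle()`, the toy tables `points` and `centers`, and the column names `id`, `x`, `y`, `center`, `cx` and `cy` are all made up for illustration. It assumes dplyr 1.1.0 or later for `cross_join()`.

```r
library(dplyr)

# Hypothetical helper (not part of modeldb): one kmeans cycle for two variables.
kmeans_cycle <- function(points, centers) {
  points %>%
    cross_join(centers) %>%                       # pair every point with every center
    mutate(dist = (x - cx)^2 + (y - cy)^2) %>%    # squared Euclidean distance
    group_by(id) %>%
    filter(dist == min(dist)) %>%                 # keep the nearest center (ties are not handled here)
    group_by(center) %>%
    summarise(cx = mean(x), cy = mean(y), .groups = "drop")  # move each center to its cluster mean
}

# Toy data: two well-separated groups of three points each.
points <- tibble(
  id = 1:6,
  x  = c(1, 1.5, 1, 8, 9, 8.5),
  y  = c(1, 2, 1.5, 8, 9, 9.5)
)
centers <- tibble(center = 1:2, cx = c(0, 10), cy = c(0, 10))

# Repeat the cycle until the centers stop moving or a cycle limit is reached.
for (i in seq_len(10)) {
  new_centers <- kmeans_cycle(points, centers)
  converged <- isTRUE(all.equal(as.data.frame(new_centers), as.data.frame(centers)))
  centers <- new_centers
  if (converged) break
}

centers
```

Because every step is an ordinary 'dplyr' pipeline, the same logic could in principle be translated to SQL by a remote backend, which is the property the function relies on.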
Usage
simple_kmeans_db(
df,
...,
centers = 3,
max_repeats = 100,
initial_kmeans = NULL,
safeguard_file = "kmeans.csv",
verbose = TRUE
)
Arguments
df |
A local or remote data frame |
... |
A list of variables to be used in the kmeans algorithm |
centers |
The number of centers. Defaults to 3. |
max_repeats |
The maximum number of cycles to run. Defaults to 100. |
initial_kmeans |
A local data frame with initial centroid values. Defaults to NULL. |
safeguard_file |
Each cycle will update a file specified in this argument with the current centers. Defaults to 'kmeans.csv'. Pass NULL if no file is desired. |
verbose |
Indicates whether the progress bar is displayed while the model is being fitted. Defaults to TRUE. |
Details
Because each cycle is an independent 'dplyr' operation, or SQL operation if using a remote source,
the latest centroid data frame is saved to the parent environment in case the process needs to be
canceled and then restarted at a later point. Passing the current_kmeans as the initial_kmeans
will allow the operation to pick up where it left off.
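The restart described above might look like the following sketch. It assumes that an earlier, interrupted run left a `current_kmeans` object in the calling environment, as this section states; if no such object exists, the call falls back to the documented default of `initial_kmeans = NULL`. The names `previous` and `fit` are arbitrary, and `safeguard_file = NULL` is used only so that the sketch does not write a file.

```r
library(dplyr)
library(modeldb)

# Reuse the centers from an interrupted run if they are available;
# otherwise start from scratch (NULL is the documented default).
previous <- if (exists("current_kmeans")) current_kmeans else NULL

fit <- mtcars %>%
  simple_kmeans_db(
    mpg, qsec, wt,
    initial_kmeans = previous,  # pick up where the earlier run left off
    safeguard_file = NULL,      # skip the CSV file for this sketch
    verbose = FALSE             # hide the progress bar
  )
```

When running against a remote source, keeping the default `safeguard_file` may be the safer choice, since the centers written to the file survive even if the R session itself is lost.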
Examples
library(dplyr)
library(modeldb)

# Cluster mtcars on three variables (3 centers by default) and inspect the result
mtcars %>%
  simple_kmeans_db(mpg, qsec, wt) %>%
  glimpse()