h2o.kmeans {h2o} | R Documentation |
Performs k-means clustering on an H2O dataset
Description
Performs k-means clustering on an H2O dataset
Usage
h2o.kmeans(
training_frame,
x,
model_id = NULL,
validation_frame = NULL,
nfolds = 0,
keep_cross_validation_models = TRUE,
keep_cross_validation_predictions = FALSE,
keep_cross_validation_fold_assignment = FALSE,
fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
fold_column = NULL,
ignore_const_cols = TRUE,
score_each_iteration = FALSE,
k = 1,
estimate_k = FALSE,
user_points = NULL,
max_iterations = 10,
standardize = TRUE,
seed = -1,
init = c("Random", "PlusPlus", "Furthest", "User"),
max_runtime_secs = 0,
categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary",
"Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
export_checkpoints_dir = NULL,
cluster_size_constraints = NULL
)
Arguments
training_frame |
Id of the training data frame. |
x |
A vector containing the names of the predictor columns to use in the model, for example c("AGE", "RACE"). |
model_id |
Destination id for this model; auto-generated if not specified. |
validation_frame |
Id of the validation data frame. |
nfolds |
Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0. |
keep_cross_validation_models |
Logical. Whether to keep the cross-validation models. Defaults to TRUE. |
keep_cross_validation_predictions |
Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE. |
keep_cross_validation_fold_assignment |
Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE. |
fold_assignment |
Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO. |
fold_column |
Column with cross-validation fold index assignment per observation. |
ignore_const_cols |
Logical. Whether to ignore constant columns. Defaults to TRUE. |
score_each_iteration |
Logical. Whether to score during each iteration of model training. Defaults to FALSE. |
k |
The maximum number of clusters. If estimate_k is disabled, the model finds exactly k centroids; otherwise it finds up to k centroids. Defaults to 1. |
estimate_k |
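Logical. Whether to estimate the number of clusters (up to k) instead of using exactly k. Defaults to FALSE. |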
user_points |
A data frame in which each row represents an initial cluster center. The user-specified points must have the same number of columns as the training observations, and the number of rows must equal the number of clusters. |
max_iterations |
Maximum number of training iterations (if estimate_k is enabled, this limit applies to each inner Lloyd's iteration). Defaults to 10. |
standardize |
Logical. Whether to standardize columns before computing distances. Defaults to TRUE. |
seed |
Seed for the random number generator. It affects the stochastic parts of the algorithm, which may or may not be enabled by default. Defaults to -1 (time-based random number). |
init |
Initialization mode. Must be one of: "Random", "PlusPlus", "Furthest", "User". Defaults to Furthest. |
max_runtime_secs |
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0. |
categorical_encoding |
Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO. |
export_checkpoints_dir |
Automatically export generated models to this directory. |
cluster_size_constraints |
An array specifying the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters. |
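The user_points and init arguments described above can be combined roughly as follows. This is a minimal sketch that assumes a running H2O cluster and the prostate.csv file shipped with the h2o package; the starting values in start_points are hypothetical and chosen only for illustration, with one row per cluster and one column per predictor.

```r
library(h2o)
h2o.init()

prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.uploadFile(path = prostate_path)
predictors <- c("AGE", "RACE", "VOL", "GLEASON")

# Hypothetical initial centers: 3 rows (one per cluster), 4 columns (one per predictor)
start_points <- as.h2o(data.frame(
  AGE = c(55, 65, 75),
  RACE = c(1, 1, 1),
  VOL = c(0, 15, 40),
  GLEASON = c(5, 6, 8)
))

# k matches the number of rows in start_points; init = "User" selects the supplied centers
model <- h2o.kmeans(
  training_frame = prostate,
  x = predictors,
  k = 3,
  init = "User",
  user_points = start_points
)

# Inspect the fitted cluster centers
h2o.centers(model)
```

The names predictors, start_points, and model are illustrative and not part of the h2o API.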
Value
An object of class H2OClusteringModel.
See Also
h2o.cluster_sizes, h2o.totss, h2o.num_iterations, h2o.betweenss, h2o.tot_withinss, h2o.withinss, h2o.centersSTD, h2o.centers
Examples
## Not run:
library(h2o)
h2o.init()
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.uploadFile(path = prostate_path)
h2o.kmeans(training_frame = prostate, k = 10, x = c("AGE", "RACE", "VOL", "GLEASON"))
## End(Not run)
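As a further illustration, the sketch below lets the algorithm estimate the number of clusters (up to k) and then inspects the result with some of the accessor functions listed under See Also. It assumes a running H2O cluster and the prostate.csv file shipped with the h2o package; the seed value is arbitrary and the variable names are illustrative.

```r
library(h2o)
h2o.init()

prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.uploadFile(path = prostate_path)
predictors <- c("AGE", "RACE", "VOL", "GLEASON")

# With estimate_k = TRUE, k acts as an upper bound on the number of clusters
model <- h2o.kmeans(
  training_frame = prostate,
  x = predictors,
  k = 10,
  estimate_k = TRUE,
  seed = 1234  # arbitrary seed, set only for repeatability
)

# Summarize the fitted model on the training data
sizes <- h2o.cluster_sizes(model)  # number of rows assigned to each cluster
print(sizes)
print(h2o.centers(model))          # cluster centers
print(h2o.tot_withinss(model))     # total within-cluster sum of squares
print(h2o.betweenss(model))        # between-cluster sum of squares
print(h2o.totss(model))            # total sum of squares
print(h2o.num_iterations(model))   # iterations performed
```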