KMeansTrainer {superml} | R Documentation
K-Means Trainer
Description
Trains a k-means machine learning model in R
Details
Trains an unsupervised K-Means clustering algorithm. It borrows the mini-batch k-means function from the ClusterR package, which is written in C++ and is therefore quite fast.
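As a rough illustration of the underlying routine, the sketch below calls ClusterR::MiniBatchKmeans directly on toy data. It assumes ClusterR is installed. The toy data and the variable names x and mb are made up for this example, and the exact way KMeansTrainer forwards its fields to ClusterR is not confirmed by this page.

```r
library(ClusterR)

set.seed(42)
# Toy data (hypothetical): two well-separated groups of 200 rows, 5 columns each
x <- rbind(matrix(rnorm(1000, mean = 0), ncol = 5),
           matrix(rnorm(1000, mean = 6), ncol = 5))

# Mini-batch k-means: centroids are updated from small random batches of rows
mb <- MiniBatchKmeans(x, clusters = 2, batch_size = 20,
                      num_init = 1, max_iters = 100,
                      initializer = "kmeans++", seed = 1)

# One row per cluster, one column per data column
dim(mb$centroids)
```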
Public fields
clusters
the number of clusters
batch_size
the size of the mini batches
num_init
number of times the algorithm will be run with different centroid seeds
max_iters
the maximum number of clustering iterations
init_fraction
fraction of the data to use for initializing the centroids (applies if initializer is kmeans++ or optimal_init). Should be a float number between 0.0 and 1.0.
initializer
the method of initialization. One of optimal_init, quantile_init, kmeans++ or random.
early_stop_iter
the number of iterations to continue after the best within-cluster sum of squared error has been calculated
verbose
either TRUE or FALSE, indicating whether progress is printed during clustering
centroids
a matrix of initial cluster centroids. The number of rows of the CENTROIDS matrix should equal the number of clusters, and the number of columns should equal the number of columns of the data
tol
a float number. If, at any iteration after the first and before max_iters, "tol" is greater than the squared norm of the centroids, k-means is considered to have converged
tol_optimal_init
tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are.
seed
integer value for random number generator (RNG)
model
used for internal purposes
max_clusters
either a numeric value, or a contiguous or non-contiguous numeric vector, specifying the cluster search space
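Because these are public fields of an R6 object, they can be read and changed after construction. The following is a minimal sketch, assuming superml (and its ClusterR dependency) is installed; the object name km and the chosen values are illustrative only.

```r
library(superml)

km <- KMeansTrainer$new(clusters = 3, batch_size = 50,
                        initializer = "quantile_init", seed = 123)

# Public fields hold the settings passed to new()
km$clusters
km$initializer

# Existing fields can be adjusted before calling fit()
km$max_iters <- 200
km$verbose <- TRUE
```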
Methods
Public methods
Method new()
Usage
KMeansTrainer$new( clusters, batch_size = 10, num_init = 1, max_iters = 100, init_fraction = 1, initializer = "kmeans++", early_stop_iter = 10, verbose = FALSE, centroids = NULL, tol = 1e-04, tol_optimal_init = 0.3, seed = 1, max_clusters = NA )
Arguments
clusters
numeric, the number of clusters
batch_size
numeric, the size of the mini batches
num_init
integer, number of times the algorithm will be run with different centroid seeds
max_iters
integer, the maximum number of clustering iterations
init_fraction
numeric, fraction of the data to use for initializing the centroids (applies if initializer is kmeans++ or optimal_init). Should be a float number between 0.0 and 1.0.
initializer
character, the method of initialization. One of optimal_init, quantile_init, kmeans++ or random.
early_stop_iter
the number of iterations to continue after the best within-cluster sum of squared error has been calculated
verbose
either TRUE or FALSE, indicating whether progress is printed during clustering
centroids
a matrix of initial cluster centroids. The number of rows of the CENTROIDS matrix should equal the number of clusters, and the number of columns should equal the number of columns of the data
tol
a float number. If, at any iteration after the first and before max_iters, "tol" is greater than the squared norm of the centroids, k-means is considered to have converged
tol_optimal_init
tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are.
seed
integer value for random number generator (RNG)
max_clusters
either a numeric value, or a contiguous or non-contiguous numeric vector, specifying the cluster search space
Details
Create a new 'KMeansTrainer' object.
Returns
A 'KMeansTrainer' object.
Examples
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
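The shape requirement on the centroids argument (one row per cluster, one column per data column) can be sketched as follows. The toy data and the starting values in init_centroids are hypothetical, and it is assumed here that new() simply stores the matrix in the centroids field.

```r
library(superml)

set.seed(7)
# Toy data (hypothetical): two groups of 100 rows, 4 columns each
data <- rbind(matrix(rnorm(400, mean = 2), ncol = 4),
              matrix(rnorm(400, mean = -3), ncol = 4))

# One row per cluster, one column per data column
init_centroids <- rbind(rep(2, ncol(data)),
                        rep(-3, ncol(data)))

km_model <- KMeansTrainer$new(clusters = 2, batch_size = 30,
                              centroids = init_centroids)

# Rows should match clusters, columns should match ncol(data)
dim(km_model$centroids)
```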
Method fit()
Usage
KMeansTrainer$fit(X, y = NULL, find_optimal = FALSE)
Arguments
X
data.frame or matrix containing features
y
NULL; kept only for consistency with superml's standard fit() signature
find_optimal
logical, whether to find the optimal number of clusters automatically
Details
Trains the KMeansTrainer model
Returns
NULL
Examples
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
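A variant with find_optimal = TRUE might look like the sketch below, which uses smaller toy data so the search over the max_clusters space stays quick. How the optimal number is selected, and whether the clusters field is overwritten with it, are not stated on this page and are treated as unknowns here.

```r
library(superml)

set.seed(10)
# Toy data (hypothetical): three groups of 200 rows, 3 columns each
data <- rbind(matrix(rnorm(600, mean = 2), ncol = 3),
              matrix(rnorm(600, mean = -1), ncol = 3),
              matrix(rnorm(600, mean = 5), ncol = 3))

# max_clusters defines the search space used when find_optimal = TRUE
km_model <- KMeansTrainer$new(clusters = 2, batch_size = 30, max_clusters = 6)
km_model$fit(data, find_optimal = TRUE)

# Inspect the number of clusters held by the trainer after fitting
km_model$clusters
```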
Method predict()
Usage
KMeansTrainer$predict(X)
Arguments
X
data.frame or matrix
Details
Returns the predicted cluster assignments for the given data
Returns
a vector of predictions
Examples
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
predictions <- km_model$predict(data)
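To check what predict() returns, the sketch below fits on small toy data and tabulates the assignments. The data and seed are made up for illustration; the cluster labels themselves are arbitrary identifiers and no particular numbering is assumed.

```r
library(superml)

set.seed(1)
# Toy data (hypothetical): two groups of 100 rows, 5 columns each
data <- rbind(matrix(rnorm(500, mean = 2), ncol = 5),
              matrix(rnorm(500, mean = -4), ncol = 5))

km_model <- KMeansTrainer$new(clusters = 2, batch_size = 30)
km_model$fit(data)
predictions <- km_model$predict(data)

# One prediction per input row
length(predictions) == nrow(data)

# Number of rows assigned to each cluster
table(predictions)
```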
Method clone()
The objects of this class are cloneable with this method.
Usage
KMeansTrainer$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
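Since R6 objects have reference semantics, plain assignment creates a second name for the same trainer, while clone() creates an independent copy. A minimal sketch, assuming superml is installed; the object names are illustrative.

```r
library(superml)

km_a <- KMeansTrainer$new(clusters = 2, batch_size = 30)

# Plain assignment: both names refer to the same object
km_same <- km_a

# clone(): an independent copy whose fields can diverge
km_b <- km_a$clone()
km_b$clusters <- 4

km_a$clusters     # unchanged by the edit to km_b
km_same$clusters  # same object as km_a
km_b$clusters
```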
Examples
## ------------------------------------------------
## Method `KMeansTrainer$new`
## ------------------------------------------------
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
## ------------------------------------------------
## Method `KMeansTrainer$fit`
## ------------------------------------------------
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
## ------------------------------------------------
## Method `KMeansTrainer$predict`
## ------------------------------------------------
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
predictions <- km_model$predict(data)