| KMeansTrainer {superml} | R Documentation |
K-Means Trainer
Description
Trains a k-means machine learning model in R
Details
Trains an unsupervised K-Means clustering model. It uses the mini-batch k-means function from the ClusterR package, which is written in C++ and is therefore quite fast.
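To make the mini-batch idea concrete, the following is a minimal base R sketch of the textbook mini-batch k-means update. It is not the ClusterR implementation; the function name mini_batch_kmeans and the toy data are hypothetical and exist only for illustration. Each iteration samples a small batch, assigns its points to the nearest centroid, and moves each centroid towards its points with a step size that shrinks as the centroid accumulates points.

```r
# Illustrative mini-batch k-means in base R (NOT the ClusterR implementation).
# mini_batch_kmeans() is a hypothetical helper written for this sketch.
mini_batch_kmeans <- function(X, k, batch_size = 10, max_iters = 100, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  # Start from k randomly chosen rows of the data.
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  counts <- numeric(k)  # number of points each centroid has absorbed so far
  for (iter in seq_len(max_iters)) {
    batch <- X[sample(nrow(X), batch_size), , drop = FALSE]
    # Assign every batch point to its nearest centroid (squared Euclidean distance).
    nearest <- apply(batch, 1, function(x) {
      which.min(rowSums(sweep(centroids, 2, x)^2))
    })
    # Move each assigned centroid towards its point with a decaying step size.
    for (i in seq_along(nearest)) {
      j <- nearest[i]
      counts[j] <- counts[j] + 1
      eta <- 1 / counts[j]
      centroids[j, ] <- (1 - eta) * centroids[j, ] + eta * batch[i, ]
    }
  }
  centroids
}

# Toy data: two well-separated groups in two dimensions.
set.seed(42)
toy <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
             matrix(rnorm(200, mean = 6), ncol = 2))
centers <- mini_batch_kmeans(toy, k = 2, batch_size = 20, max_iters = 50)
dim(centers)  # one row per cluster, one column per feature
```

Because only batch_size rows are touched per iteration, the cost of an iteration does not grow with the number of rows in the data.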
Public fields
clusters: the number of clusters
batch_size: the size of the mini batches
num_init: number of times the algorithm will be run with different centroid seeds
max_iters: the maximum number of clustering iterations
init_fraction: fraction of the data to use for initializing the centroids (applies if initializer is kmeans++ or optimal_init). Should be a float between 0.0 and 1.0.
initializer: the method of initialization. One of optimal_init, quantile_init, kmeans++, or random.
early_stop_iter: number of additional iterations to run after the best within-cluster sum of squared error has been found
verbose: either TRUE or FALSE, indicating whether progress is printed during clustering
centroids: a matrix of initial cluster centroids. The number of rows should equal the number of clusters and the number of columns should equal the number of columns of the data.
tol: a float. If, at an iteration (iteration > 1 and iteration < max_iters), "tol" is greater than the squared norm of the centroids, then k-means has converged.
tol_optimal_init: tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart the centroids are.
seed: integer value for the random number generator (RNG)
model: used internally
max_clusters: either a single numeric value or a numeric vector (contiguous or non-contiguous) specifying the cluster search space
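The tol rule can be illustrated with a short base R sketch. It assumes that "the squared norm of the centroids" refers to the squared norm of the change in centroids between two consecutive iterations; that reading is an assumption, and has_converged is a hypothetical helper, not part of superml or ClusterR.

```r
# Hypothetical convergence check, assuming tol is compared against the
# squared norm of the centroid change between consecutive iterations.
has_converged <- function(old_centroids, new_centroids, tol = 1e-4) {
  sum((new_centroids - old_centroids)^2) < tol
}

old <- matrix(c(0, 0, 1, 1), nrow = 2)  # 2 clusters x 2 features
has_converged(old, old + 1e-3)  # TRUE: centroids barely moved
has_converged(old, old + 1)     # FALSE: centroids moved a lot
```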
Methods
Public methods
Method new()
Usage
KMeansTrainer$new( clusters, batch_size = 10, num_init = 1, max_iters = 100, init_fraction = 1, initializer = "kmeans++", early_stop_iter = 10, verbose = FALSE, centroids = NULL, tol = 1e-04, tol_optimal_init = 0.3, seed = 1, max_clusters = NA )
Arguments
clusters: numeric, the number of clusters
batch_size: numeric, the size of the mini batches (default: 10)
num_init: integer, number of times the algorithm will be run with different centroid seeds (default: 1)
max_iters: integer, the maximum number of clustering iterations (default: 100)
init_fraction: numeric, fraction of the data to use for initializing the centroids (applies if initializer is kmeans++ or optimal_init). Should be a float between 0.0 and 1.0 (default: 1).
initializer: character, the method of initialization. One of optimal_init, quantile_init, kmeans++, or random (default: "kmeans++").
early_stop_iter: number of additional iterations to run after the best within-cluster sum of squared error has been found
verbose: either TRUE or FALSE, indicating whether progress is printed during clustering
centroids: a matrix of initial cluster centroids. The number of rows should equal the number of clusters and the number of columns should equal the number of columns of the data.
tol: a float. If, at an iteration (iteration > 1 and iteration < max_iters), "tol" is greater than the squared norm of the centroids, then k-means has converged.
tol_optimal_init: tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart the centroids are.
seed: integer value for the random number generator (RNG)
max_clusters: either a single numeric value or a numeric vector (contiguous or non-contiguous) specifying the cluster search space
Details
Create a new 'KMeansTrainer' object.
Returns
A 'KMeansTrainer' object.
Examples
library(superml)
# three groups of 10,000 rows each, 20 features per row
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
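The centroids argument expects a matrix with one row per cluster and one column per feature. A minimal sketch of building such a matrix from randomly chosen data rows is shown below; the smaller data set and the variable names k and init_centroids are illustrative choices, not part of the package.

```r
library(superml)

set.seed(1)
# Smaller illustrative data set: 2,000 rows, 20 features.
data <- rbind(replicate(20, rnorm(1e3, 2)),
              replicate(20, rnorm(1e3, -1)))

k <- 2
# Pick k data rows as starting centroids: k rows, ncol(data) columns.
init_centroids <- data[sample(nrow(data), k), , drop = FALSE]
stopifnot(nrow(init_centroids) == k, ncol(init_centroids) == ncol(data))

km_model <- KMeansTrainer$new(clusters = k, batch_size = 30,
                              centroids = init_centroids, seed = 42)
km_model$clusters  # public fields can be inspected on the object
```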
Method fit()
Usage
KMeansTrainer$fit(X, y = NULL, find_optimal = FALSE)
Arguments
X: data.frame or matrix containing features
y: NULL; kept only for consistency with superml's standard fit() signature
find_optimal: logical, whether to find the optimal number of clusters automatically
Details
Trains the KMeansTrainer model
Returns
NULL
Examples
library(superml)
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
# fit with the number of clusters given at construction
km_model$fit(data, find_optimal = FALSE)
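A sketch of the find_optimal = TRUE path follows, using max_clusters as the search space as described above. Whether the trainer overwrites the clusters field with the selected value is an assumption here, so the sketch only inspects the field after fitting rather than stating a result. The smaller data set is an illustrative choice to keep the search quick.

```r
library(superml)

set.seed(1)
# Three groups of 300 rows each, 5 features per row.
data <- rbind(replicate(5, rnorm(300, 2)),
              replicate(5, rnorm(300, -1)),
              replicate(5, rnorm(300, 5)))

km_model <- KMeansTrainer$new(clusters = 2, batch_size = 30, max_clusters = 6)
# Ask the trainer to search for the number of clusters within max_clusters.
km_model$fit(data, find_optimal = TRUE)

# Inspect the field afterwards; it may or may not differ from the initial 2.
km_model$clusters
```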
Method predict()
Usage
KMeansTrainer$predict(X)
Arguments
X: data.frame or matrix
Details
Returns the prediction on test data
Returns
a vector of predictions
Examples
library(superml)
data <- rbind(replicate(20, rnorm(1e4, 2)),
              replicate(20, rnorm(1e4, -1)),
              replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
# one predicted cluster per row of data
predictions <- km_model$predict(data)
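Prediction is not limited to the training data. The sketch below fits on one matrix and predicts on a separate one with the same columns, then tabulates the assignments. The names train and new_points are illustrative, and the sketch assumes the returned vector has one entry per input row.

```r
library(superml)

set.seed(1)
train <- rbind(replicate(5, rnorm(500, 2)),
               replicate(5, rnorm(500, -1)))
# Unseen points drawn from the same two groups, with the same 5 columns.
new_points <- rbind(replicate(5, rnorm(10, 2)),
                    replicate(5, rnorm(10, -1)))

km_model <- KMeansTrainer$new(clusters = 2, batch_size = 30)
km_model$fit(train)

pred <- km_model$predict(new_points)
length(pred)  # expected to match nrow(new_points)
table(pred)   # how many new points fall into each cluster
```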
Method clone()
The objects of this class are cloneable with this method.
Usage
KMeansTrainer$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
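R6 objects have reference semantics, so plain assignment creates an alias rather than a copy. The following sketch contrasts clone() with assignment; the variable names km_copy and km_alias are illustrative.

```r
library(superml)

km_model <- KMeansTrainer$new(clusters = 2, batch_size = 30)

# clone() yields an independent object: changing the copy leaves the original alone.
km_copy <- km_model$clone(deep = TRUE)
km_copy$clusters <- 3
km_model$clusters  # still 2

# Plain assignment only creates a second name for the same object.
km_alias <- km_model
km_alias$clusters <- 4
km_model$clusters  # now 4
```

Use deep = TRUE when the object holds other reference objects that should be copied as well.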