R: k-means clustering

idaKMeans {ibmdbR}

R Documentation

k-means clustering

Description

This function generates a k-means clustering model based on the contents of an IDA data frame (ida.data.frame).

Usage

idaKMeans(
	data,
	id, 
	k=3,
	maxiter=5,
	distance="euclidean",
	outtable=NULL,
	randseed=12345,
	statistics=NULL,
	modelname=NULL
)

## S3 method for class 'idaKMeans'
print(x,...)  
## S3 method for class 'idaKMeans'
predict(object, newdata, id,...)

Arguments

`data`	An IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row.
`id`	The name of the column that contains a unique ID for each row of the input data.
`k`	The number of clusters to be calculated.
`maxiter`	The maximum number of iterations to be used to calculate the k-means clusters. A larger number of iterations increases both the precision of the results and the amount of time required to calculate them.
`distance`	The distance function to be used. Set this to `"euclidean"` to use the squared Euclidean distance, or to `"norm_euclidean"` to use the normalized Euclidean distance.
`outtable`	The name of the output table that is to contain the results of the operation. When NULL is specified, a table name is generated automatically.
`randseed`	The seed for the random number generator.
`statistics`	Denotes which statistics to calculate. Allowed values are `"none"`, `"columns"` and `"all"`. If NULL, the default of the database system is used.
`modelname`	The name under which the model is stored in the database. This is the name that is specified when using functions such as `idaRetrieveModel` or `idaDropModel`.
`object`	An object of the class `idaKMeans` to be used for prediction, i.e. for applying it to new data.
`x`	An object of the class `idaKMeans` to be printed.
`newdata`	An IDA data frame that contains the data to which the model is applied.
`...`	Additional parameters to pass to the print or predict method.

Details

The idaKMeans function calculates the squared Euclidean distance between rows and groups them into clusters. Initial clusters are chosen randomly using a random seed, and the results are adjusted iteratively until either the maximum number of iterations is reached or two consecutive iterations return identical results. Variables with missing values are set to zero for the distance calculation.
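
The procedure described above can be illustrated with a small in-memory sketch in base R. This is not the in-database implementation: the function name `sketch_kmeans`, the use of randomly sampled rows as initial centers, and the comparison of cluster assignments as the "identical results" test are all assumptions made for illustration.

```r
# Local illustration only; sketch_kmeans is a hypothetical helper, not part of ibmdbR.
sketch_kmeans <- function(x, k = 3, maxiter = 5, randseed = 12345) {
  x <- as.matrix(x)
  x[is.na(x)] <- 0                     # missing values are set to zero
  set.seed(randseed)
  # Assumption: initial centers are k randomly chosen rows
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))
  iterations <- 0L
  for (i in seq_len(maxiter)) {
    iterations <- i
    # Squared Euclidean distance from every row to every center (n x k matrix)
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                 numeric(nrow(x)))
    new_assignment <- max.col(-d2, ties.method = "first")  # nearest center
    # Stop when two consecutive iterations give identical assignments
    if (identical(new_assignment, assignment)) break
    assignment <- new_assignment
    # Move each center to the mean of its current members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment,
       centers = centers,
       size = tabulate(assignment, nbins = k),
       iterations = iterations)
}

# Usage on the built-in iris data (numeric columns only)
fit <- sketch_kmeans(iris[, 1:4], k = 3, maxiter = 5)
fit$size
```

Because the seed is fixed, repeated calls with the same arguments return the same clustering, which mirrors the role of `randseed` in idaKMeans.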

Models are stored persistently in the database under the name modelname. Model names cannot have more than 64 characters and cannot contain white space. They need to be quoted like table names; otherwise they are converted to upper case by default. Only one model with a given name is allowed in the database at a time. If a model with the name modelname already exists, you need to drop it with idaDropModel before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel).
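
The drop-before-recreate rule above can be wrapped in a small helper. The following is a sketch only: `refit_kmeans` is a hypothetical function, not part of ibmdbR, and it assumes that idaDropModel signals an R error when no model with the given name exists.

```r
# Hypothetical helper, not part of ibmdbR: refit a model under a name
# that may already be in use.
refit_kmeans <- function(idf, id, modelname, ...) {
  # Assumption: idaDropModel() raises an error if the model does not exist.
  # That error is ignored here; note that this also hides other failures,
  # such as a lost database connection.
  tryCatch(ibmdbR::idaDropModel(modelname), error = function(e) NULL)
  # Create the model under the now-free name
  ibmdbR::idaKMeans(idf, id = id, modelname = modelname, ...)
}
```

With an open database connection, `km <- refit_kmeans(idf, id="ID", modelname="KMEANSMODEL")` would replace any existing model of that name, and `idaRetrieveModel("KMEANSMODEL")` would load it again in a later session.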

The output of the print function for an idaKMeans object is:

A vector containing a list of centers
A vector containing a list of cluster sizes
A vector containing a list of the number of elements in each cluster
A data frame or the name of the table containing the calculated cluster assignments
The within-cluster sum of squares (which indicates cluster density)
The names of the slots that are available in the idaKMeans object
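
To make the within-cluster sum of squares concrete, the sketch below computes it by hand for a local clustering and compares it with the value reported by base R's `stats::kmeans`, which is used here as a stand-in. That idaKMeans computes the figure in exactly the same way is an assumption.

```r
# Local illustration with stats::kmeans, not with idaKMeans
set.seed(12345)
x <- as.matrix(iris[, 1:4])
fit_local <- kmeans(x, centers = 3)

# For each cluster: sum of squared distances of its members to the cluster center
wss <- vapply(seq_len(3), function(j) {
  members <- x[fit_local$cluster == j, , drop = FALSE]
  sum(sweep(members, 2, fit_local$centers[j, ])^2)
}, numeric(1))

# The manual values agree with the ones stored in the fitted object
all.equal(wss, unname(fit_local$withinss))
```

Smaller values mean that the members of a cluster lie closer to its center, which is why the figure is read as an indicator of cluster density.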

Value

The idaKMeans function returns an object of classes idaKMeans and kmeans.

Examples

## Not run: 

#Create ida data frame
idf <- ida.data.frame("IRIS")

#Create a kmeans model stored in the database as KMEANSMODEL
km <- idaKMeans(idf, id="ID",modelname="KMEANSMODEL") 
	
#Print the model
print(km)

#Apply the model to the data to predict cluster assignments
pred <- predict(km,idf,id="ID")

#Inspect the predictions
head(pred)
	

## End(Not run)

[Package ibmdbR version 1.51.0 Index]