R: two step clustering

idaTwoStep {ibmdbR}

R Documentation

two step clustering

Description

This function generates a two step clustering model based on the contents of an IDA data frame (ida.data.frame).

Usage

idaTwoStep( data, id, k = 3, maxleaves = 1000, distance = "euclidean", outtable = NULL,
            randseed = 12345, statistics = NULL, maxk = 20, nodecapacity = 6,
            leafcapacity = 8, outlierfraction = 0, modelname = NULL)

## S3 method for class 'idaTwoStep'
print(x,...)  
## S3 method for class 'idaTwoStep'
predict(object, newdata, id,...)

Arguments

`data`	An IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row.
`id`	The name of the column that contains a unique ID for each row of the input data.
`k`	The number of clusters to be calculated.
`maxleaves`	The maximum number of leaf nodes in the initial clustering tree. When the tree contains maxleaves leaf nodes, the following data records are aggregated into clusters associated with the existing leaf nodes. This parameter is available for Db2 for z/OS only and ignored for Db2 Warehouse with integrated Spark.
`maxk`	The maximum number of clusters that can be determined automatically.
`nodecapacity`	The branching factor of the internal tree that is used in pass 1. Each node can have up to <nodecapacity> subnodes. This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS.

`leafcapacity`	The number of clusters per leaf node in the internal tree that is used in pass 1. This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS.
`outlierfraction`	The fraction of the records that are to be considered outliers in the internal tree that is used in pass 1. Clusters that contain fewer than <outlierfraction> times the mean number of data records per cluster are removed. This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS.
`distance`	The distance function that is to be used. This can be set to `"euclidean"`, which causes the squared Euclidean distance to be used, or `"norm_euclidean"`, which causes the normalized Euclidean distance to be used.
`outtable`	The name of the output table that is to contain the results of the operation. When NULL is specified, a table name is generated automatically.
`randseed`	The seed for the random number generator.
`statistics`	Denotes which statistics to calculate. Allowed values are `"none"`, `"columns"` and `"all"`. If NULL, the default of the database system will be used.
`modelname`	The name under which the model is stored in the database. This is the name that is specified when using functions such as `idaRetrieveModel` or `idaDropModel`.
`object`	An object of the class `idaTwoStep` to be used for prediction, i.e. for applying it to new data.
`x`	An object of the class `idaTwoStep` to be printed.
`newdata`	An IDA data frame that contains the data to which to apply the model.
`...`	Additional parameters to pass to the print or predict method.

Details

The idaTwoStep clustering function first distributes the input data into a hierarchical tree structure according to the distance between the data records, where each leaf node corresponds to a (small) cluster. It then reduces the tree by aggregating the leaf nodes according to the distance function until k clusters remain.
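The two passes described above can be imitated locally to build intuition. The following is a minimal sketch in base R, not the in-database implementation: it assumes that `kmeans` with many centers is an acceptable stand-in for the tree of leaf nodes built in pass 1, and it uses `hclust` with average linkage on squared Euclidean distances as one possible way to aggregate the leaves in pass 2. The variable names (`n_leaves`, `leaf_to_cluster`) and the choice of 20 leaves are illustrative only.

```r
# Illustrative two-pass clustering in base R.
# This is NOT how ibmdbR or the database computes the model.
set.seed(12345)
x <- as.matrix(iris[, 1:4])

# Pass 1 (stand-in for the tree build): split the records into many
# small pre-clusters, one per "leaf node".
n_leaves <- 20
pass1 <- kmeans(x, centers = n_leaves, nstart = 5, iter.max = 50)

# Pass 2: aggregate the leaf centers hierarchically and cut the
# resulting tree so that k clusters remain.
k <- 3
leaf_tree <- hclust(dist(pass1$centers)^2, method = "average")
leaf_to_cluster <- cutree(leaf_tree, k = k)

# Each record inherits the final cluster of the leaf it was assigned to.
cluster <- unname(leaf_to_cluster[pass1$cluster])
table(cluster)
```

Because pass 2 operates only on the leaf centers, its cost depends on the number of leaves rather than on the number of input records, which is the main appeal of a two-step approach for large tables.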

Models are stored persistently in the database under the name modelname. Model names cannot have more than 64 characters and cannot contain white space. They need to be quoted like table names; otherwise they are converted to upper case by default. Only one model with a given name is allowed in the database at a time. If a model with the name modelname already exists, you need to drop it with idaDropModel before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel).
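The naming rules above can be checked on the client before a model is created. The helper below is a hypothetical convenience function, not part of ibmdbR; it only tests the two stated constraints (at most 64 characters, no white space) and makes no assumption about how the database treats quoting.

```r
# Hypothetical helper (not part of ibmdbR): checks the naming rules
# stated above for a single model name.
is_valid_model_name <- function(name) {
  is.character(name) && length(name) == 1L && !is.na(name) &&
    nchar(name) >= 1L && nchar(name) <= 64L &&
    !grepl("[[:space:]]", name)
}

is_valid_model_name("TwoStepMODEL")    # TRUE
is_valid_model_name("two step model")  # FALSE: contains white space
is_valid_model_name(strrep("A", 65))   # FALSE: longer than 64 characters

# With an open database connection, an existing model of the same name
# would have to be dropped before it can be re-created, for example:
# idaDropModel("TwoStepMODEL")
# tsm <- idaTwoStep(idf, id = "ID", modelname = "TwoStepMODEL")
```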

The output of the print function for an idaTwoStep object is:

A vector containing a list of centers
A vector containing a list of cluster sizes
A vector containing a list of the number of elements in each cluster
A data frame or the name of the table containing the calculated cluster assignments
The within-cluster sum of squares (which indicates cluster density)
The names of the slots that are available in the idaTwoStep object

Value

The idaTwoStep function returns an object of classes idaTwoStep and TwoStep.

Examples

## Not run: 

#Create ida data frame
idf <- ida.data.frame("IRIS")

#Create a TwoStep model stored in the database as TwoStepMODEL
tsm <- idaTwoStep(idf, id="ID",modelname="TwoStepMODEL") 
	
#Print the model
print(tsm)

#Apply the model to the data to predict cluster assignments
pred <- predict(tsm,idf,id="ID")

#Inspect the predictions
head(pred)
	

## End(Not run)

[Package ibmdbR version 1.51.0 Index]