idaTwoStep {ibmdbR}    R Documentation

Two-step clustering

Description

This function generates a two-step clustering model based on the contents of an IDA data frame (ida.data.frame).

Usage

idaTwoStep( data, id, k = 3, maxleaves = 1000, distance = "euclidean", outtable = NULL,
            randseed = 12345, statistics = NULL, maxk = 20, nodecapacity = 6,
            leafcapacity = 8, outlierfraction = 0, modelname = NULL)

## S3 method for class 'idaTwoStep'
print(x,...)  
## S3 method for class 'idaTwoStep'
predict(object, newdata, id,...)  

Arguments

data

An IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row.

id

The name of the column that contains a unique ID for each row of the input data.

k

The number of clusters to be calculated.

maxleaves

The maximum number of leaf nodes in the initial clustering tree. Once the tree contains maxleaves leaf nodes, subsequent data records are aggregated into the clusters associated with the existing leaf nodes. This parameter is available for Db2 for z/OS only and ignored for Db2 Warehouse with integrated Spark.

maxk

The maximum number of clusters that can be determined automatically.

nodecapacity

The branching factor of the internal tree that is used in pass 1. Each node can have up to <nodecapacity> subnodes.

This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS.

leafcapacity

The number of clusters per leaf node in the internal tree that is used in pass 1. This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS.

outlierfraction

The fraction of the records that is to be treated as outliers in the internal tree that is used in pass 1. Clusters that contain fewer than <outlierfraction> times the mean number of data records per cluster are removed. This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS.

distance

The distance function that is to be used. This can be set to "euclidean", which causes the squared Euclidean distance to be used, or "norm_euclidean", which causes the normalized Euclidean distance to be used.

outtable

The name of the output table that is to contain the results of the operation. When NULL is specified, a table name is generated automatically.

randseed

The seed for the random number generator.

statistics

Specifies which statistics to calculate. Allowed values are "none", "columns", and "all". If NULL, the default of the database system is used.

modelname

The name under which the model is stored in the database. This is the name that is specified when using functions such as idaRetrieveModel or idaDropModel.

object

An object of the class idaTwoStep to be used for prediction, i.e. for applying it to new data.

x

An object of the class idaTwoStep to be printed.

newdata

An IDA data frame that contains the data to which the model is to be applied.

...

Additional parameters to pass to the print or predict method.

Details

The idaTwoStep clustering function first distributes the input data into a hierarchical tree structure according to the distance between the data records, where each leaf node corresponds to a (small) cluster. Then idaTwoStep reduces the tree by aggregating the leaf nodes according to the distance function until k clusters remain.
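
The two passes can be illustrated locally with base R. This is a conceptual sketch only, not the in-database algorithm: it assumes kmeans as a stand-in for building the tree of small leaf clusters and hclust with average linkage as a stand-in for the aggregation step, and it uses the built-in iris data set rather than an IDA data frame.

```r
# Conceptual sketch: mimics the two passes with functions from the stats package.
# The in-database implementation differs; kmeans and hclust are stand-ins.
set.seed(12345)
x <- as.matrix(iris[, 1:4])

# Pass 1 stand-in: compress the records into many small subclusters ("leaves")
pre <- kmeans(x, centers = 20, nstart = 5)

# Pass 2 stand-in: merge the subcluster centers agglomeratively,
# using the squared Euclidean distance, until k clusters remain
k <- 3
hc <- hclust(dist(pre$centers)^2, method = "average")
leaf_to_cluster <- cutree(hc, k = k)

# Each record inherits the final cluster of the subcluster it belongs to
final <- unname(leaf_to_cluster[pre$cluster])
table(final)
```

Working on the subcluster centers instead of the raw records is what keeps the second pass cheap: it sees 20 points here rather than 150.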

Models are stored persistently in the database under the name modelname. Model names cannot have more than 64 characters and cannot contain whitespace. They need to be quoted like table names; otherwise they are converted to upper case by default. Only one model with a given name is allowed in the database at a time. If a model named modelname already exists, you need to drop it with idaDropModel before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel).
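
The length and whitespace rules for model names can be checked before calling idaTwoStep. The helper below is hypothetical and not part of ibmdbR; it is a minimal sketch that covers only those two rules and does not check whether a model with that name already exists in the database.

```r
# Hypothetical helper, not part of ibmdbR: checks only the two naming rules
# stated above (at most 64 characters, no whitespace).
is_valid_model_name <- function(name) {
  is.character(name) && length(name) == 1 && !is.na(name) &&
    nchar(name) <= 64 &&
    !grepl("[[:space:]]", name)
}

is_valid_model_name("TwoStepMODEL")    # TRUE
is_valid_model_name("two step model")  # FALSE: contains whitespace
is_valid_model_name(strrep("A", 65))   # FALSE: longer than 64 characters
```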

The output of the print function for an idaTwoStep object is:

Value

The idaTwoStep function returns an object of classes idaTwoStep and TwoStep.

See Also

idaRetrieveModel, idaDropModel, idaListModels

Examples

## Not run: 

#Create ida data frame
idf <- ida.data.frame("IRIS")

#Create a TwoStep model stored in the database as TwoStepMODEL
tsm <- idaTwoStep(idf, id="ID", modelname="TwoStepMODEL")

#Print the model
print(tsm)

#Apply the model to the data to predict cluster assignments
pred <- predict(tsm,idf,id="ID")

#Inspect the predictions
head(pred)
	

## End(Not run)

[Package ibmdbR version 1.51.0 Index]