cluster.Gen {clusterSim}

R Documentation

Random cluster generation with known structure of clusters

Description

Generates random data sets with a known cluster structure, optionally with noisy variables and outliers.

Usage

cluster.Gen(numObjects=50, means=NULL, cov=NULL, fixedCov=TRUE,
                   model=1, dataType="m",numCategories=NULL, 
                   numNoisyVar=0, numOutliers=0, rangeOutliers=
                   c(1,10), inputType="csv2", inputHeader=TRUE, 
                   inputRowNames=TRUE, outputCsv="", outputCsv2="", 
                   outputColNames=TRUE, outputRowNames=TRUE)

Arguments

`numObjects`	number of objects in each cluster: a positive integer, or a vector of length nrow(means), e.g. `numObjects=c(50,20)`
`means`	matrix of cluster means (e.g. `means=matrix(c(0,8,0,8),2,2)`). If `means=NULL`, the matrix is read from the file means_<modelNumber>.csv
`cov`	covariance matrix, the same for each cluster (e.g. `cov=matrix(c(1, 0, 0, 1), 2, 2)`). If `cov=NULL`, the matrix is read from the file cov_<modelNumber>.csv. Note: this argument cannot be used to generate clusters with different covariance matrices. To do that, set `fixedCov` to `FALSE` and use an appropriate model
`model`	model number. `model=1` - no cluster structure; observations are simulated from a uniform distribution over the unit hypercube, with the number of dimensions (variables) given by the `numNoisyVar` argument. `model=2` - means and covariances are taken from the arguments `means` and `cov` (see Example 1). `model=3,4,...,20` - see the file $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf. `model=21,22,...` - `means` is read from means_<modelNumber>.csv; if `fixedCov=TRUE`, one covariance matrix for all clusters is read from cov_<modelNumber>.csv; if `fixedCov=FALSE`, a separate covariance matrix for each cluster is read from cov_<modelNumber>_<clusterNumber>.csv
`fixedCov`	if `fixedCov=TRUE`, the covariance matrix is the same for all clusters; if `fixedCov=FALSE`, each cluster is generated from a different covariance matrix (see `model`)
`dataType`	"m" - metric (ratio, interval), "o" - ordinal, "s" - symbolic interval
`numCategories`	number of categories (for ordinal data only): a positive integer, or a vector of length ncol(means) plus the number of noisy variables
`numNoisyVar`	number of noisy variables. For `model=1` this is the total number of variables
`numOutliers`	number of outliers (for metric and symbolic interval data only). A positive integer gives the number of outliers; a value from <0,1> gives the fraction of outliers in the whole data set
`rangeOutliers`	range for outliers (for metric and symbolic interval data only). The default range is [1, 10]. The outliers are generated independently for each variable, for the whole data set, from a uniform distribution. Each generated value is randomly either added to the maximum of the j-th variable or subtracted from the minimum of the j-th variable
`inputType`	"csv" - a dot as decimal point, or "csv2" - a comma as decimal point, in the means_<modelNumber>.csv and cov_<modelNumber>.csv files
`inputHeader`	`inputHeader=TRUE` indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a header row
`inputRowNames`	`inputRowNames=TRUE` indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a first column with row names or with the number of objects (positive integer values)
`outputCsv`	optional, name of the csv file with the generated data (the first column contains the id, the second the cluster number and the others the data)
`outputCsv2`	optional, name of the csv file (a comma as decimal point and a semicolon as field separator) with the generated data (the first column contains the id, the second the cluster number and the others the data)
`outputColNames`	`outputColNames=TRUE` indicates that the output file (given by the `outputCsv` or `outputCsv2` argument) contains a first row with column names
`outputRowNames`	`outputRowNames=TRUE` indicates that the output file (given by the `outputCsv` or `outputCsv2` argument) contains a column of row names
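
The outlier rule described under `rangeOutliers` can be illustrated in base R. This is a minimal sketch under stated assumptions, not the clusterSim implementation: the helper `make_outliers` and all variable names are hypothetical, and the choice between adding to the maximum and subtracting from the minimum is assumed to be made with equal probability for each value.

```r
# Base-R illustration only; not taken from the clusterSim source code.
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)  # stand-in for generated cluster data
num_outliers   <- 8                    # plays the role of numOutliers
range_outliers <- c(1, 10)             # plays the role of rangeOutliers

# Hypothetical helper: one row per outlier, one column per variable.
make_outliers <- function(x, n, rng) {
  out <- matrix(NA_real_, nrow = n, ncol = ncol(x))
  for (j in seq_len(ncol(x))) {
    # offsets drawn independently for each variable from a uniform distribution
    offset <- runif(n, min = rng[1], max = rng[2])
    # assumption: "randomly" means a fair coin flip per value
    above <- sample(c(TRUE, FALSE), n, replace = TRUE)
    out[, j] <- ifelse(above, max(x[, j]) + offset, min(x[, j]) - offset)
  }
  out
}

outliers <- make_outliers(x, num_outliers, range_outliers)
```

Every value in column j of `outliers` lies at least `range_outliers[1]` above the maximum or below the minimum of column j of `x`.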

Details

See file $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf for further details
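
For `model=21,22,...` the input files named in the Arguments section have to exist before the call. The following base-R sketch writes a means file and a common covariance file in the "csv2" layout (comma as decimal point, semicolon as field separator, header row, first column with row names). The model number 24, the matrix values and the use of a temporary directory are illustrative assumptions, and the exact layout expected by `cluster.Gen` should be checked against clusterGen_details.pdf.

```r
# Hypothetical model number and values, chosen only for illustration.
model_number <- 24
means <- matrix(c(0, 8, 0, 8), 2, 2)  # one row per cluster
cov   <- matrix(c(1, 0, 0, 1), 2, 2)  # common covariance matrix

# A temporary directory keeps the sketch self-contained.
out_dir    <- tempdir()
means_file <- file.path(out_dir, paste0("means_", model_number, ".csv"))
cov_file   <- file.path(out_dir, paste0("cov_", model_number, ".csv"))

# write.csv2 uses a comma as decimal point and a semicolon as separator;
# row.names = TRUE adds a first column with row names, and a header row
# is always written.
write.csv2(means, means_file, row.names = TRUE)
write.csv2(cov, cov_file, row.names = TRUE)

# Read the means back to confirm the round trip.
means_back <- as.matrix(read.csv2(means_file, row.names = 1))
```

To use such files they would need to sit in the working directory (see Example 6), with `inputType`, `inputHeader` and `inputRowNames` set to match how the files were written.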

Value

`clusters`	cluster number for each object. For `model=1` each object belongs to its own cluster, so this variable contains the object numbers
`data`	generated data. For metric and ordinal data: a matrix with objects in rows and variables in columns. For symbolic interval data: a three-dimensional structure in which the first dimension represents the object number, the second the variable number, and the third contains the lower and upper bounds of the intervals
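
The three-dimensional layout described for symbolic interval data can be mocked up with a base-R array. This sketch only shows how such a structure is indexed. The names are hypothetical, and building the intervals from a centre and a half-width is an assumption made to fill the array, not a description of how `cluster.Gen` generates intervals.

```r
# Base-R mock-up of the described structure; not output of cluster.Gen.
set.seed(1)
n_objects <- 4
n_vars    <- 2

centers    <- matrix(rnorm(n_objects * n_vars), n_objects, n_vars)
half_width <- matrix(runif(n_objects * n_vars, 0.1, 0.5), n_objects, n_vars)

# dimensions: object number, variable number, (lower bound, upper bound)
intervals <- array(NA_real_, dim = c(n_objects, n_vars, 2))
intervals[, , 1] <- centers - half_width  # lower bounds
intervals[, , 2] <- centers + half_width  # upper bounds

lower <- intervals[, , 1]       # objects x variables matrix of lower bounds
upper <- intervals[, , 2]       # objects x variables matrix of upper bounds
one_interval <- intervals[3, 2, ]  # interval of object 3 on variable 2
```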

Author(s)

Marek Walesiak marek.walesiak@ue.wroc.pl, Andrzej Dudek andrzej.dudek@ue.wroc.pl

Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland

References

Billard, L., Diday, E. (2006), Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester. ISBN: 978-0-470-09016-9.

Qiu, W., Joe, H. (2006), Generation of random clusters with specified degree of separation, "Journal of Classification", vol. 23, 315-334. Available at: doi:10.1007/s00357-006-0018-y.

Steinley, D., Henson, R. (2005), OCLUS: an analytic method for generating clusters with known overlap, "Journal of Classification", vol. 22, 221-250. Available at: doi:10.1007/s00357-005-0015-6.

Walesiak, M., Dudek, A. (2008), Identification of noisy variables for nonmetric and symbolic data in cluster analysis, In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Eds.), Data analysis, machine learning and applications, Springer-Verlag, Berlin, Heidelberg, 85-92.

Walesiak, M. (2016), Uogólniona miara odległości GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R. Wydanie 2 poprawione i rozszerzone [The Generalized Distance Measure GDM in multivariate statistical analysis with R], Wydawnictwo Uniwersytetu Ekonomicznego, Wroclaw.

Examples



# Example 1
library(clusterSim)
means <- matrix(c(0,7,0,7),2,2)
cov <- matrix(c(1,0,0,1),2,2)
grnd <- cluster.Gen(numObjects=60,means=means,cov=cov,model=2,
numOutliers=8)
colornames <- c("red","blue","green")
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 2
library(clusterSim)
grnd <- cluster.Gen(50,model=4,dataType="m",numNoisyVar=2)
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 3
library(clusterSim)
grnd<-cluster.Gen(50,model=4,dataType="o",numCategories=7, numNoisyVar=2)
plotCategorial(grnd$data,,grnd$clusters,ask=TRUE)

# Example 4 (1 nonnoisy variable and 2 noisy variables, 3 clusters)
library(clusterSim)
grnd <- cluster.Gen(c(40,60,20), model=2, means=c(2,14,25),
cov=c(1.5,1.5,1.5),numNoisyVar=2)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 5
library(clusterSim)
grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
fixedCov=FALSE, numOutliers=0.1)
# or 
#grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
#fixedCov=FALSE, numOutliers=0.1, outputCsv2="data14.csv")
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green","brown","black")
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 6 (this example needs the files means_24.csv
# and cov_24.csv to be placed in the working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(50,80,20),model=24,dataType="m",numNoisyVar=1, 
# numOutliers=10, rangeOutliers=c(1,5))
# print(grnd)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","brown")
# grnd$clusters[grnd$clusters==0]<-length(colornames)
# plot(data,col=colornames[grnd$clusters],ask=TRUE)

# Example 7 (this example needs the files means_25.csv, cov_25_1.csv,
# cov_25_2.csv, cov_25_3.csv, cov_25_4.csv and cov_25_5.csv
# to be placed in the working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(40,30,20,35,45),model=25,numNoisyVar=3,fixedCov=FALSE)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","magenta","brown")
# plot(data,col=colornames[grnd$clusters],ask=TRUE)

[Package clusterSim version 0.51-4 Index]