cluster.Gen {clusterSim}    R Documentation

Random cluster generation with known structure of clusters

Description

Generates random data sets with a known cluster structure, optionally with noisy variables and outliers

Usage

cluster.Gen(numObjects=50, means=NULL, cov=NULL, fixedCov=TRUE,
            model=1, dataType="m", numCategories=NULL,
            numNoisyVar=0, numOutliers=0, rangeOutliers=c(1,10),
            inputType="csv2", inputHeader=TRUE, inputRowNames=TRUE,
            outputCsv="", outputCsv2="",
            outputColNames=TRUE, outputRowNames=TRUE)

Arguments

numObjects

number of objects in each cluster - a positive integer value or a vector of length nrow(means), e.g. numObjects=c(50,20)

means

matrix of cluster means (e.g. means=matrix(c(0,8,0,8),2,2)). If means=NULL, the matrix is read from the file means_<modelNumber>.csv

cov

covariance matrix, the same for each cluster (e.g. cov=matrix(c(1, 0, 0, 1), 2, 2)). If cov=NULL, the matrix is read from the file cov_<modelNumber>.csv. Note: this argument cannot be used to generate clusters with different covariance matrices. To do that, set fixedCov to FALSE and use an appropriate model

model

model number. model=1 - no cluster structure; observations are simulated from a uniform distribution over the unit hypercube, with the number of dimensions (variables) given by the numNoisyVar argument;

model=2 - means and covariances are taken from the arguments means and cov (see Example 1);

model=3,4,...,20 - see the file $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf;

model=21,22,... - if fixedCov=TRUE, means are read from means_<modelNumber>.csv and the covariance matrix common to all clusters is read from cov_<modelNumber>.csv; if fixedCov=FALSE, means are read from means_<modelNumber>.csv and a separate covariance matrix is read for each cluster from cov_<modelNumber>_<clusterNumber>.csv

fixedCov

if fixedCov=TRUE, the covariance matrix is the same for all clusters; if fixedCov=FALSE, each cluster is generated from a different covariance matrix (see model)

dataType

"m" - metric (ratio, interval), "o" - ordinal, "s" - symbolic interval

numCategories

number of categories (for ordinal data only). Positive integer value or vector with the same size as ncol(means) plus number of noisy variables.

numNoisyVar

number of noisy variables. For model=1 this is the number of variables

numOutliers

number of outliers (for metric and symbolic interval data only). A positive integer gives the number of outliers; a value from <0,1> gives the proportion of outliers in the whole data set

rangeOutliers

range for outliers (for metric and symbolic interval data only). The default range is [1, 10]. The outliers are generated independently for each variable, for the whole data set, from a uniform distribution. The generated values are randomly added to the maximum of the j-th variable or subtracted from the minimum of the j-th variable

inputType

"csv" - a dot as decimal point or "csv2" - a comma as decimal point in the means_<modelNumber>.csv and cov_<modelNumber>.csv files

inputHeader

inputHeader=TRUE indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a header row

inputRowNames

inputRowNames=TRUE indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a first column with row names or with number of objects (positive integer values)

outputCsv

optional, name of csv file with generated data (first column contains id, second - number of cluster and others - data)

outputCsv2

optional, name of csv (a comma as decimal point and a semicolon as field separator) file with generated data (first column contains id, second - number of cluster and others - data)

outputColNames

outputColNames=TRUE indicates that the output file (given by the outputCsv and outputCsv2 parameters) contains a first row with column names

outputRowNames

outputRowNames=TRUE indicates that the output file (given by the outputCsv and outputCsv2 parameters) contains a vector of row names
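
The outlier mechanism described under rangeOutliers can be illustrated with a short base R sketch. It is not the clusterSim implementation: the helper make_outliers, the toy data x and the 50/50 choice between adding and subtracting are assumptions made only for illustration.

```r
# Illustrative sketch only; NOT the code used inside cluster.Gen.
set.seed(1)
x <- matrix(rnorm(40), nrow = 20, ncol = 2)  # toy data: 20 objects, 2 variables
numOutliers   <- 3
rangeOutliers <- c(1, 10)

# Hypothetical helper: builds n outlying objects for the data matrix x.
make_outliers <- function(x, n, rng) {
  out <- matrix(NA_real_, nrow = n, ncol = ncol(x))
  for (j in seq_len(ncol(x))) {
    # offsets are drawn independently for each variable
    offset <- runif(n, min = rng[1], max = rng[2])
    # assumption: adding and subtracting are equally likely
    above <- sample(c(TRUE, FALSE), n, replace = TRUE)
    out[, j] <- ifelse(above, max(x[, j]) + offset, min(x[, j]) - offset)
  }
  out
}

outliers <- make_outliers(x, numOutliers, rangeOutliers)
print(outliers)
```

With rangeOutliers=c(1,10), every generated value in this sketch lies at least 1 and at most 10 units outside the observed range of its variable.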

Details

See file $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf for further details
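
For models 21 and above, the means and covariance matrices come from csv files. The following base R sketch shows one way such files might be prepared for inputType="csv2", inputHeader=TRUE and inputRowNames=TRUE. The model number 21, the matrix values and the row and column labels are hypothetical, and the exact layout cluster.Gen expects should be checked against clusterGen_details.pdf. The files are written to a temporary directory here; the examples indicate that cluster.Gen expects them in the working directory.

```r
# Hypothetical model number and values, chosen only for illustration.
modelNumber <- 21
means <- matrix(c(0, 8, 0, 8), 2, 2,
                dimnames = list(c("cl1", "cl2"), c("v1", "v2")))
cov <- matrix(c(1, 0.5, 0.5, 1), 2, 2,
              dimnames = list(c("v1", "v2"), c("v1", "v2")))

outDir    <- tempdir()
meansFile <- file.path(outDir, paste0("means_", modelNumber, ".csv"))
covFile   <- file.path(outDir, paste0("cov_", modelNumber, ".csv"))

# write.csv2 uses a comma as decimal point and a semicolon as separator.
# The header row and the row-name column are assumed to correspond to
# inputHeader=TRUE and inputRowNames=TRUE.
write.csv2(means, meansFile, row.names = TRUE)
write.csv2(cov, covFile, row.names = TRUE)

# Read the files back to confirm that the values survive the round trip.
meansBack <- as.matrix(read.csv2(meansFile, header = TRUE, row.names = 1))
covBack   <- as.matrix(read.csv2(covFile, header = TRUE, row.names = 1))
print(meansBack)
print(covBack)
```

For inputType="csv", write.csv and read.csv (dot as decimal point) would be the base R counterparts.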

Value

clusters

cluster number for each object; for model=1 each object belongs to its own cluster, so this variable contains the object numbers

data

generated data: for metric and ordinal data - a matrix with objects in rows and variables in columns; for symbolic interval data - a three-dimensional structure: the first dimension represents the object number, the second the variable number, and the third contains the lower and upper bounds of the intervals
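
The three-dimensional layout for symbolic interval data can be explored with ordinary array indexing. The sketch below builds a hypothetical stand-in (sym) rather than calling cluster.Gen, and assumes that the returned structure behaves like a base R array with the lower bound at index 1 and the upper bound at index 2 of the third dimension.

```r
# Hypothetical stand-in for grnd$data when dataType="s":
# 4 objects, 2 variables, bounds stored in the third dimension.
set.seed(1)
centers <- matrix(rnorm(8), nrow = 4, ncol = 2)
sym <- array(NA_real_, dim = c(4, 2, 2))
sym[, , 1] <- centers - 0.5  # lower bounds (assumed position)
sym[, , 2] <- centers + 0.5  # upper bounds (assumed position)

# Interval of variable 2 for object 1: c(lower, upper)
print(sym[1, 2, ])

# Interval widths and midpoints as objects-by-variables matrices
widths    <- sym[, , 2] - sym[, , 1]
midpoints <- (sym[, , 1] + sym[, , 2]) / 2
print(widths)
```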

Author(s)

Marek Walesiak marek.walesiak@ue.wroc.pl, Andrzej Dudek andrzej.dudek@ue.wroc.pl

Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/clusterSim/

References

Billard, L., Diday, E. (2006), Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester. ISBN: 978-0-470-09016-9.

Qiu, W., Joe, H. (2006), Generation of random clusters with specified degree of separation, "Journal of Classification", vol. 23, 315-334. Available at: doi:10.1007/s00357-006-0018-y.

Steinley, D., Henson, R. (2005), OCLUS: an analytic method for generating clusters with known overlap, "Journal of Classification", vol. 22, 221-250. Available at: doi:10.1007/s00357-005-0015-6.

Walesiak, M., Dudek, A. (2008), Identification of noisy variables for nonmetric and symbolic data in cluster analysis, In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Eds.), Data analysis, machine learning and applications, Springer-Verlag, Berlin, Heidelberg, 85-92. Available at: http://keii.ue.wroc.pl/pracownicy/mw/2008_Walesiak_Dudek_Springer.pdf.

Walesiak, M. (2016), Uogólniona miara odległości GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R. Wydanie 2 poprawione i rozszerzone [The Generalized Distance Measure GDM in multivariate statistical analysis with R], Wydawnictwo Uniwersytetu Ekonomicznego, Wroclaw. Available at: http://keii.ue.wroc.pl/pracownicy/mw/2016_Walesiak_Uogolniona_miara_odleglosci_GDM.pdf.

Examples



# Example 1
library(clusterSim)
means <- matrix(c(0,7,0,7),2,2)
cov <- matrix(c(1,0,0,1),2,2)
grnd <- cluster.Gen(numObjects=60,means=means,cov=cov,model=2,
numOutliers=8)
colornames <- c("red","blue","green")
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 2
library(clusterSim)
grnd <- cluster.Gen(50,model=4,dataType="m",numNoisyVar=2)
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 3
library(clusterSim)
grnd<-cluster.Gen(50,model=4,dataType="o",numCategories=7, numNoisyVar=2)
plotCategorial(grnd$data,,grnd$clusters,ask=TRUE)

# Example 4 (1 nonnoisy variable and 2 noisy variables, 3 clusters)
library(clusterSim)
grnd <- cluster.Gen(c(40,60,20), model=2, means=c(2,14,25),
cov=c(1.5,1.5,1.5),numNoisyVar=2)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 5
library(clusterSim)
grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
fixedCov=FALSE, numOutliers=0.1)
# or 
#grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
#fixedCov=FALSE, numOutliers=0.1, outputCsv2="data14.csv")
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green","brown","black")
grnd$clusters[grnd$clusters==0]<-length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 6 (this example needs files means_24.csv
# and cov_24.csv to be placed in working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(50,80,20),model=24,dataType="m",numNoisyVar=1, 
# numOutliers=10, rangeOutliers=c(1,5))
# print(grnd)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","brown")
# grnd$clusters[grnd$clusters==0]<-length(colornames)
# plot(data,col=colornames[grnd$clusters],ask=TRUE)

# Example 7 (this example needs files means_25.csv, cov_25_1.csv,
# cov_25_2.csv, cov_25_3.csv, cov_25_4.csv and cov_25_5.csv
# to be placed in working directory)
# library(clusterSim)
# grnd<-cluster.Gen(c(40,30,20,35,45),model=25,numNoisyVar=3,fixedCov=FALSE)
# data <- as.data.frame(grnd$data)
# colornames<-c("red","blue","green","magenta","brown")
# plot(data,col=colornames[grnd$clusters],ask=TRUE)

[Package clusterSim version 0.51-3 Index]