R: Probabilistic Distance Clustering Adjusted for Cluster Size

PDQ {FPDclustering}

R Documentation

Probabilistic Distance Clustering Adjusted for Cluster Size

Description

An implementation of probabilistic distance clustering adjusted for cluster size (PDQ), a probabilistic distance clustering algorithm that optimizes the PD-clustering criterion. The algorithm can be used on continuous, count, or mixed-type data by choosing Euclidean, chi-square, or Gower as the dissimilarity measure.
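
A minimal sketch of the idea behind the criterion, assuming the principle stated in the Iyigun and Ben-Israel reference: for each point, the product of membership probability and distance, divided by cluster size, is the same for every cluster. Under that assumption the probabilities are proportional to size over distance. The toy distances `d` and sizes `q` below are made up for illustration, and this base R code is not the package's internal implementation.

```r
# Toy distances from 3 points (rows) to 2 cluster centers (columns).
d <- matrix(c(1, 4,
              2, 2,
              5, 1), ncol = 2, byrow = TRUE)

# Hypothetical cluster sizes (as proportions); fixed here, not estimated.
q <- c(0.5, 0.5)

# Assumed principle: p_k(x) * d_k(x) / q_k is constant over k for each x,
# so p_k(x) is proportional to q_k / d_k(x).
ratio <- sweep(1 / d, 2, q, `*`)   # q_k / d_k(x) for every point and cluster
p <- ratio / rowSums(ratio)        # normalize each row to sum to 1

# Each row of p * d / q should now hold a single repeated value.
const <- sweep(p * d, 2, q, `/`)
print(round(p, 3))
print(round(const, 3))
```

A point that is close to one center and far from the other receives a probability near 1 for the closer cluster; a point equidistant from both centers, with equal sizes, receives 0.5 for each.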

Usage

PDQ(data=NULL,k=2,ini='kmd',dist='euc',cent=NULL,
ord=NULL,cat=NULL,bin=NULL,cont=NULL,w=NULL)

Arguments

`data`	A matrix or data frame such that rows correspond to observations and columns correspond to variables.
`k`	A numerical parameter giving the number of clusters.
`ini`	A parameter that selects the center starts. Options available are random ("random"), k-medoids ("kmd", the default), center ("center", the user supplies the centers), and k-modes ("kmode", for categorical data sets).
`dist`	A parameter that selects the distance measure used. Options available are Euclidean ("euc"), Gower ("gower"), and chi-square ("chi").
`cent`	User-supplied centers, used when ini is set to "center".
`ord`	Column indices of the data matrix indicating which columns are ordinal variables.
`cat`	Column indices of the data matrix indicating which columns are categorical variables.
`bin`	Column indices of the data matrix indicating which columns are binary variables.
`cont`	Column indices of the data matrix indicating which columns are continuous variables.
`w`	A numerical vector with one entry per column of the data, containing the variable weights used with the Gower distance; equal weights by default.
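
As an illustration of how the column-index arguments and the weight vector fit together for the Gower distance, the sketch below builds them for a hypothetical six-column data set (four continuous columns followed by two categorical ones). The column layout and the weight values are assumptions chosen for the example; the final call is left as a comment and uses only the arguments listed above.

```r
# Hypothetical layout: 6 variables, columns 1-4 continuous, 5-6 categorical.
n_var <- 6
cont_idx <- 1:4
cat_idx  <- 5:6

# One weight per column; here the two categorical columns count double.
w <- rep(1, n_var)
w[cat_idx] <- 2

# Sanity checks: every column is assigned to exactly one type,
# and there is exactly one weight per column.
stopifnot(length(w) == n_var)
stopifnot(!anyDuplicated(c(cont_idx, cat_idx)))
stopifnot(setequal(c(cont_idx, cat_idx), seq_len(n_var)))

# With a data matrix x of 6 columns, the call would then look like:
# PDQ(data = x, k = 2, dist = "gower", cont = cont_idx, cat = cat_idx, w = w)
```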

Value

A list of class FPDclustering with components

`label`	A vector of integers indicating the cluster membership for each unit
`centers`	A matrix of cluster centers
`probability`	A matrix of probability of each point belonging to each cluster
`JDF`	The value of the Joint distance function
`iter`	The number of iterations
`jdfvector`	A vector collecting the JDF value computed at each iteration
`data`	The data set
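
The sketch below shows one way to turn a matrix of membership probabilities into hard labels, by assigning each unit to its most probable cluster. It uses a made-up probability matrix in base R; that the returned `label` component follows this same rule is an assumption, not something stated above.

```r
# Made-up membership probabilities: 3 units (rows), 2 clusters (columns).
prob <- matrix(c(0.9, 0.1,
                 0.3, 0.7,
                 0.6, 0.4), ncol = 2, byrow = TRUE)

# Assign each unit to the cluster with the highest probability.
# ties.method = "first" makes the result deterministic when rows tie.
hard_label <- max.col(prob, ties.method = "first")
print(hard_label)
```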

Author(s)

Cristina Tortora and Noe Vidales

References

Iyigun, Cem, and Adi Ben-Israel. Probabilistic distance clustering adjusted for cluster size. Probability in the Engineering and Informational Sciences 22.4 (2008): 603-621. doi.org/10.1017/S0269964808000351.

Tortora and Palumbo. Clustering mixed-type data using a probabilistic distance algorithm. submitted.

Examples


# Mixed-type data
library(FPDclustering)  # provides PDQ() and Silh()
library(mvtnorm)        # provides rmvnorm()

sig=matrix(0.7,4,4)
diag(sig)=1 # create a correlation matrix
x1=rmvnorm(200,c(0,0,3,3)) # cluster 1
x2=rmvnorm(200,c(4,4,6,6),sigma=sig) # cluster 2
l=c(rep(1,200),rep(2,200)) # true cluster labels
x1=cbind(x1,rbinom(200,4,0.2),rbinom(200,4,0.2)) # add two categorical variables
x2=cbind(x2,rbinom(200,4,0.7),rbinom(200,4,0.7))
x=rbind(x1,x2) # data set with 2 clusters

#### Performing PDQ
pdq_class<-PDQ(data=x,k=2, ini="random", dist="gower", cont= 1:4, cat = 5:6)

###Output
table(l,pdq_class$label)
plot(pdq_class)
summary(pdq_class)



### Continuous data example
# Gaussian generated data, no overlap
x<-rmvnorm(100, mean=c(1,5,10), sigma=diag(1,3))
y<-rmvnorm(100, mean=c(4,8,13), sigma=diag(1,3))
data<-rbind(x,y)

#### Performing PDQ
pdq1=PDQ(data,2,ini="random",dist="euc")
table(rep(c(2,1),each=100),pdq1$label)
Silh(pdq1$probability)
plot(pdq1)
summary(pdq1)


# Gaussian generated data, with overlap
x2<-rmvnorm(100, mean=c(1,5,10), sigma=diag(1,3))
y2<-rmvnorm(100, mean=c(2,6,11), sigma=diag(1,3))
data2<-rbind(x2,y2)

#### Performing PDQ
pdq2=PDQ(data2,2,ini="random",dist="euc")
table(rep(c(1,2),each=100),pdq2$label)
plot(pdq2)
summary(pdq2)

[Package FPDclustering version 2.3.1 Index]