CoClust {CoClust} | R Documentation |
Copula-Based Clustering Algorithm
Description
Cluster analysis based on copula functions
Usage
CoClust(m, dimset = 2:5, noc = 4, copula = "frank", fun = median,
method.ma = c("empirical", "pseudo"), method.c = c("ml", "mpl", "irho", "itau"),
dfree = NULL, writeout = 5, penalty = c("BICk", "AICk", "LL"), ...)
Arguments
m |
a data matrix. |
dimset |
the set of dimensions for which the function tries the clustering. |
noc |
sample size of the set for selecting the number of clusters. |
copula |
a copula model. This should be one of "normal", "t", "frank", "clayton" and "gumbel". See the Details section. |
fun |
combination function of the pairwise Spearman's rho used to select the k-plets. The default is |
method.ma |
estimation method for margins. See the Details section. |
method.c |
estimation method for copula. See |
dfree |
degrees of freedom for the t copula. |
writeout |
writes a message on the number of allocated observations every writeout observations. |
penalty |
Specifies the likelihood criterion used for selecting the number of clusters. |
... |
further parameters for |
Details
- Usage for Frank copula:
-
CoClust(m, nmaxmarg = 2:5, noc = 4, copula = "frank", fun = median, method.ma=c("gaussian","empirical"), method.c = "mpl", penalty ="BICk", ...)
CoClust is a clustering algorithm that, being based on copula functions, allows to group observations according to the multivariate dependence structure of the generating process without any assumptions on the margins.
For each k in dimset
the algorithm builds a sample of noc
observations (rows of the data matrix m
) by using the matrix of Spearman's rho correlation coefficients which are combined by means of the function fun
(median
by default).
The number of clusters K is selected by means of a criterion based on the likelihood of the copula fit. The switch penalty
allows to select 3 different criteria; The choice LL
corresponds to using
the likelihood without penalty terms. Then, the remaining observations are allocated to the clusters as follows:
1. selects a K-plet of observations on the basis of fun
applied to the pairwise Spearman's rho; 2. allocates or discards the K-plet on the basis of the likelihood of the copula fit.
The estimation approach for the copula fit is semiparametric: a range of nonparametric margins and parametric copula models can be selected by the user. The CoClust algorithm does not require to set a priori the number of clusters nor it needs a starting classification.
Notice that the dependence structure for the Gaussian and the t copula is set to exchangeable. Non structured dependence structures will be allowed in a future version.
Value
An object of S4 class "CoClust", which is a list with the following elements:
Number.of.Clusters |
the number K of identified clusters. | |||||||||
Index.Matrix |
a n.obs by (K+1) matrix where n.obs is the number of observations put in each cluster. The matrix contains the row indexes of the observations of the data matrix | |||||||||
Data.Clusters |
the matrix of the final clustering. | |||||||||
Dependence |
a list containing:
| |||||||||
LogLik |
the maximized log-likelihood copula fit. | |||||||||
Est.Method |
the estimation method used for the copula fit. | |||||||||
Opt.Method |
the optimization method used for the copula fit. | |||||||||
LLC |
the value of the LogLikelihood Criterion for each k in | |||||||||
Index.dimset |
a list that, for each k in |
Note
The final clustering is composed of K groups in which observations of the same group are independent whereas the observations that belong to different groups and that form a K-plet are dependent.
Author(s)
Francesca Marta Lilja Di Lascio <marta.dilascio@unibz.it>,
Simone Giannerini <simone.giannerini@unibo.it>
References
Di Lascio, F.M.L. (201x). "CoClust: An R Package for Copula-based Cluster Analysis". To be submitted.
Di Lascio, F.M.L., Durante, F. and Pappada', R. (2017). "Copula-based clustering methods", Copulas and Dependence Models with Applications, p.49-67. Eds Ubeda-Flores, M., de Amo, E., Durante, F. and Fernandez Sanchez, J., Springer International Publishing. ISBN: 978-3-319-64220-8.
Di Lascio, F.M.L. and Disegna, M. (2017). "A copula-based clustering algorithm to analyse EU country diets". Knowledge-Based Systems, 132, p.72-84. DOI: 10.1016/j.knosys.2017.06.004.
Di Lascio, F.M.L. and Giannerini, S. (2016). "Clustering dependent observations with copula functions". Statistical Papers, p.1-17. DOI 10.1007/s00362-016-0822-3.
Di Lascio, F.M.L. and Giannerini, S. (2012). "A Copula-Based Algorithm for Discovering Patterns of Dependent Observations", Journal of Classification, 29(1), p.50-75.
Di Lascio, F.M.L. (2008). "Analyzing the dependence structure of microarray data: a copula-based approach". PhD thesis, Dipartimento di Scienze Statistiche, Universita' di Bologna, Italy.
Examples
## ******************************************************************
## 1. builds a 3-variate copula with different margins
## (Gaussian, Gamma, Beta)
##
## 2. generates a data matrix xm with 15 rows and 21 columns and
## builds the matrix of the true cluster indexes
##
## 3. applies the CoClust to the rows of xm and recovers the
## multivariate dependence structure of the data
## ******************************************************************
## Step 1. **********************************************************
n <- 105 # total number of observations
n.col <- 21 # number of columns of the data matrix m
n.marg <- 3 # dimension of the copula
n.row <- n*n.marg/n.col # number of rows of the data matrix m
theta <- 10
copula <- frankCopula(theta, dim = n.marg)
mymvdc <- mvdc(copula, c("norm", "gamma", "beta"),list(list(mean=7, sd=2),
list(shape=3, rate=4), list(shape1=2, shape2=1)))
## Step 2. **********************************************************
set.seed(11)
x.samp <- rMvdc(n, mymvdc)
xm <- matrix(x.samp, nrow = n.row, ncol = n.col, byrow=TRUE)
index.true <- matrix(1:15,5,3)
colnames(index.true) <- c("Cluster 1","Cluster 2", "Cluster 3")
## Step 3. **********************************************************
clust <- CoClust(xm, dimset = 2:4, noc=2, copula="frank",
method.ma="empirical", method.c="ml",writeout=1)
clust
clust@"Number.of.Clusters"
clust@"Dependence"$Param
clust@"Data.Clusters"
index.clust <- clust@"Index.Matrix"
## compare with index.true
index.clust
index.true
##