wassersteinCluster {fdWasserstein}R Documentation

Soft clustering of covariance operators.

Description

Computes the soft cluster solutions for different values of the number of clusters K.

Usage

wassersteinCluster(data, grp, 
                   kmin = 2, kmax = 10, 
                   E = -0.75 * (0.95 * log(0.95) + 
                        0.05 * log(0.05)) + 0.25 * log(2), 
                   nstart = 5, nrefine = 5, ntry = 0, 
                   max.iter = 20, tol = 0.001, 
                   nreduced = length(unique(grp)), 
                   nperm = 0, 
                   add.sigma = FALSE, 
                   use.future = FALSE, verbose = TRUE)

trimmedAverageSilhouette(a, plot = TRUE)

Arguments

data

A N times M matrix containing the N sample curves; M denotes the number of points of the grid on which the curves are available.

grp

A vector or factor of length N; a covariance operator is estimated for each level of grp.

kmin, kmax

A pair of integer defining the desired number of clusters. A solution is computed for K=kmin,...,kmax.

E

The desired average entropy.

nstart, nrefine, ntry

The integers used during the initialization search. If ntry=0, then 'ntry' is set to 'round(1+N/K)'.

max.iter

Maximum number of block descend iterations.

tol

Iterations stop when the relative decrease of the objective function in two consecutive iterations is less than 'tol'.

nreduced

The number of covariances used to estimate the cluster barycenters.

nperm

The number of permutation used to approximate the reference distribution of max TASW.

add.sigma

Should the sample covariances be returned?

use.future

Use or not use package 'future' to parallelize the computation? See note.

verbose

If 'verbose==TRUE', information on the progress of the optimization are shown.

a

A list returned by 'wassersteinCluster'.

plot

If 'plot==TRUE', the TASW profile is plotted.

Details

See Masarotto & Masarotto (2023) for the algorithm details.

Value

'wassersteinCluster' returns a list of length kmax-kmin+1. The ith element is a list describing the cluster solution obtained for k=kmin+i-1, and containing:

K, E, eta

the number of cluster, the average entropy and the corresponding value of 'eta';

w

the N times K soft partition matrix;

g

a M times M times K array with the cluster barycenters;

d

a N times K matrices containing the distances between the N sample covariances and the K cluster barycenters;

obj

'obj': the minimum value of the objective function.

The list may have the following attributes:

df

the degree of freedom of the sample operators (a vector). Always present.

sample.covariances

a list contaning the sample operators (as a 3-dimensional array); only present if add.sigma=TRUE;

tasw.test

a list containing the value of maxTASW computed from the data (a scalar), the nperm values of of maxTASW obtained by permutation (a vector), and the corresponding p-value (a scalar); only present if nperm>0.

'trimmedAverageSilhouette' returns a numeric vector with the TASW values.

Note

To distribute the computation on more than a cpu

  1. install the package 'future'

  2. execute in the R session

    • library(future)

    • plan(multissession)

For more options, see the future's documentation

Author(s)

Valentina Masarotto, Guido Masarotto

References

Masarotto, V. & Masarotto, G. (2023) "Covariance-based soft clustering of functional data based on the Wasserstein-Procrustes metric", Scandinavian Journal of Statistics, doi:10.1111/sjos.12692.

Examples


# Example phoneme.R (simplified) from https://doi.org/10.1111/sjos.12692. 
data(phoneme)
# resampling the log-periodograms
# 15 sample covariances for each phoneme
set.seed(12345)
nsubsamples <- 15
n <- 40
gg <- unique(Phoneme)
nphonemes <- length(gg)
N <- n*nsubsamples*nphonemes
M <- NCOL(logPeriodogram)
X <- matrix(NA, N, M)
gr <- integer(N)
r <- 1
first <- 1
last <- n
for (l in gg) {
  for (i in 1:nsubsamples) {
    X[first:last, ] <- logPeriodogram[sample(which(Phoneme==l),n), ]
    gr[first:last] <- r
    r <- r+1
    first <- first+n
    last <- last+n
  }
}
# soft clustering
a <- wassersteinCluster(X, gr)
# how many cluster?
trimmedAverageSilhouette(a)
# the membership weigths show that the
# algorithm reconstructed the five phoneme
w <- ts(a[[4]]$w)
colnames(w) <- paste("Cluster", 1:5)
plot(w, xlab="Sample covariances", main="")


[Package fdWasserstein version 1.0 Index]