R: Computation of Initial Seeds for Kmeans with a Functional...

fdebrik {briKmeans}

R Documentation

Computation of Initial Seeds for Kmeans with a Functional Extension of Brik

Description

fdebrik first fits splines to the multivariate data set; it then identifies functional centers that form tighter groups by means of the kma algorithm; finally, it converts these centers into a multivariate data set in a selected dimension, clusters them, and takes the deepest point of each cluster as an initial seed. The multivariate objective function is not necessarily minimised, but better allocations are generally obtained.

Usage

fdebrik(x, k, method="Ward", nstart=1, B = 10, J = 2, x.coord = NULL,
    functionalDist="d0.pearson", OSF = 1, vect = NULL, intercept = TRUE, 
    degPolyn = 3, degFr = 5, knots = NULL, ...)

Arguments

`x`	a data matrix containing `N` observations (individuals) by rows and `d` variables (features) by columns
`k`	number of clusters
`method`	clustering algorithm used to cluster the cluster centres from the bootstrapped replicates; `Ward`, by default. Besides `Ward`, only `pam` and randomly initialised `kmeans` (with `nstart` initialisations) are currently implemented
`nstart`	number of random initialisations when using the `kmeans` method to cluster the cluster centres
`B`	number of bootstrap replicates to be generated
`J`	number of observations used to build the bands for the MBD computation. Currently, only the value J=2 can be used
`x.coord`	initial x coordinates (time points) where the functional data is observed; if not provided, it is assumed to be `1:d`
`functionalDist`	similarity measure between functions to be used. Currently, only the cosine of the angles between functions (`"d0.pearson"`) and between their derivatives (`"d1.pearson"`) can be used
`OSF`	oversampling factor for the smoothed data; an OSF of m means that the number of (equally spaced) time points at which the approximated function is evaluated is m times the original number of features, `d`
`vect`	optional collection of x coordinates (time points) where to assess the smoothed data; if provided, it ignores the OSF
`intercept`	if `TRUE`, an intercept is included in the basis; default is `TRUE`, as in the usage above
`degPolyn`	degree of the piecewise polynomial; 3 by default (cubic splines)
`degFr`	degrees of freedom, as in the `bs` function
`knots`	the internal breakpoints that define the spline
`...`	additional arguments to be passed to the `kmeans` function for the final clustering; at this stage `nstart` is set to 1, as the initial seeds are fixed
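
To illustrate how `OSF`, `degPolyn`, `degFr` and `intercept` interact, the following is a minimal sketch of smoothing one noisy curve with a B-spline basis and evaluating it on an oversampled grid. It assumes a least-squares fit with `lm` on a `splines::bs` basis; this is an illustration only, and the fitting routine used inside `fdebrik` may differ. The simulated curve and the variable names `grid`, `basis`, `fit` and `smoothed` are illustrative.

```r
library(splines)

set.seed(1)
d <- 21
x.coord <- seq(0, 1, length.out = d)
y <- sin(2 * pi * x.coord) + rnorm(d, 0, 0.2)   # one noisy observed curve

OSF <- 2
# Evaluation grid: OSF * d equally spaced points over the observed range
grid <- seq(min(x.coord), max(x.coord), length.out = OSF * d)

# Cubic B-spline basis (degree 3) with 5 degrees of freedom and an intercept
basis <- bs(x.coord, df = 5, degree = 3, intercept = TRUE)

# Least-squares fit; "- 1" because the basis already contains the intercept
fit <- lm(y ~ basis - 1)

# Evaluate the fitted spline on the finer grid
smoothed <- as.vector(predict(basis, grid) %*% coef(fit))
length(smoothed)
```

Under these assumptions, a smoothed curve has `OSF * d` coordinates; supplying explicit evaluation points instead of `grid` corresponds to the role of `vect`.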

Details

The FDEBRIk algorithm extends the BRIk algorithm to the case of longitudinal functional data by adding a B-spline fitting step, the collection of functional centers by means of the kma algorithm, and the evaluation of these centers at specific x coordinates. It can thus handle issues such as noisy or missing data. It identifies smoothed initial seeds that are used as starting points of kmeans on the smoothed data. The resulting clustering does not optimise the distortion (sum of squared distances of each data point to its nearest centre) in the original data space, but in general it provides a better allocation of data points to the real groups.
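
The "deepest point" of a cluster is determined with the modified band depth (MBD) using bands built from `J = 2` observations. A minimal base-R sketch of that depth is given below; the helper name `mbd2` is hypothetical and the package's internal implementation may differ (for instance, in efficiency or in how ties are handled).

```r
# Modified band depth with bands formed by J = 2 curves.
# X: matrix with one curve per row, all evaluated on a common grid.
mbd2 <- function(X) {
  pairs <- combn(nrow(X), 2)        # all pairs of curves, each defining a band
  Xt <- t(X)                        # grid points by rows, curves by columns
  depth <- numeric(nrow(X))
  for (p in seq_len(ncol(pairs))) {
    lo <- pmin(X[pairs[1, p], ], X[pairs[2, p], ])   # lower envelope of the band
    hi <- pmax(X[pairs[1, p], ], X[pairs[2, p], ])   # upper envelope of the band
    # proportion of grid points at which each curve lies inside the band
    depth <- depth + colMeans(Xt >= lo & Xt <= hi)
  }
  depth / ncol(pairs)               # average over all bands
}

# Three flat curves at heights 0, 1 and 2: the middle one is the deepest
X <- rbind(rep(0, 4), rep(1, 4), rep(2, 4))
depth <- mbd2(X)
deepest <- which.max(depth)
```

In this toy example the middle curve lies inside every band, so it attains the maximum depth and would be the one selected as seed.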

Value

`seeds`	a matrix of size `k x D`, where `D` is either `m x d` (with `m` the oversampling factor `OSF`) or the length of `vect`. It contains the initial smoothed seeds obtained with the FDEBRIk algorithm
`km`	an object of class `kmeans` corresponding to the run of kmeans on the smoothed data, with starting points `seeds`
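
As an illustration of how fixed seeds drive the final clustering, the sketch below runs `kmeans` with a matrix of starting centres and computes the distortion by hand. The data `xs` and the matrix `seeds` are hypothetical stand-ins for the smoothed data and for the `seeds` component; they are not produced by `fdebrik` here.

```r
set.seed(1)
# Stand-in for smoothed data: two well-separated groups of 10 points each
xs <- rbind(matrix(rnorm(20, 0, 0.1), ncol = 2),
            matrix(rnorm(20, 5, 0.1), ncol = 2))

# Hypothetical seeds, one row per cluster (a k x D matrix)
seeds <- rbind(c(0, 0), c(5, 5))

# With a matrix of centres, kmeans starts from exactly these points
km <- kmeans(xs, centers = seeds)

# Distortion: sum of squared distances of each point to its assigned centre
distortion <- sum((xs - km$centers[km$cluster, ])^2)
```

Cluster labels follow the row order of `seeds`, and `distortion` coincides with `km$tot.withinss`.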

Author(s)

Javier Albert Smet javas@kth.se and Aurora Torrente etorrent@est-econ.uc3m.es

References

Torrente, A. and Romo, J. Initializing Kmeans Clustering by Bootstrap and Data Depth. J Classif (2021) 38(2):232-256. DOI: 10.1007/s00357-020-09372-3

Albert-Smet, J., Torrente, A. and Romo, J. Modified Band Depth Based Initialization of Kmeans for Functional Data Clustering. Submitted to Adv. Data Anal. Classif. (2022).

Sangalli, L.M., Secchi, P., Vantini, S. and Vitelli, V. K-mean alignment for curve clustering. Comput. Stat. Data Anal. (2010) 54(5):1219-1233. DOI: 10.1016/j.csda.2009.12.008

Examples

## fdebrik algorithm 
    ## Not run: 
    ## simulated data
    set.seed(1)
    x.coord = seq(0,1,0.05)
    x <- matrix(ncol = length(x.coord), nrow = 40)
    labels <- matrix(ncol = 40, nrow = 1)
  
    centers <-  matrix(ncol = length(x.coord), nrow = 4)
    centers[1, ] <- abs(x.coord)-0.5
    centers[2, ] <- (abs(x.coord-0.5))^2 - 0.8
    centers[3, ] <- -(abs(x.coord-0.5))^2 + 0.7
    centers[4, ] <- 0.75*sin(8*pi*abs(x.coord))
  
    ## 10 noisy copies of each of the 4 centers
    for(i in 1:4){
        for(j in 1:10){
            labels[10*(i-1) + j] <- i  
            x[10*(i-1) + j, ] <- centers[i, ] + 
                rnorm(length(x.coord),0,1.5)
            }
        }

    C1 <- kmeans(x,4)
    C2 <- fdebrik(x,4,B=5)
  
    table(C1$cluster, labels)
    table(C2$km$cluster, labels)
    
## End(Not run)

[Package briKmeans version 1.0 Index]