R: Computes trimmed clustering with scatter restrictions

tclustfsda {fsdaR}

R Documentation

Computes trimmed clustering with scatter restrictions

Description

Partitions the points in the n-by-p data matrix x into k clusters. This partition minimizes the trimmed sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. Rows of x correspond to points, columns correspond to variables. Returns in the output object of class tclustfsda.object an n-by-1 vector idx containing the cluster indices of each point. By default, tclustfsda() uses (squared), possibly constrained, Mahalanobis distances.

Usage

tclustfsda(
  x,
  k,
  alpha,
  restrfactor = 12,
  monitoring = FALSE,
  plot = FALSE,
  nsamp,
  refsteps = 15,
  reftol = 1e-13,
  equalweights = FALSE,
  mixt = 0,
  msg = FALSE,
  nocheck = FALSE,
  startv1 = 1,
  RandNumbForNini,
  restrtype = c("eigen", "deter"),
  UnitsSameGroup,
  numpool,
  cleanpool,
  trace = FALSE,
  ...
)

Arguments

`x`	An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
`k`	Number of groups.
`alpha`	A scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If `alpha=0`, `tclust` reduces to traditional model based or mixture clustering (mclust): see for example the Matlab function `gmdistribution`. In more detail, if `0 < alpha < 1` clustering is based on `h = floor(n * (1-alpha))` observations, else if `alpha` is an integer greater than 1 clustering is based on `h = n - floor(alpha)`. If `monitoring=TRUE`, `alpha` is a vector which specifies the trimming levels to be considered; it contains decreasing elements which lie in the interval [0, 0.5]. For example if `alpha=c(0.1, 0.05, 0)`, `tclust()` considers these 3 values of trimming level. The default for `alpha` is the vector `alpha=c(0.1, 0.05, 0)`. The sequence is forced to be monotonically decreasing.
`restrfactor`	Positive scalar which constrains the allowed differences among group scatters. Larger values imply larger differences of group scatters. On the other hand a value of 1 specifies the strongest restriction forcing all eigenvalues/determinants to be equal and so the method looks for similarly scattered (respectively spherical) clusters. The default is to apply `restrfactor` to eigenvalues. In order to apply `restrfactor` to determinants it is necessary to use the optional input argument `restrtype`.
`monitoring`	If `monitoring=TRUE` TCLUST is performed for a series of values of the trimming factor `alpha` given `k` (number of groups) and given `c` (restriction factor).
`plot`	If `plot=FALSE` (default) or `plot=0` no plot is produced. If `plot=TRUE` and `monitoring=FALSE` a plot with the classification is shown (using the spmplot function). The plot can be: for `p = 1`, a histogram of the univariate data; for `p = 2`, a bivariate scatterplot; for `p > 2`, a scatterplot matrix generated by the MATLAB function `spmplot()`. When `p >= 2` the following additional features are offered (for `p = 1` the behaviour is forced to be as for `plot=TRUE`):
  `plot = 'contourf'` adds in the background of the bivariate scatterplots a filled contour plot. The colormap of the filled contour is based on grey levels by default. This argument may also be inserted in a field named `type` of a list. In the latter case it is possible to specify the additional field `cmap`, which changes the default colors of the color map used. The field `cmap` may be a three-column matrix of values in the range [0,1] where each row is an RGB triplet that defines one color. Check the `colormap()` (MATLAB) function for additional information.
  `plot = 'contour'` adds in the background of the bivariate scatterplots a contour plot. The colormap of the contour is based on grey levels by default. This argument may also be inserted in a field named `type` of a list, with the same optional field `cmap` as for `'contourf'`.
  `plot = 'ellipse'` superimposes confidence ellipses on each group in the bivariate scatterplots. The size of the ellipse is `qchisq(0.95, 2)`, i.e. the confidence level used by default is 95 percent. This argument may also be inserted in a field named `type` of a list. In the latter case it is possible to specify the additional field `conflev`, which specifies the confidence level to use and is a value between 0 and 1.
  `plot = 'boxplotb'` superimposes on the bivariate scatterplots the bivariate boxplots for each group, using the boxplotb function. This argument may also be inserted in a field named `type` of a list.
The parameter `plot` can also be a list, in which case its elements are:
  `type` - specifies the type of plot, as when `plot` is a character. Therefore, `plot$type` can be one of 'contourf', 'contour', 'ellipse' or 'boxplotb'.
  `cmap` - used to set a colormap for the plot type (MATLAB style). For example, `plot$cmap = 'autumn'`. See the MATLAB help of colormap for a list of colormap possibilities.
  `conflev` - the confidence level for the confidence ellipses. It must be a scalar between 0 and 1.
If `plot=TRUE` and `monitoring=TRUE` two plots are shown. The first plot (monitor plot) shows three panels with the monitoring between two consecutive values of `alpha`: (i) the change in classification using the ARI index (top panel), (ii) the change in centroids using squared Euclidean distances (central panel) and (iii) the change in covariance matrices using squared Euclidean distance (bottom panel). The second plot (gscatter plot) shows a series of subplots which monitor the classification for each value of `alpha`. In order to make sure that consistent labels are used for the groups between two consecutive values of `alpha`, we assign label `r` to a group if this group shows the smallest distance with group `r` for the previous value of `alpha`. The type of plot which is used to monitor the stability of the classification depends on the data dimensionality `p`: for `p = 1`, a histogram of the univariate data (the MATLAB function `histFS()` is called); for `p = 2`, a bivariate scatterplot (the MATLAB function `gscatter()` is called); for `p > 2`, a scatterplot of the first two principal components (function `gscatter()` is called and the axis titles show the percentage of variance explained by the first two principal components).
Also in the case of `monitoring=TRUE` the parameter `plot` can be a list, and its elements are:
  `name` - character vector which specifies which plot to display. `name = "gscatter"` produces a figure with a series of subplots which show the classification for each value of `alpha`. `name = "monitor"` shows a figure with three panels which monitor, between two consecutive values of `alpha`, the change in classification using the ARI index (top panel), the change in centroids using squared Euclidean distances (central panel), and the change in covariance matrices using squared Euclidean distance (bottom panel). If this field is not specified, by default `name=c("gscatter", "monitor")` and both figures will be shown.
  `alphasel` - a numeric vector which specifies the values of `alpha` for which the classification is shown. For example if `alphasel = c(0.05, 0.02)`, the classification will be shown just for `alpha=0.05` and `alpha=0.02`. If this field is not specified `alphasel=alpha` and therefore the classification is shown for each value of `alpha`.
`nsamp`	If a scalar, it contains the number of subsamples which will be extracted. If `nsamp = 0` all subsets will be extracted. Remark: if the number of all possible subsets is less than 300 the default is to extract all subsets, otherwise just 300. If `nsamp` is a matrix it contains in the rows the indexes of the subsets which have to be extracted. `nsamp` in this case can be conveniently generated by function `subsets()`. `nsamp` can have `k` columns or `k * (p + 1)` columns. If `nsamp` has `k` columns, the `k` initial centroids in iteration `i` are given by `X[nsamp[i, ], ]` and the covariance matrices are equal to the identity. If `nsamp` has `k * (p + 1)` columns, the initial centroids and covariance matrices in iteration `i` are computed as follows: with `X1 <- X[nsamp[i, ], ]`, the column means and `cov()` of `X1[1:(p + 1), ]` give the initial centroid and covariance matrix for group 1; the column means and `cov()` of `X1[(p + 2):(2 * p + 2), ]` give the initial centroid and covariance matrix for group 2; ...; the column means and `cov()` of `X1[((k - 1) * (p + 1) + 1):(k * (p + 1)), ]` give the initial centroid and covariance matrix for group k. REMARK: If `nsamp` is not a scalar, the option `startv1` given below is ignored. More precisely, if `nsamp` has `k` columns `startv1 = 0`, else if `nsamp` has `k * (p + 1)` columns `startv1 = 1`.
`refsteps`	Number of refining iterations in each subsample. Default is `refsteps=15`. `refsteps = 0` means "raw-subsampling" without iterations.
`reftol`	Tolerance of the refining steps. The default value is `reftol=1e-13`.
`equalweights`	A logical specifying whether cluster weights in the concentration and assignment steps shall be considered. If `equalweights=TRUE` we are (ideally) assuming equally sized groups, else if `equalweights=FALSE` (default) we allow for different group weights. Please check in the given references which functions are maximized in both cases.
`mixt`	Specifies whether mixture modelling or crisp assignment approach to model based clustering must be used. In the case of mixture modelling parameter mixt also controls which is the criterion to find the untrimmed units in each step of the maximization. If `mixt >=1` mixture modelling is assumed else crisp assignment. The default value is `mixt=0`, i.e. crisp assignment. Please see for details the provided references. The parameter `mixt` also controls the criterion to select the units to trim. If `mixt = 2` the `h` units are those which give the largest contribution to the likelihood, else if `mixt=1` the criterion to select the `h` units is exactly the same as the one which is used in crisp assignment.
`msg`	Controls whether or not to display messages on the screen. If `msg=TRUE` messages are displayed on the screen. If `msg=2`, detailed messages are displayed, for example the information at iteration level.
`nocheck`	Controls the checking of input arguments. If `nocheck=TRUE` no check is performed on matrix `x`. The default is `nocheck=FALSE`.
`startv1`	How to initialize centroids and covariance matrices. Scalar. If `startv1=1` then initial centroids and covariance matrices are based on `(p+1)` observations randomly chosen, else each centroid is initialized taking a random row of the input data matrix and covariance matrices are initialized with identity matrices. The default value is `startv1=1`. Remark 1: in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied. Remark 2: option `startv1` is used only if `nsamp` is a scalar (for more details see the help associated with `nsamp`).
`RandNumbForNini`	Pre-extracted random numbers to initialize proportions. Matrix of size k-by-nrow(nsamp) containing the random numbers which are used to initialize the proportions of the groups. This option is effective only if `nsamp` is a matrix which contains pre-extracted subsamples. The purpose of this option is to enable the user to replicate the results when routine `tclustreg*()` is called using a parfor instruction (as happens, for example, in routine IC, where `tclustreg()` is called through a parfor for different values of the restriction factor). The default is that `RandNumbForNini` is empty, in which case uniform random numbers are used.
`restrtype`	Type of restriction to be applied on the cluster scatter matrices. Valid values are `'eigen'` (default), or `'deter'`. `"eigen"` implies restriction on the eigenvalues while `"deter"` implies restriction on the determinants.
`UnitsSameGroup`	List of the units which must (whenever possible) have a particular label. For example `UnitsSameGroup=c(20, 26)` means that the group which contains unit 20 is always labelled with number 1. Similarly, the group which contains unit 26 is always labelled with number 2 (unless it is found that unit 26 already belongs to group 1). In general, the group which contains unit `UnitsSameGroup[r]`, where `r = 2, ..., length(kk)-1`, is labelled with number `r` (unless it is found that unit `UnitsSameGroup[r]` has already been assigned to groups `1, 2, ..., r-1`).
`numpool`	The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer.
`cleanpool`	Logical, indicating if the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so as to save the time required to open a new pool.
`trace`	Whether to print intermediate results. Default is `trace=FALSE`.
`...`	potential further arguments passed to lower level functions.
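
The `k * (p + 1)`-column form of `nsamp` described above can be illustrated with a small base-R sketch. This is not the package's internal code: the helper name `init_from_subset()` is hypothetical, and the sketch assumes that consecutive blocks of `p + 1` indices in one row of `nsamp` define the groups in order.

```r
# Hypothetical helper (not part of fsdaR): derive initial centroids and
# covariance matrices from one row of a k * (p + 1)-column 'nsamp' matrix.
init_from_subset <- function(X, idx_row, k) {
  p <- ncol(X)
  stopifnot(length(idx_row) == k * (p + 1))
  X1 <- X[idx_row, , drop = FALSE]
  lapply(seq_len(k), function(j) {
    # Rows of X1 that belong to group j: block j of size p + 1
    rows <- ((j - 1) * (p + 1) + 1):(j * (p + 1))
    block <- X1[rows, , drop = FALSE]
    list(center = colMeans(block), cov = cov(block))
  })
}

set.seed(1)
X <- matrix(rnorm(60), ncol = 2)                 # n = 30, p = 2
k <- 3
idx_row <- sample(nrow(X), k * (ncol(X) + 1))    # one row of 'nsamp'
init <- init_from_subset(X, idx_row, k)
length(init)          # one entry per group
dim(init[[1]]$cov)    # p-by-p covariance matrix for group 1
```

Each block needs at least `p + 1` rows so that the sample covariance matrix can be of full rank, which is presumably why this form of `nsamp` uses `k * (p + 1)` columns.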

Details

This iterative algorithm initializes k clusters randomly and performs concentration steps in order to improve the current cluster assignment. The maximum number of concentration steps to be performed is given by the input parameter refsteps. To approximate the global optimum, the system is initialized nsamp times and concentration steps are performed until convergence or until refsteps is reached. When processing more complex data sets, higher values of nsamp and refsteps have to be specified (obviously implying extra computation time). However, if more than 10 per cent of the iterations do not converge, a warning message is issued, indicating that nsamp has to be increased.
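
A single concentration step can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the FSDA implementation: it uses crisp assignment, applies no scatter restriction, and does not re-estimate the parameters afterwards. The function names `retained_size()` and `concentration_step()` are hypothetical, and coding trimmed observations as 0 is a convention of this sketch.

```r
# Number of observations retained for a given trimming level, following the
# rule stated for 'alpha' (values >= 1 are treated as counts: an assumption).
retained_size <- function(n, alpha) {
  if (alpha < 1) floor(n * (1 - alpha)) else n - floor(alpha)
}

# One simplified concentration step: assign each point to its best-fitting
# cluster, then keep only the h points with the largest contributions.
concentration_step <- function(X, centers, covs, weights, h) {
  n <- nrow(X)
  k <- length(covs)
  # n-by-k matrix of log-density contributions (constant term dropped)
  ll <- vapply(seq_len(k), function(j) {
    d2 <- mahalanobis(X, centers[j, ], covs[[j]])
    logdet <- as.numeric(determinant(covs[[j]], logarithm = TRUE)$modulus)
    log(weights[j]) - 0.5 * (d2 + logdet)
  }, numeric(n))
  best <- max.col(ll, ties.method = "first")
  score <- ll[cbind(seq_len(n), best)]
  keep <- order(score, decreasing = TRUE)[seq_len(h)]
  idx <- integer(n)            # 0 marks a trimmed observation in this sketch
  idx[keep] <- best[keep]
  idx
}

set.seed(2)
X <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40) + 10, ncol = 2),
           c(100, 100), c(-100, 100))          # two gross outliers
h <- retained_size(nrow(X), alpha = 2)         # trim two observations
idx <- concentration_step(X,
                          centers = rbind(c(0, 0), c(10, 10)),
                          covs = list(diag(2), diag(2)),
                          weights = c(0.5, 0.5), h = h)
table(idx)
```

In a full algorithm this step would alternate with re-estimation of centroids, covariance matrices and weights from the retained points, repeated up to refsteps times for each of the nsamp starts.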

Value

Depending on the input parameter monitoring, one of the following objects will be returned:

tclustfsda.object
tclusteda.object

Author(s)

FSDA team, valentin.todorov@chello.at

References

Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008). A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol. 36, 1324-1345. doi:10.1214/07-AOS515.

Examples

 ## Not run: 

 data(hbk, package="robustbase")
 (out <- tclustfsda(hbk[, 1:3], k=2))
 class(out)
 summary(out)

 ##  TCLUST of Geyser data with three groups (k=3), 10% trimming (alpha=0.1)
 ##      and restriction factor (c=10000)
 data(geyser2)
 (out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000))

 ## Use the plot options to produce more complex plots ----------

 ##  Plot with all default options
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=TRUE)

 ##  Default confidence ellipses.
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="ellipse")

 ##  Confidence ellipses specified by the user: confidence ellipses set to 0.5
 plots <- list(type="ellipse", conflev=0.5)
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)

 ##  Confidence ellipses set to 0.9
 plots <- list(type="ellipse", conflev=0.9)
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)

 ##  Contour plots
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="contour")

 ##  Filled contour plots with additional options: contourf plot with a named colormap.
 ##  Here we define four MATLAB-like colormaps, but the user can define anything else,
 ##  presented by a matrix with three columns which are the RGB triplets.

 summer <- as.matrix(data.frame(x1=seq(from=0, to=1, length=65),
                                x2=seq(from=0.5, to=1, length=65),
                                x3=rep(0.4, 65)))
 spring <- as.matrix(data.frame(x1=rep(1, 65),
                                x2=seq(from=0, to=1, length=65),
                                x3=seq(from=1, to=0, length=65)))
 winter <- as.matrix(data.frame(x1=rep(0, 65),
                                x2=seq(from=0, to=1, length=65),
                                x3=seq(from=1, to=0, length=65)))
 autumn <- as.matrix(data.frame(x1=rep(1, 65),
                                x2=seq(from=0, to=1, length=65),
                                x3=rep(0, 65)))

 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
       plot=list(type="contourf", cmap=autumn))
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
       plot=list(type="contourf", cmap=winter))
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
       plot=list(type="contourf", cmap=spring))
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
       plot=list(type="contourf", cmap=summer))


 ##  We compare the output using three different values of restriction factor
 ##      nsamp is the number of subsamples which will be extracted
 data(geyser2)
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, nsamp=500, plot="ellipse")
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10, nsamp=500, refsteps=10, plot="ellipse")
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=1, nsamp=500, refsteps=10, plot="ellipse")

 ##  TCLUST applied to M5 data: A bivariate data set obtained from three normal
 ##  bivariate distributions with different scales and proportions 1:2:2. One of the
 ##  components is very overlapped with another one. A 10 per cent background noise is
 ##  added uniformly distributed in a rectangle containing the three normal components
 ##  and not very overlapped with the three mixture components. A precise description
 ##  of the M5 data set can be found in Garcia-Escudero et al. (2008).
 ##

 data(M5data)
 plot(M5data[, 1:2])

 ##  Scatter plot matrix
 library(rrcov)
 plot(CovClassic(M5data[,1:2]), which="pairs")

 out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=1000, nsamp=100, plot=TRUE)
 out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=10, nsamp=100, plot=TRUE)
 out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1, nsamp=1000,
         plot=TRUE, equalweights=TRUE)
 out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1000, nsamp=100, plot=TRUE)

 ##  TCLUST with simulated data: 5 groups and 5 variables
 ##
 n1 <- 100
 n2 <- 80
 n3 <- 50
 n4 <- 80
 n5 <- 70
 p <- 5
 Y1 <- matrix(rnorm(n1 * p) + 5, ncol=p)
 Y2 <- matrix(rnorm(n2 * p) + 3, ncol=p)
 Y3 <- matrix(rnorm(n3 * p) - 2, ncol=p)
 Y4 <- matrix(rnorm(n4 * p) + 2, ncol=p)
 Y5 <- matrix(rnorm(n5 * p), ncol=p)

 group <- c(rep(1, n1), rep(2, n2), rep(3, n3), rep(4, n4), rep(5, n5))
 Y <- Y1
 Y <- rbind(Y, Y2)
 Y <- rbind(Y, Y3)
 Y <- rbind(Y, Y4)
 Y <- rbind(Y, Y5)
 dim(Y)
 table(group)
 out <- tclustfsda(Y, k=5, alpha=0.05, restrfactor=1.3, refsteps=20, plot=TRUE)

 ##  Automatic choice of best number of groups for Geyser data ------------------------
 ##
 data(geyser2)
 maxk <- 6
 CLACLA <- matrix(0, nrow=maxk, ncol=2)
 CLACLA[,1] <- 1:maxk
 MIXCLA <- MIXMIX <- CLACLA

 for(j in 1:maxk) {
     out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5)
     CLACLA[j, 2] <- out$CLACLA
 }

 for(j in 1:maxk) {
     out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, mixt=2)
     MIXMIX[j ,2] <- out$MIXMIX
     MIXCLA[j, 2] <- out$MIXCLA
 }

 oldpar <- par(mfrow=c(1,3))
 plot(CLACLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="CLACLA")
 plot(MIXMIX[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXMIX")
 plot(MIXCLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXCLA")
 par(oldpar)


 ##  Monitoring examples ------------------------------------------

 ##  Monitoring using Geyser data

 ##  Monitoring using Geyser data (all default options)
 ##  alpha and restriction factor are not specified therefore alpha=c(0.10, 0.05, 0)
 ##  is used while the restriction factor is set to c=12
 out <- tclustfsda(geyser2, k=3, monitoring=TRUE)

 ##  Monitoring using Geyser data with alpha and c specified.
 out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.10, 0, by=-0.01), monitoring=TRUE)

 ##  Monitoring using Geyser data with plot argument specified as a list.
 ##      The trimming levels to consider in this case are 31 values of alpha
 ##
 out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
         plot=list(alphasel=c(0.2, 0.10, 0.05, 0.01)), trace=TRUE)

 ##  Monitoring using Geyser data with argument UnitsSameGroup
 ##
 ##      Make sure that the group containing unit 10 is labelled "group 1"
 ##      and the group containing unit 12 is labelled "group 2"
 ##
 ##      Mixture model is used (mixt=2)
 ##
 out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
         mixt=2, UnitsSameGroup=c(10, 12), trace=TRUE)

 ##  Monitoring using M5 data
 data(M5data)

 ##  alphavec=vector which contains the trimming levels to consider
 ##  in this case 6 values of alpha are considered
 alphavec <- seq(0.10, 0, by=-0.02)
 out <- tclustfsda(M5data[, 1:2], 3, alpha=alphavec, restrfac=1000, monitoring=TRUE,
     nsamp=1000, plot=TRUE)
 
## End(Not run)
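
Several examples above vary `restrfactor` (10000, 10 and 1). As a rough aid to interpreting that argument with `restrtype = "eigen"`, the base-R sketch below computes the ratio between the largest and the smallest eigenvalue over a list of group covariance matrices. The helper `eigen_ratio()` is hypothetical and not part of fsdaR; the sketch assumes the restriction is the usual TCLUST bound on this ratio, so a fitted solution would be expected to have a ratio no larger than `restrfactor`.

```r
# Hypothetical helper (not part of fsdaR): ratio of the largest to the
# smallest eigenvalue across all group covariance matrices.
eigen_ratio <- function(covs) {
  ev <- unlist(lapply(covs, function(S) {
    eigen(S, symmetric = TRUE, only.values = TRUE)$values
  }))
  max(ev) / min(ev)
}

# Two illustrative covariance matrices with eigenvalues {1, 4} and {2, 8}
covs <- list(diag(c(1, 4)), diag(c(2, 8)))
eigen_ratio(covs)                      # 8 / 1 = 8

# Equal spherical scatters give the smallest possible ratio, 1
eigen_ratio(list(diag(2), diag(2)))
```

A ratio of 1 corresponds to the strongest restriction described under `restrfactor`, where all eigenvalues are forced to be equal.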

[Package fsdaR version 0.9-0 Index]