tclustfsda {fsdaR} | R Documentation |
Computes trimmed clustering with scatter restrictions
Description
Partitions the points in the n-by-v data matrix Y into k clusters. This partition minimizes the trimmed sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. Rows of Y correspond to points, columns correspond to variables.
Returns in the output object of class tclustfsda.object an n-by-1 vector idx containing the cluster indices of each point. By default, tclustfsda() uses (squared), possibly constrained, Mahalanobis distances.
Usage
tclustfsda(
x,
k,
alpha,
restrfactor = 12,
monitoring = FALSE,
plot = FALSE,
nsamp,
refsteps = 15,
reftol = 1e-13,
equalweights = FALSE,
mixt = 0,
msg = FALSE,
nocheck = FALSE,
startv1 = 1,
RandNumbForNini,
restrtype = c("eigen", "deter"),
UnitsSameGroup,
numpool,
cleanpool,
trace = FALSE,
...
)
Arguments
x |
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables. Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations. |
k |
Number of groups. |
alpha |
A scalar between 0 and 0.5 or an integer specifying the number of
observations which have to be trimmed. If More in detail, if |
restrfactor |
Positive scalar which constrains the allowed differences among group scatters.
Larger values imply larger differences of group scatters. On the other hand
a value of 1 specifies the strongest restriction forcing all
eigenvalues/determinants to be equal and so the method looks
for similarly scattered (respectively spherical) clusters.
The default is to apply |
monitoring |
If |
plot |
If
When
The parameter
If The second plot (gscatter plot) shows a series of subplots which monitor the classification
for each value of
Also in the case of
|
nsamp |
If a scalar, it contains the number of subsamples which will be extracted.
If If
REMARK: If |
refsteps |
Number of refining iterations in each subsample. Default is |
reftol |
Tolerance of the refining steps. The default value is 1e-13 |
equalweights |
A logical specifying whether cluster weights in the concentration
and assignment steps shall be considered. If |
mixt |
Specifies whether mixture modelling or crisp assignment approach to model
based clustering must be used. In the case of mixture modelling parameter mixt also
controls which is the criterion to find the untrimmed units in each step of the maximization.
If |
msg |
Controls whether to display or not messages on the screen. If |
nocheck |
Check input arguments. If |
startv1 |
How to initialize centroids and covariance matrices. Scalar.
If Remark 1: in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied. Remark 2 - option |
RandNumbForNini |
pre-extracted random numbers to initialize proportions.
Matrix of size k-by-nrow(nsamp) containing the random numbers which
are used to initialize the proportions of the groups. This option is effective just if
|
restrtype |
Type of restriction to be applied on the cluster scatter matrices.
Valid values are |
UnitsSameGroup |
List of the units which must (whenever possible) have
a particular label. For example |
numpool |
The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer. |
cleanpool |
Logical, indicating whether the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so as to save the time required to open a new pool. |
trace |
Whether to print intermediate results. Default is |
... |
potential further arguments passed to lower level functions. |
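The effect of restrfactor can be illustrated with a small base R check. This is a sketch only, assuming that restrtype = "eigen" corresponds to the eigenvalue-ratio constraint of Garcia-Escudero et al. (2008), in which the ratio between the largest and the smallest eigenvalue over all group scatter matrices is bounded by the restriction factor. The helper eigen_ratio() and the matrices S1 and S2 are hypothetical and not part of fsdaR.

```r
# Hypothetical helper (not part of fsdaR): ratio between the largest and the
# smallest eigenvalue over a list of group scatter matrices.
eigen_ratio <- function(sigmas) {
  ev <- unlist(lapply(sigmas, function(S) {
    eigen(S, symmetric = TRUE, only.values = TRUE)$values
  }))
  max(ev) / min(ev)
}

# Two illustrative 2 x 2 scatter matrices (made-up values)
S1 <- diag(c(1, 4))
S2 <- diag(c(2, 8))

ratio <- eigen_ratio(list(S1, S2))  # eigenvalues are 1, 4, 2 and 8
ratio <= 12                         # within the bound for restrfactor = 12
ratio <= 1                          # a factor of 1 would need equal eigenvalues
```

With restrfactor = 1 the only admissible solutions have all eigenvalues equal, which is why that setting favours spherical, equally scattered clusters.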
Details
This iterative algorithm initializes k clusters randomly and performs concentration steps in order to improve the current cluster assignment. The maximum number of concentration steps to be performed is given by the input parameter refsteps.
To approximate the global optimum, the algorithm is initialized nsamp times, and concentration steps are performed until convergence or until refsteps is reached. More complex data sets require higher values of nsamp and refsteps, at the cost of extra computation time.
If more than 10 per cent of the iterations do not converge, a warning message is issued, indicating that nsamp has to be increased.
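The multi-start scheme described above can be sketched in a few lines of base R. This is a toy illustration and not the fsdaR implementation: it uses squared Euclidean distances instead of constrained Mahalanobis distances, applies no scatter restriction, and assumes that alpha is a proportion so that floor(n * (1 - alpha)) units are kept. The function trimmed_csteps() and the simulated data are hypothetical.

```r
# Toy sketch only: trimmed k-means-style concentration steps in base R.
# trimmed_csteps() is a hypothetical name, not a function of fsdaR.
trimmed_csteps <- function(x, k, alpha = 0.1, nsamp = 20,
                           refsteps = 15, reftol = 1e-13) {
  x <- as.matrix(x)
  n <- nrow(x)
  h <- floor(n * (1 - alpha))          # assumed number of untrimmed units
  best <- list(obj = Inf)
  for (s in seq_len(nsamp)) {
    # Random initialization: k distinct rows of x serve as starting centers
    centers <- x[sample(n, k), , drop = FALSE]
    obj_old <- Inf
    for (it in seq_len(refsteps)) {
      # n x k matrix of squared Euclidean distances to the current centers
      d2 <- vapply(seq_len(k), function(j) {
        rowSums(sweep(x, 2, centers[j, ])^2)
      }, numeric(n))
      idx <- max.col(-d2, ties.method = "first")   # index of closest center
      dmin <- d2[cbind(seq_len(n), idx)]
      # Concentration step: keep the h units closest to their center
      keep <- order(dmin)[seq_len(h)]
      trimmed <- rep(TRUE, n)
      trimmed[keep] <- FALSE
      idx[trimmed] <- 0L                           # 0 marks trimmed units
      # Update each center from its untrimmed members
      for (j in seq_len(k)) {
        members <- which(idx == j)
        if (length(members) > 0) {
          centers[j, ] <- colMeans(x[members, , drop = FALSE])
        }
      }
      # Trimmed sum of squared distances to the centers used for assignment
      obj <- sum(dmin[keep])
      if (abs(obj_old - obj) <= reftol) break      # converged
      obj_old <- obj
    }
    # Keep the best solution over all random starts
    if (obj < best$obj) best <- list(obj = obj, idx = idx, centers = centers)
  }
  best
}

# Usage on simulated data: two well-separated bivariate groups
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
fit <- trimmed_csteps(x, k = 2, alpha = 0.1)
table(fit$idx)   # label 0 collects the trimmed units
```

Increasing nsamp adds random starts and increasing refsteps allows more refinement per start, which mirrors the trade-off between solution quality and computation time noted above.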
Value
Depending on the input parameter monitoring, one of the following objects will be returned:
Author(s)
FSDA team, valentin.todorov@chello.at
References
Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008). A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol. 36, 1324-1345. doi:10.1214/07-AOS515.
Examples
## Not run:
data(hbk, package="robustbase")
(out <- tclustfsda(hbk[, 1:3], k=2))
class(out)
summary(out)
## TCLUST of Geyser data with three groups (k=3), 10% trimming (alpha=0.1)
## and restriction factor (c=10000)
data(geyser2)
(out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000))
## Use the plot options to produce more complex plots ----------
## Plot with all default options
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=TRUE)
## Default confidence ellipses.
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="ellipse")
## Confidence ellipses specified by the user: confidence ellipses set to 0.5
plots <- list(type="ellipse", conflev=0.5)
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)
## Confidence ellipses set to 0.9
plots <- list(type="ellipse", conflev=0.9)
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)
## Contour plots
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="contour")
## Filled contour plots with additional options: contourf plot with a named colormap.
## Here we define four MATLAB-like colormaps, but the user can define any other one,
## represented by a matrix with three columns which are the RGB triplets.
summer <- as.matrix(data.frame(x1=seq(from=0, to=1, length=65),
x2=seq(from=0.5, to=1, length=65),
x3=rep(0.4, 65)))
spring <- as.matrix(data.frame(x1=rep(1, 65),
x2=seq(from=0, to=1, length=65),
x3=seq(from=1, to=0, length=65)))
winter <- as.matrix(data.frame(x1=rep(0, 65),
x2=seq(from=0, to=1, length=65),
x3=seq(from=1, to=0, length=65)))
autumn <- as.matrix(data.frame(x1=rep(1, 65),
x2=seq(from=0, to=1, length=65),
x3=rep(0, 65)))
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
plot=list(type="contourf", cmap=autumn))
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
plot=list(type="contourf", cmap=winter))
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
plot=list(type="contourf", cmap=spring))
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,
plot=list(type="contourf", cmap=summer))
## We compare the output using three different values of restriction factor
## nsamp is the number of subsamples which will be extracted
data(geyser2)
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, nsamp=500, plot="ellipse")
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10, nsamp=500, refsteps=10, plot="ellipse")
out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=1, nsamp=500, refsteps=10, plot="ellipse")
## TCLUST applied to M5 data: a bivariate data set obtained from three normal
## bivariate distributions with different scales and proportions 1:2:2. One of the
## components strongly overlaps another one. A 10 per cent background noise is
## added, uniformly distributed in a rectangle containing the three normal components
## and with little overlap with the three mixture components. A precise description
## of the M5 data set can be found in Garcia-Escudero et al. (2008).
##
data(M5data)
plot(M5data[, 1:2])
## Scatter plot matrix
library(rrcov)
plot(CovClassic(M5data[,1:2]), which="pairs")
out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=1000, nsamp=100, plot=TRUE)
out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=10, nsamp=100, plot=TRUE)
out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1, nsamp=1000,
plot=TRUE, equalweights=TRUE)
out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1000, nsamp=100, plot=TRUE)
## TCLUST with simulated data: 5 groups and 5 variables
##
n1 <- 100
n2 <- 80
n3 <- 50
n4 <- 80
n5 <- 70
p <- 5
Y1 <- matrix(rnorm(n1 * p) + 5, ncol=p)
Y2 <- matrix(rnorm(n2 * p) + 3, ncol=p)
Y3 <- matrix(rnorm(n3 * p) - 2, ncol=p)
Y4 <- matrix(rnorm(n4 * p) + 2, ncol=p)
Y5 <- matrix(rnorm(n5 * p), ncol=p)
group <- c(rep(1, n1), rep(2, n2), rep(3, n3), rep(4, n4), rep(5, n5))
Y <- rbind(Y1, Y2, Y3, Y4, Y5)
dim(Y)
table(group)
out <- tclustfsda(Y, k=5, alpha=0.05, restrfactor=1.3, refsteps=20, plot=TRUE)
## Automatic choice of best number of groups for Geyser data ------------------------
##
data(geyser2)
maxk <- 6
CLACLA <- matrix(0, nrow=maxk, ncol=2)
CLACLA[,1] <- 1:maxk
MIXCLA <- MIXMIX <- CLACLA
for(j in 1:maxk) {
out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5)
CLACLA[j, 2] <- out$CLACLA
}
for(j in 1:maxk) {
out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, mixt=2)
MIXMIX[j ,2] <- out$MIXMIX
MIXCLA[j, 2] <- out$MIXCLA
}
oldpar <- par(mfrow=c(1,3))
plot(CLACLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="CLACLA")
plot(MIXMIX[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXMIX")
plot(MIXCLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXCLA")
par(oldpar)
## Monitoring examples ------------------------------------------
## Monitoring using Geyser data (all default options)
## alpha and the restriction factor are not specified, therefore alpha=c(0.10, 0.05, 0)
## is used while the restriction factor is set to c=12
out <- tclustfsda(geyser2, k=3, monitoring=TRUE)
## Monitoring using Geyser data with alpha and c specified.
out <- tclustfsda(geyser2, k=3, restrfactor=100, alpha=seq(0.10, 0, by=-0.01), monitoring=TRUE)
## Monitoring using Geyser data with plot argument specified as a list.
## The trimming levels to consider in this case are 31 values of alpha
##
out <- tclustfsda(geyser2, k=3, restrfactor=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
plot=list(alphasel=c(0.2, 0.10, 0.05, 0.01)), trace=TRUE)
## Monitoring using Geyser data with argument UnitsSameGroup
##
## Make sure that the group containing unit 10 is labelled "group 1"
## and the group containing unit 12 is labelled "group 2"
##
## Mixture model is used (mixt=2)
##
out <- tclustfsda(geyser2, k=3, restrfactor=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
mixt=2, UnitsSameGroup=c(10, 12), trace=TRUE)
## Monitoring using M5 data
data(M5data)
## alphavec=vector which contains the trimming levels to consider
## in this case 6 values of alpha are considered (0.10, 0.08, ..., 0)
alphavec <- seq(0.10, 0, by=-0.02)
out <- tclustfsda(M5data[, 1:2], 3, alpha=alphavec, restrfactor=1000, monitoring=TRUE,
nsamp=1000, plot=TRUE)
## End(Not run)