fixmahal {fpc} | R Documentation |
Mahalanobis Fixed Point Clusters
Description
Computes Mahalanobis fixed point clusters (FPCs), i.e., subsets of the data, which consist exactly of the non-outliers w.r.t. themselves, and may be interpreted as generated from a homogeneous normal population. FPCs may overlap, are not necessarily exhausting and do not need a specification of the number of clusters.
Note that while fixmahal
has lots of parameters, only one (or
few) of them have usually to be specified, cf. the examples. The
philosophy is to allow much flexibility, but to always provide
sensible defaults.
Usage
fixmahal(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)),
method = "fuzzy", cgen = "fixed",
ca = NA, ca2 = NA,
calpha = ifelse(method=="fuzzy",0.95,0.99),
calpha2 = 0.995,
pointit = TRUE, subset = n,
nc1 = 100+20*p,
startn = 18+p, mnc = floor(startn/2),
mer = ifelse(pointit,0.1,0),
distcut = 0.85, maxit = 5*n, iter = n*1e-5,
init.group = list(),
ind.storage = TRUE, countmode = 100,
plot = "none")
## S3 method for class 'mfpc'
summary(object, ...)
## S3 method for class 'summary.mfpc'
print(x, maxnc=30, ...)
## S3 method for class 'mfpc'
plot(x, dat, no, bw=FALSE, main=c("Representative FPC No. ",no),
xlab=NULL, ylab=NULL,
pch=NULL, col=NULL, ...)
## S3 method for class 'mfpc'
fpclusters(object, dat=NA, ca=object$ca, p=object$p, ...)
fpmi(dat, n = nrow(as.matrix(dat)), p = ncol(as.matrix(dat)),
gv, ca, ca2, method = "ml", plot,
maxit = 5*n, iter = n*1e-6)
Arguments
dat |
something that can be coerced to a
numerical matrix or vector. Data matrix, rows are points, columns
are variables.
|
n |
optional positive integer. Number of cases. |
p |
optional positive integer. Number of independent variables. |
method |
a string. |
cgen |
optional string. |
ca |
optional positive number. Tuning constant, specifying
required cluster
separation. By default determined as |
ca2 |
optional positive number. Second tuning constant needed if
|
calpha |
number between 0 and 1. See |
calpha2 |
number between 0 and 1, larger than |
pointit |
optional logical. If |
subset |
optional positive integer smaller or equal than |
nc1 |
optional positive integer. Tuning constant needed by
|
startn |
optional positive integer. Size of the initial configurations. The default value is chosen to prevent that small meaningless FPCs are found, but it should be decreased if clusters of size smaller than the default value are of interest. |
mnc |
optional positive integer. Minimum size of clusters to be reported. |
mer |
optional nonnegative number. FPCs (groups of them,
respectively, see details)
are only reported as stable if the ratio
of the number of their
findings to their number of points exceeds |
distcut |
optional value between 0 and 1. A similarity
measure between FPCs, given in Hennig (2002), and the corresponding
Single Linkage groups of FPCs with similarity larger
than |
maxit |
optional integer. Maximum number of iterations per algorithm run (usually an FPC is found much earlier). |
iter |
positive number. Algorithm stops when difference between
subsequent weight vectors is smaller than |
init.group |
optional list of logical vectors of length
|
ind.storage |
optional logical. If |
countmode |
optional positive integer. Every |
plot |
optional string. If |
object |
object of class |
x |
object of class |
maxnc |
positive integer. Maximum number of FPCs to be reported. |
no |
positive integer. Number of the representative FPC to be plotted. |
bw |
optional logical. If |
main |
plot title. |
xlab |
label for x-axis. If |
ylab |
label for y-axis. If |
pch |
plotting symbol, see |
col |
plotting color, see |
gv |
logical vector (or, with |
... |
additional parameters to be passed to |
Details
A (crisp) Mahalanobis FPC is a data subset
that reproduces itself under the following operation:
Compute mean and covariance matrix estimator for the data
subset, and compute all points of the dataset for which the squared
Mahalanobis distance is smaller than ca
.
Fixed points of this operation can be considered as clusters,
because they contain only
non-outliers (as defined by the above mentioned procedure) and all other
points are outliers w.r.t. the subset.
The current default is to compute fuzzy Mahalanobis FPCs, where the
points in the subset have a membership weight between 0 and 1 and give
rise to weighted means and covariance matrices.
The new weights are then obtained by computing the weight function
wfu
of the squared Mahalanobis distances, i.e.,
full weight for squared distances smaller than ca
, zero
weight for squared distances larger than ca2
and
decreasing weights (linear function of squared distances)
in between.
A fixed point algorithm is started from the whole dataset,
algorithms are started from the subsets specified in
init.group
, and further algorithms are started from further
initial configurations as explained under subset
and in the
function mahalconf
.
Usually some of the FPCs are unstable, and more than one FPC may
correspond to the same significant pattern in the data. Therefore the
number of FPCs is reduced: A similarity matrix is computed
between FPCs. Similarity between sets is defined as the ratio between
2 times size of
intersection and the sum of sizes of both sets. The Single Linkage
clusters (groups)
of level distcut
are computed, i.e. the connectivity
components of the graph where edges are drawn between FPCs with
similarity larger than distcut
. Groups of FPCs whose members
are found often enough (cf. parameter mer
) are considered as
stable enough. A representative FPC is
chosen for every Single Linkage cluster of FPCs according to the
maximum expectation ratio ser
. ser
is the ratio between
the number of findings of an FPC and the number of points
of an FPC, adjusted suitably if subset<n
.
Usually only the representative FPCs of stable groups
are of interest.
Default tuning constants are taken from Hennig (2005).
Generally, the default settings are recommended for
fixmahal
. For large datasets, the use of
init.group
together with pointit=FALSE
is useful. Occasionally, mnc
and startn
may be chosen
smaller than the default,
if smaller clusters are of interest, but this may lead to too many
clusters. Decrease of
ca
will often lead to too many clusters, even for homogeneous
data. Increase of ca
will produce only very strongly
separated clusters. Both may be of interest occasionally.
Singular covariance matrices during the iterations are handled by
solvecov
.
summary.mfpc
gives a summary about the representative FPCs of
stable groups.
plot.mfpc
is a plot method for the representative FPC of stable
group no. no
. It produces a scatterplot, where
the points belonging to the FPC are highlighted, the mean is and
for p<3
also the region of the FPC is shown. For p>=3
,
the optimal separating projection computed by batcoord
is shown.
fpclusters.mfpc
produces a list of indicator vectors for the
representative FPCs of stable groups.
fpmi
is called by fixmahal
for a single fixed point
algorithm and will usually not be executed alone.
Value
fixmahal
returns an object of class mfpc
. This is a list
containing the components nc, g, means, covs, nfound, er, tsc,
ncoll, skc, grto, imatrix, smatrix, stn, stfound, ser, sfpc, ssig,
sto, struc, n, p, method, cgen, ca, ca2, cvec, calpha, pointit,
subset, mnc, startn, mer, distcut
.
summary.mfpc
returns an object of class summary.mfpc
.
This is a list containing the components means, covs, stn,
stfound, sn, ser, tskip, skc, tsc, sim, ca, ca2, calpha, mer, method,
cgen, pointit
.
fpclusters.mfpc
returns a list of indicator vectors for the
representative FPCs of stable groups.
fpmi
returns a list with the components mg, covg, md,
gv, coll, method, ca
.
nc |
integer. Number of FPCs. |
g |
list of logical vectors. Indicator vectors of FPCs. |
means |
list of numerical vectors. Means of FPCs. In
|
covs |
list of numerical matrices. Covariance matrices of FPCs. In
|
nfound |
vector of integers. Number of findings for the FPCs. |
er |
numerical vector. Ratio of number of findings of FPCs to their
size. Under |
tsc |
integer. Number of algorithm runs leading to too small or too seldom found FPCs. |
ncoll |
integer. Number of algorithm runs where collinear covariance matrices occurred. |
skc |
integer. Number of skipped clusters. |
grto |
vector of integers. Numbers of FPCs to which algorithm
runs led, which were started by |
imatrix |
vector of integers. Size of intersection between
FPCs. See |
smatrix |
numerical vector. Similarities between
FPCs. See |
stn |
integer. Number of representative FPCs of stable groups.
In |
stfound |
vector of integers. Number of findings of members of
all groups of FPCs. In |
ser |
numerical vector. Ratio of number of findings of groups of
FPCs to their size. Under |
sfpc |
vector of integers. Numbers of representative FPCs of all groups. |
ssig |
vector of integers of length |
sto |
vector of integers. Numbers of groups ordered
according to largest |
struc |
vector of integers. Number of group an FPC belongs to. |
n |
see arguments. |
p |
see arguments. |
method |
see arguments. |
cgen |
see arguments. |
ca |
see arguments, if |
ca2 |
see arguments. |
cvec |
numerical vector of length |
calpha |
see arguments. |
pointit |
see arguments. |
subset |
see arguments. |
mnc |
see arguments. |
startn |
see arguments. |
mer |
see arguments. |
distcut |
see arguments. |
sn |
vector of integers. Number of points of representative FPCs. |
tskip |
integer. Number of algorithm runs leading to skipped FPCs. |
sim |
vector of integers. Size of intersections between
representative FPCs of stable groups. See |
mg |
mean vector. |
covg |
covariance matrix. |
md |
Mahalanobis distances. |
gv |
logical (numerical, respectively, if |
coll |
logical. |
Author(s)
Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/
References
Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.
Hennig, C. (2005) Fuzzy and Crisp Mahalanobis Fixed Point Clusters, in Baier, D., Decker, R., and Schmidt-Thieme, L. (eds.): Data Analysis and Decision Support. Springer, Heidelberg, 47-56.
Hennig, C. and Christlieb, N. (2002) Validating visual clusters in large datasets: Fixed point clusters of spectral features, Computational Statistics and Data Analysis 40, 723-739.
See Also
fixreg
for linear regression fixed point clusters.
mahalconf
, wfu
, cmahal
for computation of initial configurations, weights, tuning constants.
sseg
for indexing the similarity/intersection vectors
computed by fixmahal
.
batcoord
, cov.rob
, solvecov
,
cov.wml
, plotcluster
for computation of projections, (inverted)
covariance matrices, plotting.
rFace
for generation of example data, see below.
Examples
options(digits=2)
set.seed(20000)
face <- rFace(400,dMoNo=2,dNoEy=0, p=3)
# The first example uses grouping information via init.group.
initg <- list()
grface <- as.integer(attr(face,"grouping"))
for (i in 1:5) initg[[i]] <- (grface==i)
ff0 <- fixmahal(face, pointit=FALSE, init.group=initg)
summary(ff0)
cff0 <- fpclusters(ff0)
plot(face, col=1+cff0[[1]])
plot(face, col=1+cff0[[4]]) # Why does this come out as a cluster?
plot(ff0, face, 4) # A bit clearer...
# Without grouping information, examples need more time:
# ff1 <- fixmahal(face)
# summary(ff1)
# cff1 <- fpclusters(ff1)
# plot(face, col=1+cff1[[1]])
# plot(face, col=1+cff1[[6]]) # Why does this come out as a cluster?
# plot(ff1, face, 6) # A bit clearer...
# ff2 <- fixmahal(face,method="ml")
# summary(ff2)
# ff3 <- fixmahal(face,method="ml",calpha=0.95,subset=50)
# summary(ff3)
## ...fast, but lots of clusters. mer=0.3 may be useful here.
# set.seed(3000)
# face2 <- rFace(400,dMoNo=2,dNoEy=0)
# ff5 <- fixmahal(face2)
# summary(ff5)
## misses right eye of face data; with p=6,
## initial configurations are too large for 40 point clusters
# ff6 <- fixmahal(face2, startn=30)
# summary(ff6)
# cff6 <- fpclusters(ff6)
# plot(face2, col=1+cff6[[3]])
# plot(ff6, face2, 3)
# x <- c(1,2,3,6,6,7,8,120)
# ff8 <- fixmahal(x)
# summary(ff8)
# ...dataset a bit too small for the defaults...
# ff9 <- fixmahal(x, mnc=3, startn=3)
# summary(ff9)