qualityCriterion {longitudinalData} | R Documentation |
~ Function: qualityCriterion ~
Description
Given a LongData and a Partition, the function qualityCriterion
calculates several quality criteria.
Usage
qualityCriterion(traj,clusters,imputationMethod="copyMean")
Arguments
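traj | A LongData object, or a matrix of trajectories. |
clusters | A Partition object, or a vector of integer giving the cluster of each individual. |
imputationMethod | Name of the method used to impute missing values in the trajectories (default "copyMean"); see imputation. |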
Details
Given a LongData and a Partition (or a matrix and a vector of integer),
the function qualityCriterion calculates several quality criteria and
returns them as a list (see 'Value' below).
If some individuals have no cluster (i.e. if the Partition has some
missing values), the corresponding trajectories are excluded from the
calculation.
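The exclusion rule can be illustrated in base R. This is only a sketch on hypothetical toy data (the names traj, clusters, keep, trajUsed and clustersUsed are made up for the illustration); it is not the package's internal code.

```r
# Hypothetical toy data: 5 trajectories measured at 3 times.
traj <- matrix(c(1, 2, 3,
                 2, 3, 4,
                 9, 8, 7,
                 8, 7, 6,
                 5, 5, 5), ncol = 3, byrow = TRUE)
clusters <- c(1L, 1L, 2L, 2L, NA)  # the last individual has no cluster

# Keep only the individuals that belong to a cluster.
keep <- !is.na(clusters)
trajUsed <- traj[keep, , drop = FALSE]
clustersUsed <- clusters[keep]
```

Any criterion would then be computed on trajUsed and clustersUsed only.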
Note that if there is an empty cluster or an empty trajectory, most of the criteria are unavailable.
Basically, six non-parametric criteria are computed.
In addition, ASSUMING THAT in each cluster C and at each time T
the variable follows a NORMAL LAW (with the mean and standard deviation of the
variable at time T restricted to cluster C), it is possible to compute the
posterior probabilities of the individual trajectories and the
likelihood. From there, we can also compute the BIC, the AIC and
the global posterior probability. The function qualityCriterion
also computes these criteria. But the user should always keep in mind
that these criteria are valid ONLY under the hypothesis of normality.
If this hypothesis is not respected, an algorithm like k-means will still
converge, but the BIC and AIC will have no meaning.
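To make the normality hypothesis concrete, here is a minimal base R sketch on simulated data. It assumes, in addition to the text above, that the times are treated as independent given the cluster and that the likelihood is the mixture likelihood weighted by the cluster proportions; the package may proceed differently. All object names are hypothetical.

```r
set.seed(1)
# Hypothetical data: 2 groups of 10 trajectories, 4 measurement times.
traj <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
              matrix(rnorm(40, mean = 5), ncol = 4))
clusters <- rep(1:2, each = 10)
k <- 2

# Cluster proportions, then mean and sd at each time within each cluster.
prop  <- as.numeric(table(clusters)) / length(clusters)
means <- t(sapply(1:k, function(C) colMeans(traj[clusters == C, , drop = FALSE])))
sds   <- t(sapply(1:k, function(C) apply(traj[clusters == C, , drop = FALSE], 2, sd)))

# Density of each trajectory under each cluster
# (assumption: times are independent given the cluster).
dens <- sapply(1:k, function(C) {
  apply(traj, 1, function(y) prod(dnorm(y, means[C, ], sds[C, ])))
})

# Posterior probabilities (one row per individual) and log-likelihood.
weighted  <- sweep(dens, 2, prop, `*`)
postProba <- weighted / rowSums(weighted)
logLik    <- sum(log(rowSums(weighted)))
```

Each row of postProba sums to 1; logLik is the kind of quantity from which BIC and AIC are then derived.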
IMPORTANT NOTE: Some criteria should be maximized, others should be
minimized. This might be confusing for the non-expert. In order to
simplify the comparison of the criteria, qualityCriterion
computes the OPPOSITE of the criteria that should be minimized (Ray & Turi, Davies & Bouldin, BIC and AIC). Thus,
all the criteria computed by this function should be maximized.
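A short illustration of this sign convention, using made-up index values for three candidate partitions (the numbers and names below are hypothetical, not output of qualityCriterion):

```r
# Hypothetical raw index values for three candidate partitions.
rayTuriRaw <- c(p1 = 0.8, p2 = 2.5, p3 = 1.1)  # "true" index: lower is better
calinski   <- c(p1 = 120, p2 = 15,  p3 = 60)   # higher is better

# Taking the opposite turns "lower is better" into "higher is better".
rayTuri <- -rayTuriRaw

# Both criteria can now be read the same way: pick the maximum.
bestByRayTuri  <- names(which.max(rayTuri))
bestByCalinski <- names(which.max(calinski))
```

With the opposite taken, which.max can be applied uniformly to every criterion.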
Value
A list with three fields: the first is the list of the criteria; the second is the posterior probabilities of the clusters; the third is the matrix of the individual posterior probabilities.
Non-parametric criteria
Notations: k=number of clusters; n=number of individuals; B=Between variance; W=Within variance. The criteria are:
- Calinski.Harabatz [numeric]: Calinski and Harabasz criterion: c(k)=Trace(B)/Trace(W)*(n-k)/(k-1).
- Calinski.Harabatz2 [numeric]: Calinski and Harabasz criterion modified by Kryszczuk: c(k)=Trace(B)/Trace(W)*(n-1)/(n-k).
- Calinski.Harabatz3 [numeric]: Calinski and Harabasz criterion modified by Genolini: g(k)=Trace(B)/Trace(W)*(n-k)/sqrt(k-1).
- Ray.Turi [numeric]: Ray and Turi criterion: r(k)=-Vintra/Vinter with Vintra=Sum(dist(x,center(x))) and Vinter=min(dist(center_i,center_j)^2). (The "true" index of Ray and Turi is Vintra/Vinter and should be minimized. See IMPORTANT NOTE above.)
- Davies.Bouldin [numeric]: Davies and Bouldin criterion: d(k)=-mean(Proximite(cluster_i,cluster_j)) with Proximite(i,j)=(DistInterne(i)+DistInterne(j))/(DistExterne(i,j)). (The "true" index of Davies and Bouldin is mean(Proximite()) and should be minimized. See IMPORTANT NOTE above.)
- random [numeric]: random value following the normal law N(0,1).
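The three Calinski and Harabasz variants differ only in the factor applied to Trace(B)/Trace(W). The base R sketch below computes them on simulated data, assuming the usual sums-of-squares definitions of Trace(B) and Trace(W) and cluster labels coded 1..k; the object names are hypothetical and the package's own computation may differ in detail.

```r
set.seed(42)
# Hypothetical data: 3 groups of 20 trajectories, 3 measurement times.
traj <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
              matrix(rnorm(60, mean = 4), ncol = 3),
              matrix(rnorm(60, mean = 8), ncol = 3))
clusters <- rep(1:3, each = 20)
n <- nrow(traj)
k <- length(unique(clusters))

grandMean <- colMeans(traj)
centers <- t(sapply(1:k, function(C) colMeans(traj[clusters == C, , drop = FALSE])))
sizes <- as.numeric(table(clusters))

# Trace(W): squared deviations of each trajectory to its own cluster center.
traceW <- sum((traj - centers[clusters, , drop = FALSE])^2)
# Trace(B): size-weighted squared distances from the centers to the grand mean.
traceB <- sum(sizes * rowSums(sweep(centers, 2, grandMean)^2))

ratio <- traceB / traceW
calinskiHarabatz  <- ratio * (n - k) / (k - 1)
calinskiHarabatz2 <- ratio * (n - 1) / (n - k)
calinskiHarabatz3 <- ratio * (n - k) / sqrt(k - 1)
```

With these definitions, traceB + traceW equals the total sum of squares around the grand mean, which gives a quick sanity check.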
Parametric criteria
All the parametric indices should be minimized, so the function
qualityCriterion computes their opposite (see IMPORTANT NOTE above).
Notation: L=likelihood; h=number of parameters; n=number of trajectories; t=number of time measurements; N=total number of measurements (N=t.n).
SECOND IMPORTANT NOTE: the formulas of parametric criteria often
include the size of the population. In the specific case of
longitudinal data, the definition of the "size of the population" is
not obvious. It can be either the number of individuals n, or the number of
measurements N=n.t. So, the function qualityCriterion gives
two versions of each parametric criterion that depends on this size, the first using n,
the second using N.
- BIC [numeric]: Bayesian Information Criterion: BIC=2*log(L)-h*log(n). See IMPORTANT NOTE above.
- BIC2 [numeric]: Bayesian Information Criterion, bis: BIC2=2*log(L)-h*log(N). See IMPORTANT NOTE above.
- AIC [numeric]: Akaike Information Criterion: AIC=2*log(L)-2*h. See IMPORTANT NOTE above.
- AICc [numeric]: Akaike Information Criterion with correction: AICc=AIC+(2h(h+1))/(n-h-1). See IMPORTANT NOTE above.
- AICc2 [numeric]: Akaike Information Criterion with correction, bis: AICc2=AIC+(2h(h+1))/(N-h-1). See IMPORTANT NOTE above.
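As a worked example of the formulas listed above, the following base R sketch evaluates them for made-up values of log(L), h, n and t. It reads the "bis" versions as the ones using N in place of n, which is an assumption drawn from the SECOND IMPORTANT NOTE; all values and names are hypothetical.

```r
logLik <- -350   # hypothetical log-likelihood, log(L)
h      <- 12     # hypothetical number of parameters
n      <- 200    # number of trajectories
nTimes <- 5      # number of time measurements (t in the notation above)
N      <- n * nTimes

# Formulas as listed in this section.
BIC   <- 2 * logLik - h * log(n)
BIC2  <- 2 * logLik - h * log(N)
AIC   <- 2 * logLik - 2 * h
AICc  <- AIC + (2 * h * (h + 1)) / (n - h - 1)
AICc2 <- AIC + (2 * h * (h + 1)) / (N - h - 1)

criteria <- c(BIC = BIC, BIC2 = BIC2, AIC = AIC, AICc = AICc, AICc2 = AICc2)
```

Since N is larger than n, BIC2 penalizes the number of parameters more heavily than BIC, while the correction term in AICc2 is smaller than in AICc.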
Author
Christophe Genolini
1. UMR U1027, INSERM, Université Paul Sabatier / Toulouse III / France
2. CeRSM, EA 2931, UFR STAPS, Université de Paris Ouest-Nanterre-La Défense / Nanterre / France
References
[1] C. Genolini and B. Falissard
"KmL: k-means for longitudinal data"
Computational Statistics, vol 25(2), pp 317-328, 2010
[2] C. Genolini and B. Falissard
"KmL: A package to cluster longitudinal data"
Computer Methods and Programs in Biomedicine, 104, pp e112-121, 2011
See Also
LongData, Partition, imputation.
Examples
##################
### Preparation of some artificial data
par(ask=TRUE)
data(artificialLongData)
ld <- longData(artificialLongData)
### Correct partition
part1 <- partition(rep(1:4,each=50))
plotTrajMeans(ld,part1)
(cr1 <- qualityCriterion(ld,part1))
### Random partition
part2 <- partition(floor(runif(200,1,5)))
plotTrajMeans(ld,part2)
(cr2 <- qualityCriterion(ld,part2))
### Partition with 3 clusters instead of 4
part3 <- partition(rep(c(1,2,3,3),each=50))
plotTrajMeans(ld,part3)
(cr3 <- qualityCriterion(ld,part3))
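### Comparison of the partitions on the Calinski & Harabatz criterion
### (the first field of the result holds the criteria)
plot(c(cr1[[1]][["Calinski.Harabatz"]],cr2[[1]][["Calinski.Harabatz"]],cr3[[1]][["Calinski.Harabatz"]]),
  main="The highest value gives the best partition\n(according to the Calinski & Harabatz criterion)")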
par(ask=FALSE)