R: ~ Function: qualityCriterion ~

qualityCriterion {longitudinalData}

R Documentation

~ Function: qualityCriterion ~

Description

Given a LongData and a Partition, the function qualityCriterion calculates several quality criteria.

Usage

qualityCriterion(traj,clusters,imputationMethod="copyMean")

Arguments

`traj`	`[LongData]` or `[matrix]`: object containing the trajectories on which the criteria are calculated.
`clusters`	`[Partition]` or `[vector(integer)]`: clusters to which each individual belongs.
`imputationMethod`	`[character]`: if some values are missing in the `LongData`, they must be imputed. In that case, the function `qualityCriterion` calls the function `imputation` using the method given in `imputationMethod`.

Details

Given a LongData and a Partition (or a matrix and a vector of integers), the function qualityCriterion calculates several quality criteria and returns them as a list (see 'Value' below).

If some individuals have no cluster (i.e. if the Partition has some missing values), the corresponding trajectories are excluded from the calculation.

Note that if there is an empty cluster or an empty trajectory, most of the criteria are unavailable.

Basically, 6 non-parametric criteria are computed. In addition, ASSUMING THAT in each cluster C and for each time T the variable follows a NORMAL distribution (with the mean and standard deviation of the variable at time T restricted to cluster C), it is possible to compute the posterior probabilities of the individual trajectories and the likelihood. From there, we can also compute the BIC, the AIC and the global posterior probability. The function qualityCriterion also computes these criteria. But the user should always keep in mind that these criteria are valid ONLY under the hypothesis of normality. If this hypothesis does not hold, an algorithm like k-means will still converge, but the BIC and AIC will have no meaning.
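The normality assumption described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper name `normalPosteriorSketch` is hypothetical, and the sketch additionally assumes that values at different times are independent within a cluster and that cluster priors are the cluster proportions.

```r
# Illustrative sketch only (hypothetical helper, not the package's code).
# traj: numeric matrix (individuals x times); clusters: integer vector.
normalPosteriorSketch <- function(traj, clusters) {
  groups <- sort(unique(clusters))
  # Assumption: prior of each cluster = proportion of individuals in it.
  prior <- as.numeric(table(factor(clusters, levels = groups))) / length(clusters)

  # dens[i, c]: density of trajectory i under cluster c, assuming
  # independent normal values at each time with cluster-specific mean and sd.
  dens <- sapply(groups, function(g) {
    members <- traj[clusters == g, , drop = FALSE]
    mu <- colMeans(members)
    sigma <- apply(members, 2, sd)
    apply(traj, 1, function(x) prod(dnorm(x, mean = mu, sd = sigma)))
  })

  weighted <- sweep(dens, 2, prior, `*`)
  list(
    postProba = weighted / rowSums(weighted),    # posterior per individual
    logLikelihood = sum(log(rowSums(weighted)))  # log-likelihood of the mixture
  )
}

# Toy data: two well-separated groups of 10 trajectories over 4 times.
set.seed(2)
toy <- rbind(matrix(rnorm(40, mean = 0), nrow = 10),
             matrix(rnorm(40, mean = 6), nrow = 10))
res <- normalPosteriorSketch(toy, rep(1:2, each = 10))
round(res$postProba[1, ], 3)
```

Each row of `postProba` sums to 1. When the groups are far apart, as in this toy data, almost all of the posterior mass falls on the cluster the trajectory was assigned to.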

IMPORTANT NOTE: Some criteria should be maximized, others should be minimized. This might be confusing for the non-expert. To simplify the comparison of the criteria, qualityCriterion computes the OPPOSITE of the criteria that should be minimized (Ray & Turi, Davies & Bouldin, BIC and AIC). Thus, all the criteria computed by this function should be maximized.
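As a small illustration of this sign convention, the sketch below uses made-up index values (they are not produced by the package) and shows that once the index to be minimized is negated, a single "take the maximum" rule applies to every criterion.

```r
# Hypothetical raw index values for two candidate partitions (made-up numbers).
raw <- data.frame(
  partition     = c("A", "B"),
  calinski      = c(120.5, 80.2),  # higher is better
  daviesBouldin = c(0.45, 0.90),   # lower is better in its "true" form
  stringsAsFactors = FALSE
)

# Flip the sign of the index that should be minimized,
# so that every column follows "higher is better".
flipped <- raw
flipped$daviesBouldin <- -raw$daviesBouldin

# The same rule now selects the best partition for both criteria.
bestCalinski <- flipped$partition[which.max(flipped$calinski)]
bestDavies   <- flipped$partition[which.max(flipped$daviesBouldin)]
```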

Value

A list with three fields: the first is the list of the criteria; the second is the posterior probabilities of the clusters; the third is the matrix of the individual posterior probabilities.

Non-parametric criteria

Notations: k=number of clusters; n=number of individuals; B=Between variance; W=Within variance.

The criteria are:

Calinski.Harabatz: [numeric]: Calinski and Harabasz criterion: c(k)=Trace(B)/Trace(W)*(n-k)/(k-1).
Calinski.Harabatz2: [numeric]: Calinski and Harabasz criterion modified by Kryszczuk: c(k)=Trace(B)/Trace(W)*(n-1)/(n-k).
Calinski.Harabatz3: [numeric]: Calinski and Harabasz criterion modified by Genolini: g(k)=Trace(B)/Trace(W)*(n-k)/sqrt(k-1).
Ray.Turi: [numeric]: Ray and Turi criterion: r(k)=-Vintra/Vinter with Vintra=Sum(dist(x,center(x))) and Vinter=min(dist(center_i,center_j)^2). (The "true" index of Ray and Turi is Vintra/Vinter and should be minimized. See IMPORTANT NOTE above.)
Davies.Bouldin: [numeric]: Davies and Bouldin criterion: d(k)=-mean(Proximite(cluster_i,cluster_j)) with Proximite(i,j)=(DistInterne(i)+DistInterne(j))/(DistExterne(i,j)). (The "true" index of Davies and Bouldin is mean(Proximite()) and should be minimized. See IMPORTANT NOTE above.)
random: [numeric]: random value drawn from the normal distribution N(0,1).
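To make the first formula concrete, here is a minimal base R sketch of c(k)=Trace(B)/Trace(W)*(n-k)/(k-1). It is an illustration, not the package's code: the helper name `calinskiSketch` is hypothetical, and it assumes that Trace(B) and Trace(W) are the between-cluster and within-cluster sums of squares.

```r
# Illustrative helper (hypothetical name, not part of longitudinalData).
# traj: numeric matrix, one row per individual, one column per time point.
# clusters: integer vector giving the cluster of each row.
calinskiSketch <- function(traj, clusters) {
  n <- nrow(traj)
  k <- length(unique(clusters))
  grandMean <- colMeans(traj)

  traceW <- 0  # within-cluster sum of squares, taken here as Trace(W)
  traceB <- 0  # between-cluster sum of squares, taken here as Trace(B)
  for (g in unique(clusters)) {
    members <- traj[clusters == g, , drop = FALSE]
    centre <- colMeans(members)
    traceW <- traceW + sum(sweep(members, 2, centre)^2)
    traceB <- traceB + nrow(members) * sum((centre - grandMean)^2)
  }
  traceB / traceW * (n - k) / (k - 1)
}

set.seed(1)
# Two well-separated groups of 10 trajectories over 5 time points.
toy <- rbind(matrix(rnorm(50, mean = 0), nrow = 10),
             matrix(rnorm(50, mean = 5), nrow = 10))
good <- rep(1:2, each = 10)   # clusters match the two groups
bad  <- rep(1:2, times = 10)  # clusters mix the two groups

calinskiSketch(toy, good)
calinskiSketch(toy, bad)
```

The partition that matches the simulated groups gets a much higher value than the one that mixes them, which is the behaviour expected of a criterion to be maximized.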

Parametric criteria

All the parametric indices should be minimized. So the function qualityCriterion computes their opposite (see IMPORTANT NOTE above).

Notation: L=likelihood; h=number of parameters; n=number of trajectories; t=number of time measurements; N=total number of measurements (N=t.n).

SECOND IMPORTANT NOTE: the formulas of parametric criteria often include the size of the population. In the specific case of longitudinal data, the definition of the "size of the population" is not obvious. It can be either the number of individuals n, or the number of measurements N=n.t. So, the function qualityCriterion gives two versions of each parametric criterion that depends on the population size, the first using n, the second using N.

BIC: [numeric]: Bayesian Information Criterion: BIC=2*log(L)-h*log(n). See IMPORTANT NOTE above.
BIC2: [numeric]: Bayesian Information Criterion, bis: BIC2=2*log(L)-h*log(N). See IMPORTANT NOTE above.
AIC: [numeric]: Akaike Information Criterion: AIC=2*log(L)-2*h. See IMPORTANT NOTE above.
AICc: [numeric]: Akaike Information Criterion with correction: AICc=AIC+(2h(h+1))/(n-h-1). See IMPORTANT NOTE above.
AICc2: [numeric]: Akaike Information Criterion with correction, bis: AICc2=AIC+(2h(h+1))/(N-h-1). See IMPORTANT NOTE above.
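The effect of choosing n or N as the population size can be seen in a short numeric sketch of the BIC and AIC formulas above. All values are made up for illustration. The variable `nTime` stands for t.

```r
# Illustrative values (made up): log-likelihood, number of parameters,
# number of trajectories and number of time measurements.
logL  <- -1234.5
h     <- 20
n     <- 200
nTime <- 10
N     <- n * nTime  # total number of measurements

# Sign-flipped criteria as written above: higher is better.
BIC  <- 2 * logL - h * log(n)  # population size = number of trajectories
BIC2 <- 2 * logL - h * log(N)  # population size = number of measurements
AIC  <- 2 * logL - 2 * h       # does not depend on the population size

c(BIC = BIC, BIC2 = BIC2, AIC = AIC)
```

Because N is larger than n, BIC2 penalizes the number of parameters more heavily than BIC, so it is always the lower of the two for the same likelihood.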

Author

Christophe Genolini
1. UMR U1027, INSERM, Université Paul Sabatier / Toulouse III / France
2. CeRSM, EA 2931, UFR STAPS, Université de Paris Ouest-Nanterre-La Défense / Nanterre / France

References

[1] C. Genolini and B. Falissard
"KmL: k-means for longitudinal data"
Computational Statistics, vol 25(2), pp 317-328, 2010

[2] C. Genolini and B. Falissard
"KmL: A package to cluster longitudinal data"
Computer Methods and Programs in Biomedicine, 104, pp e112-121, 2011

Examples

##################
### Preparation of some artificial data
par(ask=TRUE)
data(artificialLongData)
ld <- longData(artificialLongData)


### Correct partition
part1 <- partition(rep(1:4,each=50))
plotTrajMeans(ld,part1)
(cr1 <- qualityCriterion(ld,part1))

### Random partition
part2 <- partition(floor(runif(200,1,5)))
plotTrajMeans(ld,part2)
(cr2 <- qualityCriterion(ld,part2))

### Partition with 3 clusters instead of 4
part3 <- partition(rep(c(1,2,3,3),each=50))
plotTrajMeans(ld,part3)
(cr3 <- qualityCriterion(ld,part3))


### Comparisons of the Partition
plot(c(cr1[[1]],cr2[[1]],cr3[[1]]),main="The highest gives the best partition
(according to the Calinski & Harabatz criterion)")
par(ask=FALSE)

[Package longitudinalData version 2.4.5.1 Index]