hclustcompro_select_alpha {SPARTAAS}    R Documentation

Estimate the optimal value(s) of the \alpha parameter.

Description

The criterion described in Details balances the weights of D_1 and D_2 in the final clustering. The estimated \alpha is only a point estimate; the confidence interval gives a range of plausible values.

Based on a resampling process, we generate clones and recompute the criterion as a function of \alpha (see Details).

Usage

hclustcompro_select_alpha(
    D1,
    D2,
    acc=2,
    resampling=TRUE,
    method="ward.D2",
    iter=5,
    suppl_plot=TRUE
)

Arguments

D1

First dissimilarity matrix or contingency table (square matrix).

D2

Second dissimilarity matrix or network data (square matrix) of the same size as D1.

acc

Number of digits after the decimal point for the alpha value.

resampling

Logical. If TRUE, the confidence interval is estimated with a resampling strategy. If you have a lot of data, you can save computation time by setting this option to FALSE.

method

The agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).

iter

The number of clones checked per observation. Approximate run times with iter=1: about 30 sec for 200 observations, about 40 min for 1000 observations.

suppl_plot

Logical defining whether additional plots should be displayed.

Details

Definition of the criterion:

The value of \alpha \in [0;1] must be chosen so as to balance the weights of the two sources of information in the final classification. To obtain \alpha, we define the following criterion:

CorCrit_\alpha = |Cor(dist_{cophenetic},D_1) - Cor(dist_{cophenetic},D_2)|

Equation (1)

The CorCrit_\alpha criterion in (1) is the absolute difference between two cophenetic correlations. A cophenetic correlation is the correlation between two distance matrices, computed by treating the half distance matrices as vectors; it measures how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. The first correlation compares D_1 with the ultrametric distances of the clustering obtained for a fixed \alpha, while the second compares D_2 with the ultrametric distances of the same clustering. To compromise between the information provided by D_1 and D_2, we estimate \alpha by \hat{\alpha} such that:

\hat{\alpha} = \arg\min_{\alpha \in [0;1]} CorCrit_\alpha

Equation (2)
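As a rough illustration of equations (1) and (2), the base R sketch below evaluates the criterion on a grid of \alpha values and keeps the value that minimises it. It is not the package's implementation. The helper names cor_crit and select_alpha_grid are hypothetical, the inputs are assumed to be "dist" objects of the same size, and the mixing rule (1 - alpha) * D1 + alpha * D2 is an assumption made for this example only, since the combination formula is not given in this section.

```r
# Hypothetical helper: CorCrit for one value of alpha (equation (1)).
# ASSUMPTION: the two sources are mixed as (1 - alpha) * D1 + alpha * D2.
cor_crit <- function(alpha, D1, D2, method = "ward.D2") {
  D_alpha <- (1 - alpha) * D1 + alpha * D2      # arithmetic keeps the "dist" class
  tree <- hclust(D_alpha, method = method)
  coph <- as.vector(cophenetic(tree))           # ultrametric distances as a vector
  abs(cor(coph, as.vector(D1)) - cor(coph, as.vector(D2)))
}

# Hypothetical helper: grid search for the minimiser (equation (2)).
# acc plays the same role as in the Usage section: digits after the decimal point.
select_alpha_grid <- function(D1, D2, acc = 2, method = "ward.D2") {
  alphas <- seq(0, 1, by = 10^(-acc))
  crit <- vapply(alphas, cor_crit, numeric(1),
                 D1 = D1, D2 = D2, method = method)
  list(alpha = alphas[which.min(crit)], alphas = alphas, crit = crit)
}

# Toy data: two unrelated sources of information on 20 observations
set.seed(1)
D1 <- dist(matrix(rnorm(40), nrow = 20))
D2 <- dist(matrix(rnorm(40), nrow = 20))

res <- select_alpha_grid(D1, D2)
res$alpha                                # grid value with the smallest criterion
plot(res$alphas, res$crit, type = "l")   # curve of the criterion over [0;1]
```

If several grid values tie for the minimum, which.min returns the first one; the package may resolve ties differently.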

Resampling strategy:

This is done by creating a set of "clones" for each observation i. A clone c of observation i is a copy of observation i for which the distances from the second source have been modified. The modification is made by copying the distances for the second source from another observation j. A clustering is then performed using the combination defined in (1) with D_1^{(c)} an (n+1)\times(n+1) matrix where observations i and c are identical and D_2^{(c)} an (n+1)\times(n+1) matrix where the clone c of i has different distances from those of i. A set of clones is generated by varying j for all observations except i. We can generate a set of n-1 clones for each element i in n, so n(n-1) clones in total.
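The construction of a single clone can be sketched in base R as follows. This is only an illustration under stated assumptions: make_clone is a hypothetical helper, both inputs are assumed to be "dist" objects, and the clone's second-source distance to j is set to 0 as a direct consequence of copying row j.

```r
# Hypothetical helper: append a clone c of observation i whose second-source
# distances are borrowed from observation j (j != i).
make_clone <- function(D1, D2, i, j) {
  M1 <- as.matrix(D1)
  M2 <- as.matrix(D2)
  n <- nrow(M1)
  stopifnot(i != j, nrow(M2) == n)

  # First source: the clone is an exact copy of i,
  # so d1(c, k) = d1(i, k) for all k and d1(c, i) = 0.
  C1 <- rbind(cbind(M1, M1[, i]), c(M1[i, ], 0))

  # Second source: the clone takes the distances of j,
  # so d2(c, k) = d2(j, k) for all k (hence d2(c, j) = 0 in this sketch).
  C2 <- rbind(cbind(M2, M2[, j]), c(M2[j, ], 0))

  list(D1c = as.dist(C1), D2c = as.dist(C2))   # both are (n+1) x (n+1)
}

set.seed(1)
D1 <- dist(matrix(rnorm(10), nrow = 5))
D2 <- dist(matrix(rnorm(10), nrow = 5))

cl <- make_clone(D1, D2, i = 2, j = 4)
attr(cl$D1c, "Size")   # 6, i.e. n + 1
```

Looping j over every observation other than i gives the n-1 clones of i described above; the iter argument limits how many clones are checked per observation.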

Intuitively, by varying \alpha between 0 and 1, we will be able to identify when the clone and the original observation are separated on the dendrogram. This moment will correspond to the value of alpha above which the weight given to the information about the connection between observations contained in D_2 has too much influence on the results compared to that of D_1.

Let CorCrit_\alpha^{(c)} define the same criterion as in (1), where D_1 and D_2 are replaced by D_1^{(c)} and D_2^{(c)} respectively. The estimated \alpha is the mean of the estimated values for each clone.
For each clone c:

\hat{\alpha}^{(c)} = \arg\min_{\alpha \in [0;1]} CorCrit_\alpha^{(c)}

Equation (3)

\hat{\alpha}^* is the mean of the \hat{\alpha}^{(c)}. In the same spirit as confidence intervals based on bootstrap percentiles (Efron & Tibshirani, 1993), a percentile confidence interval based on replication is also obtained using the empirical percentiles of the distribution of \hat{\alpha}^{(c)}.

\hat{\alpha}^* = \frac{1}{n(n-1)} \sum_{c=1}^{n(n-1)} \hat{\alpha}^{(c)}

Equation (4)

c \in [1; n(n-1)]
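Once the per-clone estimates are available, the summary in equation (4) and a percentile interval can be sketched in base R as below. The helper name summarise_alpha is hypothetical, the input values are made up for illustration, and the 95% level is an assumption: this page does not state which percentiles the function uses.

```r
# Hypothetical helper: summarise the per-clone estimates alpha_hat^(c).
# ASSUMPTION: a 95% percentile interval; the level is not given in this page.
summarise_alpha <- function(alpha_hat, level = 0.95) {
  probs <- c((1 - level) / 2, 1 - (1 - level) / 2)
  list(
    alpha_star = mean(alpha_hat),                       # equation (4)
    sd = sd(alpha_hat),                                 # spread of the estimates
    conf = unname(quantile(alpha_hat, probs = probs))   # empirical percentiles
  )
}

# Made-up per-clone estimates, for illustration only
alpha_hat <- c(0.62, 0.70, 0.66, 0.71, 0.68, 0.64, 0.69, 0.73)

s <- summarise_alpha(alpha_hat)
s$alpha_star
s$conf
```

Because the interval is built from the clone estimates while the point estimate comes from the original data, nothing forces the point estimate to fall inside the interval.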

Warnings:

It is possible to observe an \alpha value outside the confidence interval. In some cases, this problem can be solved by increasing the number of iterations or by changing the number of axes used to construct the matrix D1 after the correspondence analysis. If the \alpha value remains outside the interval, it means that the data are noisy and the resampling procedure is affected.

Value

The function returns a list (class: selectAlpha_obj).

alpha

The estimated value of the alpha parameter (the value that minimises CorCrit_alpha)

alpha.plot

The CorCrit curve for alpha between 0 and 1

If resampling = TRUE

sd

The standard deviation of the alpha estimates obtained by resampling

conf

The confidence interval of alpha

boxplot

The boxplot of alpha estimation with resampling

values

All potential alpha values obtained from each clone

Author(s)

A. COULON

L. BELLANGER

P. HUSI

Examples

library(SPARTAAS)
data(datangkor)

#network stratigraphic data (Network)
network <- datangkor$stratigraphy

#contingency table
cont <- datangkor$contingency

dissimilarity <- CAdist(cont,nPC="max",graph=FALSE)
constraint <- adjacency(network)

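#default settings: alpha rounded to 2 decimals, resampling with 5 clones per observation
hclustcompro_select_alpha(D1 = dissimilarity, D2 = constraint)
#alpha rounded to 3 decimals (resampling = TRUE is already the default)
hclustcompro_select_alpha(D1 = dissimilarity, D2 = constraint, acc = 3, resampling = TRUE)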


[Package SPARTAAS version 1.2.1 Index]