hclustcompro_select_alpha {SPARTAAS} | R Documentation |
Estimate the optimal value(s) of the $\alpha$ parameter.
Description
The following criterion "balances" the weight of $D_1$ and $D_2$ in the final clustering. The $\alpha$ value is only a point estimate, but the confidence interval gives a range of possible values.
Based on a resampling process, we generate clones and recalculate the criterion according to $\alpha$ (see below).
Usage
hclustcompro_select_alpha(
D1,
D2,
acc=2,
resampling=TRUE,
method="ward.D2",
iter=5,
suppl_plot=TRUE
)
Arguments
D1 |
First dissimilarity matrix or contingency table (square matrix). |
D2 |
Second dissimilarity matrix or network data (square matrix) of the same size as D1. |
acc |
Number of digits after the decimal point for the alpha value. |
resampling |
Logical; if TRUE, the confidence interval is estimated with a resampling strategy. If you have a lot of data, you can save computation time by setting this option to FALSE. |
method |
The agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
iter |
The number of clones checked per observation. Indicative run times with iter=1: about 30 seconds for 200 observations, about 40 minutes for 1000 observations. |
suppl_plot |
Logical indicating whether additional plots should be displayed. |
Details
Definition of the criterion:
A criterion for choosing $\alpha \in [0;1]$ must be determined by balancing the weights between the two sources of information in the final classification. To obtain $\alpha$, we define the following criterion:

$$\mathrm{CorCrit}_\alpha = \left| \mathrm{Cor}\left(D^{coph}_\alpha, D_1\right) - \mathrm{Cor}\left(D^{coph}_\alpha, D_2\right) \right| \qquad (1)$$

where $D^{coph}_\alpha$ denotes the ultrametric (cophenetic) distances of the clustering obtained with $\alpha$ fixed.
The $\mathrm{CorCrit}_\alpha$ criterion in (1) is the absolute difference between two cophenetic correlations. A cophenetic correlation is the correlation between two distance matrices, calculated by treating the half distance matrices as vectors; it measures how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. The first correlation compares $D_1$ with the ultrametric distances of the clustering with $\alpha$ fixed, while the second compares $D_2$ with the ultrametric distances of the clustering with $\alpha$ fixed. Then, in order to compromise between the information provided by $D_1$ and $D_2$, we estimate $\alpha$ with $\hat{\alpha}$ such that:

$$\hat{\alpha} = \arg\min_{\alpha} \mathrm{CorCrit}_\alpha$$
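As a rough illustration of the criterion and its minimization, the sketch below evaluates CorCrit on a grid of alpha values using only base R (hclust, cophenetic, cor) and simulated data. It is not the SPARTAAS implementation. The mixing rule alpha * D1 + (1 - alpha) * D2, the rescaling of both matrices to [0, 1], the grid step 10^(-acc), and the helper name cor_crit are all assumptions made for the example.

```r
# Sketch only: base R (stats) and simulated data, not the SPARTAAS internals.
set.seed(1)
n   <- 30
xy  <- matrix(rnorm(n * 2), ncol = 2)   # first source: 2-D coordinates
pos <- sort(runif(n))                   # second source: positions on a line

# Square dissimilarity matrices, both rescaled to [0, 1] (assumption)
D1 <- as.matrix(dist(xy));  D1 <- D1 / max(D1)
D2 <- as.matrix(dist(pos)); D2 <- D2 / max(D2)

# Hypothetical helper: CorCrit for one alpha.
# Assumed mixing rule: D_alpha = alpha * D1 + (1 - alpha) * D2
cor_crit <- function(alpha, D1, D2, method = "ward.D2") {
  D_alpha <- alpha * D1 + (1 - alpha) * D2
  tree    <- hclust(as.dist(D_alpha), method = method)
  coph    <- cophenetic(tree)           # ultrametric distances of the tree
  abs(cor(coph, as.dist(D1)) - cor(coph, as.dist(D2)))
}

acc    <- 2
alphas <- seq(0, 1, by = 10^(-acc))     # grid resolution tied to acc (assumption)
crit   <- vapply(alphas, cor_crit, numeric(1), D1 = D1, D2 = D2)

alpha_hat <- alphas[which.min(crit)]    # grid value minimizing CorCrit
alpha_hat
```

Plotting crit against alphas gives a curve comparable in spirit to the alpha.plot element described under Value.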
Resampling strategy:
This is done by creating a set of "clones" for each observation $i$. A clone $c$ of observation $i$ is a copy of observation $i$ for which the distances from the second source have been modified. The modification is made by copying the distances for the second source from another observation $j$. A clustering is then performed using the combination defined in (1), with $D_1^{(c)}$ an $(n+1) \times (n+1)$ matrix in which observations $i$ and $c$ are identical, and $D_2^{(c)}$ an $(n+1) \times (n+1)$ matrix in which the clone $c$ of $i$ has different distances from those of $i$. A set of clones is generated by varying $j$ over all observations except $i$. We can generate a set of $n-1$ clones for each of the $n$ observations $i$, so $n(n-1)$ clones in total.
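The clone construction described above can be sketched in base R as follows. The helper make_clone is hypothetical and not part of SPARTAAS; it simply appends one row and column to each matrix, copying the first-source distances of i and the second-source distances of j. Drawing iter donors at random for each observation is an assumption about how the iter argument limits the number of clones.

```r
# Hypothetical helper (not a SPARTAAS function): build the (n+1) x (n+1)
# matrices for a clone of observation i whose second-source distances
# are copied from observation j.
make_clone <- function(D1, D2, i, j) {
  # First source: the clone inherits row/column i, so d(i, clone) = 0
  D1c <- rbind(cbind(D1, D1[, i]), c(D1[i, ], 0))
  # Second source: the clone inherits row/column j, so d(j, clone) = 0
  D2c <- rbind(cbind(D2, D2[, j]), c(D2[j, ], 0))
  list(D1 = D1c, D2 = D2c)
}

set.seed(2)
n  <- 5
D1 <- as.matrix(dist(matrix(rnorm(n * 2), ncol = 2)))
D2 <- as.matrix(dist(runif(n)))

# All admissible (i, j) pairs: n * (n - 1) clones in total
pairs <- subset(expand.grid(i = seq_len(n), j = seq_len(n)), i != j)
nrow(pairs)

# Assumed use of iter: draw iter donors j at random for a given i
i      <- 1
iter   <- 2
donors <- sample(setdiff(seq_len(n), i), iter)
clones <- lapply(donors, function(j) make_clone(D1, D2, i, j))
dim(clones[[1]]$D1)
```

Each element of clones can then be passed through the same criterion as the original pair of matrices.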
Intuitively, by varying $\alpha$ between 0 and 1, we will be able to identify when the clone and the original observation are separated on the dendrogram. This moment will correspond to the value of $\alpha$ above which the weight given to the information about the connection between observations contained in $D_2$ has too much influence on the results compared to that of $D_1$.
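Let $\mathrm{CorCrit}_\alpha^{(c)}$ denote the same criterion as in (1), where $D_1$ and $D_2$ are replaced by $D_1^{(c)}$ and $D_2^{(c)}$ respectively. The estimated $\alpha$ is the mean of the estimated values for each clone.
For each clone $c$:

$$\hat{\alpha}^{(c)} = \arg\min_{\alpha} \mathrm{CorCrit}_\alpha^{(c)}$$

$\hat{\alpha}^{*}$ is the mean of the $\hat{\alpha}^{(c)}$. In the same spirit as confidence intervals based on bootstrap percentiles (Efron & Tibshirani, 1993), a percentile confidence interval based on replication is also obtained using the empirical percentiles of the distribution of $\hat{\alpha}^{(c)}$.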
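A minimal sketch of the final aggregation step, assuming the per-clone estimates are already available as a numeric vector. The values below are simulated purely for illustration, and the 95% level is an assumption, since the text does not state which percentiles are used.

```r
# Illustrative values only: stand-ins for the per-clone alpha estimates
set.seed(3)
alpha_c <- round(rbeta(200, 8, 5), 2)

alpha_star <- mean(alpha_c)                        # mean of the clone estimates
conf <- quantile(alpha_c, probs = c(0.025, 0.975)) # percentile interval (95% assumed)
spread <- sd(alpha_c)                              # dispersion of the estimates

alpha_star
conf
```

A boxplot of alpha_c would show the same distribution that the percentile interval summarizes.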
Warnings:
It is possible to observe an $\alpha$ value outside the confidence interval. In some cases, this problem can be solved by increasing the number of iterations or by changing the number of axes used to construct the matrix D1 after the correspondence analysis. If the $\alpha$ value remains outside the interval, it means that the data are noisy and the resampling procedure is affected.
Value
The function returns a list (class: selectAlpha_obj).
alpha |
The estimated value of the alpha parameter (min CorCrit_alpha) |
alpha.plot |
The CorCrit curve for alpha between 0 and 1 |
If resampling = TRUE, the list also contains:
sd |
The standard deviation of the alpha values obtained from the clones |
conf |
The confidence interval of alpha |
boxplot |
The boxplot of alpha estimation with resampling |
values |
All potential alpha values obtained from each clone |
Author(s)
A. COULON
L. BELLANGER
P. HUSI
Examples
library(SPARTAAS)
data(datangkor)
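# Stratigraphic network data
network <- datangkor$stratigraphy
# Contingency table
cont <- datangkor$contingency
# D1: distances based on the correspondence analysis of the contingency table
dissimilarity <- CAdist(cont, nPC = "max", graph = FALSE)
# D2: dissimilarity matrix built from the stratigraphic network
constraint <- adjacency(network)
# Default settings (acc = 2, resampling = TRUE)
hclustcompro_select_alpha(D1 = dissimilarity, D2 = constraint)
# Alpha reported to three decimal places
hclustcompro_select_alpha(D1 = dissimilarity, D2 = constraint, acc = 3, resampling = TRUE)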