Hidalgo {intRinsic} | R Documentation |
Fit the Hidalgo model
Description
The function fits the Heterogeneous intrinsic dimension algorithm (Hidalgo), developed in Allegra et al., 2020. The model is a Bayesian mixture of Pareto distributions with a modified likelihood that induces homogeneity across neighboring observations. The model can segment the observations into multiple clusters characterized by different intrinsic dimensions, which makes it possible to capture hidden patterns in the data. For more details on the algorithm, refer to Allegra et al., 2020. For an example of application to basketball data, see Santos-Fernandez et al., 2021.
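The Pareto building block mentioned in the description can be illustrated in base R. The sketch below assumes the construction described in Allegra et al., 2020, in which the ratio between each observation's second and first nearest-neighbor distances is modeled as Pareto with unit scale and shape equal to the intrinsic dimension. It is not the package's sampler: it fits a single homogeneous Pareto by maximum likelihood, and the names `nn_ratio` and `id_mle` are illustrative only.

```r
set.seed(1)

# 500 points drawn from a 5-dimensional Gaussian
X <- matrix(rnorm(500 * 5), ncol = 5)

# Pairwise Euclidean distances; exclude self-distances
D <- as.matrix(dist(X))
diag(D) <- Inf

# For each point, keep the two smallest distances (first and second neighbor)
r <- t(apply(D, 1, function(d) sort(d)[1:2]))

# Ratio of second to first nearest-neighbor distance, always >= 1
nn_ratio <- r[, 2] / r[, 1]

# Maximum likelihood estimate of the shape of a Pareto(1, d) distribution
id_mle <- length(nn_ratio) / sum(log(nn_ratio))
id_mle
```

Hidalgo replaces this single shape parameter with one per mixture component, so different regions of the data can have different intrinsic dimensions.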
Usage
Hidalgo(
  X = NULL,
  dist_mat = NULL,
  K = 10,
  nsim = 5000,
  burn_in = 5000,
  thinning = 1,
  verbose = TRUE,
  q = 3,
  xi = 0.75,
  alpha_Dirichlet = 0.05,
  a0_d = 1,
  b0_d = 1,
  prior_type = c("Conjugate", "Truncated", "Truncated_PointMass"),
  D = NULL,
  pi_mass = 0.5
)
## S3 method for class 'Hidalgo'
print(x, ...)
## S3 method for class 'Hidalgo'
plot(x, type = c("A", "B", "C"), class = NULL, ...)
## S3 method for class 'Hidalgo'
summary(object, ...)
## S3 method for class 'summary.Hidalgo'
print(x, ...)
Arguments
X |
data matrix with |
dist_mat |
distance matrix computed between the |
K |
integer, number of mixture components. |
nsim |
number of MCMC iterations to run. |
burn_in |
number of MCMC iterations to discard as burn-in period. |
thinning |
integer indicating the thinning interval. |
verbose |
logical, should the progress of the sampler be printed? |
q |
integer, first local homogeneity parameter. Default is 3. |
xi |
real number between 0 and 1, second local homogeneity parameter. Default is 0.75. |
alpha_Dirichlet |
parameter of the symmetric Dirichlet prior on the mixture weights. Default is 0.05, inducing a sparse mixture. Values that are too small (i.e., lower than 0.005) may cause underflow. |
a0_d |
shape parameter of the Gamma prior on |
b0_d |
rate parameter of the Gamma prior on |
prior_type |
character, type of Gamma prior on
|
D |
integer, the maximal dimension of the dataset. |
pi_mass |
probability placed a priori on |
x |
object of class |
... |
other arguments passed to specific methods. |
type |
character that indicates the type of plot that is requested. It can be:
|
class |
factor variable used to stratify observations according to their |
object |
object of class |
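To see why a small `alpha_Dirichlet` is described as inducing a sparse mixture, the following base R sketch draws weights from a symmetric Dirichlet prior with `K = 10` components and counts how many components receive non-negligible mass. The helper `rdirichlet_sym` and the 1% threshold are assumptions made for illustration and are not part of the package.

```r
# Draw n weight vectors from a symmetric Dirichlet(alpha, ..., alpha)
# by normalizing independent Gamma(alpha, 1) variables
rdirichlet_sym <- function(n, K, alpha) {
  g <- matrix(rgamma(n * K, shape = alpha, rate = 1), nrow = n)
  g / rowSums(g)
}

set.seed(42)
w_sparse <- rdirichlet_sym(1000, K = 10, alpha = 0.05)  # function default
w_dense  <- rdirichlet_sym(1000, K = 10, alpha = 5)     # for comparison

# Average number of components whose prior weight exceeds 1%
active_sparse <- mean(rowSums(w_sparse > 0.01))
active_dense  <- mean(rowSums(w_dense > 0.01))
c(sparse = active_sparse, dense = active_dense)
```

With a small concentration parameter most of the prior mass falls on a few components, so setting `K` larger than the expected number of clusters is a reasonable choice. The Gamma draws also become extremely small as `alpha` shrinks, which is consistent with the underflow warning for values below 0.005.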
Value
object of class Hidalgo, which is a list containing
cluster_prob
chains of the posterior mixture weights;
membership_labels
chains of the membership labels for all the observations;
id_raw
chains of the K intrinsic dimension parameters, one per mixture component;
id_postpr
a chain for each observation, corrected for label switching;
id_summary
a matrix containing, for each observation, the posterior mean and the 5%, 25%, 50%, 75%, and 95% quantiles;
recap
a list with the objects and specifications passed to the function used in the estimation.
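The elements listed above can be inspected directly on the fitted object. The sketch below is a minimal illustration, assuming the intRinsic package is installed; the exact dimensions and column layout of each element are not documented here, so the code only prints them rather than relying on a particular shape.

```r
# Run only if the package is available
if (requireNamespace("intRinsic", quietly = TRUE)) {
  set.seed(1234)
  X <- replicate(5, rnorm(500))
  X[1:250, 1:2] <- 0

  # Short run, for illustration only
  h_out <- intRinsic::Hidalgo(X, nsim = 500, burn_in = 500, verbose = FALSE)

  # Names of the list elements returned by the function
  print(names(h_out))

  # Observation-level posterior summaries of the intrinsic dimension
  print(head(h_out$id_summary))

  # Size of the chains of component-specific intrinsic dimensions
  print(dim(h_out$id_raw))

  # Settings stored alongside the results
  str(h_out$recap, max.level = 1)
}
```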
References
Allegra M, Facco E, Denti F, Laio A, Mira A (2020). “Data segmentation based on the local intrinsic dimension.” Scientific Reports, 10(1), 1–27. ISSN 20452322, doi:10.1038/s41598-020-72222-0.
Santos-Fernandez E, Denti F, Mengersen K, Mira A (2021). “The role of intrinsic dimension in high-resolution player tracking data – Insights in basketball.” Annals of Applied Statistics, forthcoming. ISSN 2331-8422, arXiv:2002.04148, doi:10.1038/s41598-022-20991-1.
See Also
id_by_class and clustering to understand how to further postprocess the results.
Examples
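set.seed(1234)
# 500 observations in 5 dimensions
X <- replicate(5, rnorm(500))
# zero out two coordinates of the first 250 rows, then shift those rows by 4
X[1:250, 1:2] <- 0
X[1:250, ] <- X[1:250, ] + 4
# true group labels: rows 1-250 and rows 251-500
oracle <- rep(1:2, rep(250, 2))
# this is just a short example
# increase the number of iterations to improve mixing and convergence
h_out <- Hidalgo(X, nsim = 500, burn_in = 500)
plot(h_out, type = "B")
id_by_class(h_out, oracle)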
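The `prior_type` and `D` arguments can be combined to bound the intrinsic dimension from above. The sketch below is an illustration under stated assumptions: it assumes the intRinsic package is installed and that setting `D = ncol(X)` is an acceptable upper bound for the truncated prior, which seems natural since the intrinsic dimension cannot exceed the number of columns. The object name `h_trunc` is arbitrary.

```r
# Run only if the package is available
if (requireNamespace("intRinsic", quietly = TRUE)) {
  set.seed(1234)
  X <- replicate(5, rnorm(500))
  X[1:250, 1:2] <- 0
  X[1:250, ] <- X[1:250, ] + 4

  # Truncated prior on the intrinsic dimensions, bounded by the ambient dimension
  h_trunc <- intRinsic::Hidalgo(
    X,
    nsim = 500,
    burn_in = 500,
    verbose = FALSE,
    prior_type = "Truncated",
    D = ncol(X)
  )

  # Use the S3 summary method listed in the usage section
  print(summary(h_trunc))
}
```

As with the main example, the chain here is kept short for speed; longer runs are advisable before interpreting the results.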