R: Fit the 'Hidalgo' model

Hidalgo {intRinsic}

R Documentation

Fit the `Hidalgo` model

Description

The function fits the Heterogeneous intrinsic dimension algorithm (Hidalgo), developed in Allegra et al., 2020. The model is a Bayesian mixture of Pareto distributions with a modified likelihood that induces homogeneity across neighboring observations. It can segment the observations into multiple clusters characterized by different intrinsic dimensions, which makes it possible to capture hidden patterns in the data. For more details on the algorithm, refer to Allegra et al., 2020. For an example of application to basketball data, see Santos-Fernandez et al., 2021.
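As a rough illustration of the Pareto building block (not the package's implementation), the sketch below assumes the TWO-NN setup in which the ratio between each point's second and first nearest-neighbor distances follows a Pareto(1, `d`) law. Under that assumption, a Gamma(`a0`, `b0`) prior on `d` yields a Gamma posterior. It uses base R only, fits a single component rather than a mixture, and omits the homogeneity-inducing modification of the likelihood. All variable names are illustrative.

```r
set.seed(1)
n      <- 500
d_true <- 3
X      <- matrix(rnorm(n * d_true), n, d_true)  # Gaussian cloud of dimension d_true

# distances from each point to its first and second nearest neighbors
Dm       <- as.matrix(dist(X))
diag(Dm) <- Inf                                  # exclude self-distances
nn       <- t(apply(Dm, 1, function(r) sort(r)[1:2]))
mu       <- nn[, 2] / nn[, 1]                    # ratios, all >= 1

# Pareto(1, d) likelihood: prod_i d * mu_i^(-(d + 1))
# with a Gamma(a0, b0) prior, the posterior is Gamma(a0 + n, b0 + sum(log(mu)))
a0 <- 1
b0 <- 1
a_post    <- a0 + n
b_post    <- b0 + sum(log(mu))
post_mean <- a_post / b_post
post_ci   <- qgamma(c(0.05, 0.95), shape = a_post, rate = b_post)
post_mean
post_ci
```

The posterior mean should land near the dimension used to simulate the data. Hidalgo extends this single-`d` setting to `K` components, each with its own `d`.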

Usage

Hidalgo(
  X = NULL,
  dist_mat = NULL,
  K = 10,
  nsim = 5000,
  burn_in = 5000,
  thinning = 1,
  verbose = TRUE,
  q = 3,
  xi = 0.75,
  alpha_Dirichlet = 0.05,
  a0_d = 1,
  b0_d = 1,
  prior_type = c("Conjugate", "Truncated", "Truncated_PointMass"),
  D = NULL,
  pi_mass = 0.5
)

## S3 method for class 'Hidalgo'
print(x, ...)

## S3 method for class 'Hidalgo'
plot(x, type = c("A", "B", "C"), class = NULL, ...)

## S3 method for class 'Hidalgo'
summary(object, ...)

## S3 method for class 'summary.Hidalgo'
print(x, ...)

Arguments

`X`	data matrix with `n` observations and `D` variables.
`dist_mat`	distance matrix computed between the `n` observations.
`K`	integer, number of mixture components.
`nsim`	number of MCMC iterations to run.
`burn_in`	number of MCMC iterations to discard as burn-in period.
`thinning`	integer indicating the thinning interval.
`verbose`	logical, should the progress of the sampler be printed?
`q`	integer, first local homogeneity parameter. Default is 3.
`xi`	real number between 0 and 1, second local homogeneity parameter. Default is 0.75.
`alpha_Dirichlet`	parameter of the symmetric Dirichlet prior on the mixture weights. Default is 0.05, inducing a sparse mixture. Values that are too small (i.e., lower than 0.005) may cause underflow.
`a0_d`	shape parameter of the Gamma prior on `d`.
`b0_d`	rate parameter of the Gamma prior on `d`.
`prior_type`	character, type of Gamma prior on `d`. It can be: `"Conjugate"`, a conjugate Gamma distribution is elicited; `"Truncated"`, the conjugate Gamma prior is truncated over the interval `(0,D)`; `"Truncated_PointMass"`, same as `"Truncated"`, but a point mass is placed on `D` to allow the `id` to be identically equal to the nominal dimension.
`D`	integer, the maximal dimension of the dataset.
`pi_mass`	probability placed a priori on `D` when `Truncated_PointMass` is chosen.
`x`	object of class `Hidalgo`, the output of the `Hidalgo()` function.
`...`	other arguments passed to specific methods.
`type`	character that indicates the type of plot that is requested. It can be: `"A"`, plot the MCMC chains and their ergodic means, NOT corrected for label switching; `"B"`, plot the posterior mean and median of the `id` for each observation, after the chains are processed for label switching; `"C"`, plot the estimated `id` distributions stratified by the groups specified in the `class` vector.
`class`	factor variable used to stratify the observations' `id` estimates.
`object`	object of class `Hidalgo`, the output of the `Hidalgo()` function.
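To make the three `prior_type` options more concrete, here is a minimal base-R sketch of drawing `d` from each prior, assuming the truncation is to `(0, D)` and the point mass on `D` has probability `pi_mass`, as described above. The helper `draw_prior_d()` is hypothetical and is not part of `intRinsic`; the package's sampler may be implemented differently.

```r
# Hypothetical helper (not an intRinsic function): draw n values of d
# from the prior selected by prior_type
draw_prior_d <- function(n,
                         prior_type = c("Conjugate", "Truncated", "Truncated_PointMass"),
                         a0_d = 1, b0_d = 1, D = NULL, pi_mass = 0.5) {
  prior_type <- match.arg(prior_type)
  if (prior_type == "Conjugate") {
    return(rgamma(n, shape = a0_d, rate = b0_d))
  }
  stopifnot(!is.null(D), D > 0)
  # inverse-CDF sampling from Gamma(a0_d, b0_d) restricted to (0, D)
  u <- runif(n, 0, pgamma(D, shape = a0_d, rate = b0_d))
  d <- qgamma(u, shape = a0_d, rate = b0_d)
  if (prior_type == "Truncated_PointMass") {
    at_D    <- runif(n) < pi_mass   # with probability pi_mass, set d exactly to D
    d[at_D] <- D
  }
  d
}

set.seed(42)
d_tr <- draw_prior_d(1000, "Truncated", D = 5)
d_pm <- draw_prior_d(1000, "Truncated_PointMass", D = 5, pi_mass = 0.5)
range(d_tr)       # stays within (0, 5)
mean(d_pm == 5)   # share of draws sitting exactly on D, close to pi_mass
```

The truncated options need `D`, which is why the sketch stops when it is `NULL`.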

Value

object of class `Hidalgo`, which is a list containing:

`cluster_prob`: chains of the posterior mixture weights;
`membership_labels`: chains of the membership labels for all the observations;
`id_raw`: chains of the `K` intrinsic dimension parameters, one per mixture component;
`id_postpr`: a chain for each observation, corrected for label switching;
`id_summary`: a matrix containing, for each observation, the posterior mean and the 5%, 25%, 50%, 75%, and 95% quantiles;
`recap`: a list with the objects and specifications passed to the function used in the estimation.
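As an illustration of how a per-observation summary like `id_summary` can be derived from per-observation chains, the sketch below works on simulated draws. The layout assumed here (iterations in rows, observations in columns) and the objects `chains` and `id_summary_sketch` are hypothetical stand-ins, not the package's actual output.

```r
set.seed(7)
n_iter <- 2000

# simulated chains: three observations with id around 2, three with id around 5
chains <- cbind(matrix(rnorm(n_iter * 3, mean = 2, sd = 0.2), n_iter, 3),
                matrix(rnorm(n_iter * 3, mean = 5, sd = 0.3), n_iter, 3))

# one row per observation: posterior mean followed by the five quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
id_summary_sketch <- cbind(
  mean = colMeans(chains),
  t(apply(chains, 2, quantile, probs = probs))
)
round(id_summary_sketch, 2)
```

Each row can then be compared across observations, for example to see which ones share a similar `id`.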

References

Allegra M, Facco E, Denti F, Laio A, Mira A (2020). “Data segmentation based on the local intrinsic dimension.” Scientific Reports, 10(1), 1–27. ISSN 2045-2322, doi:10.1038/s41598-020-72222-0.

Santos-Fernandez E, Denti F, Mengersen K, Mira A (2021). “The role of intrinsic dimension in high-resolution player tracking data – Insights in basketball.” Annals of Applied Statistics, forthcoming. ISSN 2331-8422, arXiv:2002.04148, doi:10.1038/s41598-022-20991-1.

Examples


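library(intRinsic)

set.seed(1234)
# 500 observations in 5 dimensions
X             <- replicate(5, rnorm(500))
# the first 250 observations vary along only 3 coordinates (the first two are constant)
X[1:250, 1:2] <- 0
X[1:250, ]    <- X[1:250, ] + 4
# true group labels, used to stratify the id estimates
oracle        <- rep(1:2, rep(250, 2))
# this is just a short example
# increase the number of iterations to improve mixing and convergence
h_out         <- Hidalgo(X, nsim = 500, burn_in = 500)
# posterior mean and median of the id for each observation
plot(h_out, type = "B")
id_by_class(h_out, oracle)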

[Package intRinsic version 1.0.2 Index]
