bootstrap_persistence_thresholds {TDApplied}    R Documentation

Estimate persistence threshold(s) for topological features in a data set using bootstrapping.

Description

Bootstrapping is used to find a conservative estimate of a 1-'alpha' "confidence interval" around each point in the persistence diagram of the data set. Points whose intervals do not touch the diagonal (birth == death) are considered "significant" or "real". One threshold is computed for each dimension in the diagram.

Usage

bootstrap_persistence_thresholds(
  X,
  FUN_diag = "calculate_homology",
  FUN_boot = "calculate_homology",
  maxdim = 0,
  thresh,
  distance_mat = FALSE,
  ripser = NULL,
  ignore_infinite_cluster = TRUE,
  calculate_representatives = FALSE,
  num_samples = 30,
  alpha = 0.05,
  return_subsetted = FALSE,
  return_pvals = FALSE,
  return_diag = TRUE,
  num_workers = parallelly::availableCores(omit = 1),
  p_less_than_alpha = FALSE
)

Arguments

X

the input dataset, must either be a matrix or data frame.

FUN_diag

a string representing the persistent homology function to use for calculating the full persistence diagram, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.

FUN_boot

a string representing the persistent homology function to use for calculating the bootstrapped persistence diagrams, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.

maxdim

the integer maximum homological dimension for persistent homology, default 0.

thresh

the positive numeric maximum radius of the Vietoris-Rips filtration.

distance_mat

a boolean representing if 'X' is a distance matrix (TRUE) or not (FALSE, default).

ripser

the imported ripser module when 'FUN_diag' or 'FUN_boot' is 'PyH'.

ignore_infinite_cluster

a boolean indicating whether or not to ignore the infinitely lived cluster when 'FUN_diag' or 'FUN_boot' is 'PyH'.

calculate_representatives

a boolean representing whether to calculate representative (co)cycles, default FALSE. Note that representatives cannot be calculated when using the 'calculate_homology' function.

num_samples

the positive integer number of bootstrap samples, default 30.

alpha

the type-1 error threshold, default 0.05.

return_subsetted

a boolean representing whether or not to return the subsetted persistence diagram (with or without representatives), default FALSE.

return_pvals

a boolean representing whether or not to return p-values for features in the subsetted diagram, default FALSE.

return_diag

a boolean representing whether or not to return the calculated persistence diagram, default TRUE.

num_workers

the integer number of cores used for parallelizing (over bootstrap samples), default one less than the maximum number of cores on the machine.

p_less_than_alpha

a boolean representing whether or not to subset further and return only features whose p-values are strictly less than 'alpha', default 'FALSE'. Note that this is not part of the original bootstrap procedure.

Details

The thresholds are determined by calculating the 1-'alpha' quantile of the bottleneck distance values between the real persistence diagram and other diagrams obtained by bootstrap resampling the data.

Since 'ripsDiag' is the slowest homology engine but is the only engine which calculates representative cycles (as opposed to co-cycles with 'PyH'), two homology engines are input to this function: one to calculate the actual persistence diagram, 'FUN_diag' (possibly with representative (co)cycles), and one to calculate the bootstrap diagrams, 'FUN_boot' (this should be a faster engine, like 'calculate_homology' or 'PyH').

p-values can be calculated for any feature which survives the thresholding if both 'return_subsetted' and 'return_pvals' are 'TRUE'; however, these values may be larger than the original 'alpha' value in some cases. Note that the p-value calculation is not part of the original bootstrap procedure. If stricter thresholding is desired, or the p-values must be less than 'alpha', set 'p_less_than_alpha' to 'TRUE'. The minimum possible p-value is always 1/('num_samples' + 1).

Note that since 'calculate_homology' can ignore the longest-lived cluster, fewer "real" clusters may be found. To avoid this possibility, try setting 'FUN_diag' equal to 'ripsDiag'.

Please note that because the TDA package is no longer available on CRAN, if 'FUN_diag' or 'FUN_boot' is 'ripsDiag' then 'bootstrap_persistence_thresholds' will look for the ripsDiag function in the global environment, so the TDA package should be attached with 'library("TDA")' prior to use.
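
The quantile and p-value steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation. The bootstrap distances and the toy diagram are made-up numbers for a single dimension, and the names 'boot_dists', 't_alpha', 'toy_diag' and 'pvals' are hypothetical. The sketch assumes the "confidence interval" is a square of half-width t around each point, which avoids the diagonal exactly when death - birth > 2t. The p-value formula is an assumed form, chosen only because it is consistent with the stated minimum of 1/('num_samples' + 1).

```r
# Made-up bottleneck distances between the real diagram and each of
# 9 bootstrap diagrams (one homological dimension only).
boot_dists <- c(0.05, 0.08, 0.10, 0.12, 0.07, 0.09, 0.11, 0.06, 0.20)
alpha <- 0.05

# 1 - alpha quantile of the bootstrap distances (R's default quantile type;
# the package may use a different estimator).
t_alpha <- unname(quantile(boot_dists, probs = 1 - alpha))

# Made-up features (birth, death) in the same dimension.
toy_diag <- data.frame(birth = c(0, 0, 0.1), death = c(0.9, 0.15, 0.35))
persistence <- toy_diag$death - toy_diag$birth

# Assumption: a square of half-width t_alpha around (birth, death) stays
# clear of the diagonal exactly when persistence > 2 * t_alpha.
significant <- persistence > 2 * t_alpha

# Assumed p-value form: share of bootstrap samples whose doubled distance
# is at least the feature's persistence, with a +1 correction so the
# smallest attainable value is 1 / (num_samples + 1).
num_samples <- length(boot_dists)
pvals <- vapply(
  persistence,
  function(p) (sum(2 * boot_dists >= p) + 1) / (num_samples + 1),
  numeric(1)
)

print(data.frame(toy_diag, persistence, significant, pvals))
```

With only 9 bootstrap samples the smallest attainable p-value in this sketch is 1/10, so a feature can survive the threshold while its p-value still exceeds 'alpha' = 0.05. That is the situation the 'p_less_than_alpha' argument is meant to address.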

Value

either a numeric vector of threshold values, with one for each dimension 0..'maxdim' (in that order), or a list containing those thresholds and any additional elements requested via 'return_diag', 'return_subsetted' and 'return_pvals'.

Author(s)

Shael Brown - shaelebrown@gmail.com

References

Chazal F et al (2017). "Robust Topological Inference: Distance to a Measure and Kernel Distance." https://www.jmlr.org/papers/volume18/15-484/15-484.pdf.

Examples


if(require("TDAstats"))
{
  # create a persistence diagram from a sample of the unit circle
  df <- TDAstats::circle2d[sample(1:100,size = 50),]

  # calculate persistence thresholds for alpha = 0.05
  # and return the calculated diagram as well as the subsetted diagram
  bootstrapped_diagram <- bootstrap_persistence_thresholds(X = df,
    maxdim = 1, thresh = 2, return_subsetted = TRUE, num_workers = 2)
}

[Package TDApplied version 3.0.3 Index]