bootstrap_persistence_thresholds {TDApplied} | R Documentation |
Estimate persistence threshold(s) for topological features in a data set using bootstrapping.
Description
Bootstrapping is used to find a conservative estimate of a 100(1-'alpha') percent "confidence interval" around each point in the persistence diagram of the data set. Points whose intervals do not touch the diagonal (birth == death) are considered "significant" or "real". One threshold is computed for each dimension in the diagram.
Usage
bootstrap_persistence_thresholds(
X,
FUN_diag = "calculate_homology",
FUN_boot = "calculate_homology",
maxdim = 0,
thresh,
distance_mat = FALSE,
ripser = NULL,
ignore_infinite_cluster = TRUE,
calculate_representatives = FALSE,
num_samples = 30,
alpha = 0.05,
return_subsetted = FALSE,
return_pvals = FALSE,
return_diag = TRUE,
num_workers = parallelly::availableCores(omit = 1),
p_less_than_alpha = FALSE
)
Arguments
X |
the input dataset, must either be a matrix or data frame. |
FUN_diag |
a string representing the persistent homology function to use for calculating the full persistence diagram, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'. |
FUN_boot |
a string representing the persistent homology function to use for calculating the bootstrapped persistence diagrams, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'. |
maxdim |
the integer maximum homological dimension for persistent homology, default 0. |
thresh |
the positive numeric maximum radius of the Vietoris-Rips filtration. |
distance_mat |
a boolean representing if 'X' is a distance matrix (TRUE) or not (FALSE, default). |
ripser |
the imported ripser module when 'FUN_diag' or 'FUN_boot' is 'PyH'. |
ignore_infinite_cluster |
a boolean indicating whether or not to ignore the infinitely lived cluster when 'FUN_diag' or 'FUN_boot' is 'PyH'. |
calculate_representatives |
a boolean representing whether to calculate representative (co)cycles, default FALSE. Note that representatives cannot be calculated when using the 'calculate_homology' function. |
num_samples |
the positive integer number of bootstrap samples, default 30. |
alpha |
the type-1 error threshold, default 0.05. |
return_subsetted |
a boolean representing whether or not to return the subsetted persistence diagram (with or without representatives), default FALSE. |
return_pvals |
a boolean representing whether or not to return p-values for features in the subsetted diagram, default FALSE. |
return_diag |
a boolean representing whether or not to return the calculated persistence diagram, default TRUE. |
num_workers |
the integer number of cores used for parallelizing (over bootstrap samples), default one less than the maximum number of cores on the machine. |
p_less_than_alpha |
a boolean representing whether or not to subset further and return only features whose p-values are strictly less than 'alpha', default 'FALSE'. Note that this is not part of the original bootstrap procedure. |
Details
The thresholds are determined by calculating the 1-'alpha' quantile of the bottleneck
distance values between the real persistence diagram and other diagrams obtained
by bootstrap resampling the data. Since 'ripsDiag' is the slowest homology engine but is the
only engine which calculates representative cycles (as opposed to co-cycles with 'PyH'), two
homology engines are input to this function: one to calculate the actual persistence diagram, 'FUN_diag'
(possibly with representative (co)cycles), and one to calculate the bootstrap diagrams, 'FUN_boot' (this should be
a faster engine, like 'calculate_homology' or 'PyH').
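The thresholding step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: 'boot_dists' is simulated and stands in for the bottleneck distances between the real diagram and each bootstrap diagram, and 'diag_df' is a made-up diagram. The cutoff '2 * c_alpha' follows from geometry (the L-infinity distance from a point to the diagonal is half its persistence); whether the function reports 'c_alpha' or a multiple of it as its threshold is not stated here.

```r
set.seed(1)
num_samples <- 30
alpha <- 0.05

# ASSUMPTION: simulated stand-ins for the bottleneck distances between the
# real diagram and each of the num_samples bootstrap diagrams (one dimension)
boot_dists <- runif(num_samples, min = 0, max = 0.2)

# the 1 - alpha quantile of the bootstrap distances
c_alpha <- stats::quantile(boot_dists, probs = 1 - alpha)[[1]]

# ASSUMPTION: a made-up persistence diagram for a single dimension
diag_df <- data.frame(dimension = c(1, 1, 1),
                      birth = c(0.10, 0.20, 0.05),
                      death = c(0.15, 1.50, 0.30))

# a box of half-width c_alpha around (birth, death) avoids the diagonal
# exactly when death - birth > 2 * c_alpha
persistence <- diag_df$death - diag_df$birth
significant <- persistence > 2 * c_alpha

# keep only the features that pass the cutoff
diag_df[significant, ]
```

In this sketch a larger 'alpha' gives a smaller quantile and therefore a less conservative cutoff.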
p-values can be calculated for any feature which survives the thresholding if both 'return_subsetted' and 'return_pvals' are 'TRUE';
however, these values may be larger than the original 'alpha' value in some cases. Note that this is not part of the original bootstrap procedure.
If stricter thresholding is desired,
or the p-values must be less than 'alpha', set 'p_less_than_alpha' to 'TRUE'. The minimum
possible p-value is always 1/('num_samples' + 1).
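One common estimator whose minimum is 1/('num_samples' + 1) is the add-one bootstrap p-value sketched below. This is an assumption for illustration only: the exact formula used by the function is not given here, and 'bootstrap_pval', 'boot_values' and 'observed' are hypothetical names.

```r
# ASSUMPTION: hypothetical helper, not part of TDApplied
bootstrap_pval <- function(observed, boot_values) {
  # count bootstrap values at least as large as the observed statistic,
  # adding one to numerator and denominator so the result is never zero
  (1 + sum(boot_values >= observed)) / (length(boot_values) + 1)
}

num_samples <- 30
# made-up bootstrap values: 0.01, 0.02, ..., 0.30
boot_values <- seq(0.01, 0.30, length.out = num_samples)

# no bootstrap value is as large as the observed one: the minimum p-value
p_min <- bootstrap_pval(observed = 1, boot_values = boot_values)

# several bootstrap values exceed the observed one: a larger p-value
p_mid <- bootstrap_pval(observed = 0.2, boot_values = boot_values)
```

Because the minimum is 1/('num_samples' + 1), with 'alpha' = 0.05 at least 20 bootstrap samples are needed before any p-value can be strictly less than 'alpha'.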
Note that since 'calculate_homology'
can ignore the longest-lived cluster, fewer "real" clusters may be found. To avoid this possibility
try setting 'FUN_diag' equal to 'ripsDiag'. Please note that due to the TDA package no longer being available on CRAN,
if 'FUN_diag' or 'FUN_boot' is 'ripsDiag' then 'bootstrap_persistence_thresholds' will look for the ripsDiag function in the global environment,
so the TDA package should be attached with 'library("TDA")' prior to use.
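A quick way to confirm that a 'ripsDiag' function is visible before choosing that engine is sketched below with base R's 'exists'. The variable name 'rips_available' and the message text are illustrative choices, and this is only a pre-flight check run from the top level, not the lookup the package itself performs.

```r
# check whether a function named ripsDiag can be found starting from the
# current (top-level) environment and its search path
rips_available <- exists("ripsDiag", mode = "function")

if (!rips_available) {
  message("ripsDiag not found: attach TDA with library(\"TDA\") first, ",
          "or use 'calculate_homology' or 'PyH' instead")
}
```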
Value
either a numeric vector of threshold values, with one for each dimension 0..'maxdim' (in that order), or a list containing those thresholds together with any requested elements (the calculated diagram, the subsetted diagram and the p-values).
Author(s)
Shael Brown - shaelebrown@gmail.com
References
Chazal F et al (2017). "Robust Topological Inference: Distance to a Measure and Kernel Distance." https://www.jmlr.org/papers/volume18/15-484/15-484.pdf.
Examples
if(require("TDAstats"))
{
# create a persistence diagram from a sample of the unit circle
df <- TDAstats::circle2d[sample(1:100,size = 50),]
# calculate persistence thresholds for alpha = 0.05
# and return the calculated diagram along with the thresholds
# (return_diag is TRUE by default, return_subsetted is FALSE)
bootstrapped_diagram <- bootstrap_persistence_thresholds(X = df,
maxdim = 1,thresh = 2,num_workers = 2)
}