diagram_ksvm {TDApplied} | R Documentation |
Fit a support vector machine model where each training set instance is a persistence diagram.
Description
Returns the output of kernlab's ksvm
function on the Gram matrix of the list of persistence diagrams
in a particular dimension.
Usage
diagram_ksvm(
diagrams,
cv = 1,
dim,
t = 1,
sigma = 1,
rho = NULL,
y,
type = NULL,
distance_matrices = NULL,
C = 1,
nu = 0.2,
epsilon = 0.1,
prob.model = FALSE,
class.weights = NULL,
fit = TRUE,
cache = 40,
tol = 0.001,
shrinking = TRUE,
num_workers = parallelly::availableCores(omit = 1)
)
Arguments
diagrams |
a list of persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/ |
cv |
a positive number at most the length of 'diagrams' which determines the number of cross validation splits to be performed (default 1, aka no cross-validation). If 'prob.model' is TRUE then cv is set to 1 since kernlab performs 3-fold CV internally in this case. When performing classification, classes are balanced within each cv fold. |
dim |
a non-negative integer vector of homological dimensions in which the model is to be fit. |
t |
either a vector of positive numbers representing the grid of values for the scale of the persistence Fisher kernel or NULL, default 1. If NULL then t is selected automatically, see details. |
sigma |
a vector of positive numbers representing the grid of values for the bandwidth of the Fisher information metric, default 1. |
rho |
an optional positive number representing the heuristic for Fisher information metric approximation, see |
y |
a response vector with one label for each persistence diagram. Must be either numeric or factor, but doesn't need to be supplied when 'type' is "one-svc". |
type |
a string representing the type of task to be performed. Can be any one of "C-svc","nu-svc","one-svc","eps-svr","nu-svr" - default for regression is "eps-svr" and for classification is "C-svc". See |
distance_matrices |
an optional list of precomputed Fisher distance matrices, corresponding to the rows in 'expand.grid(dim = dim,sigma = sigma)', default NULL. |
C |
a number representing the cost of constraint violations (default 1). This is the 'C'-constant of the regularization term in the Lagrange formulation. |
nu |
numeric parameter needed for nu-svc, one-svc and nu-svr. The 'nu' parameter sets an upper bound on the training error and a lower bound on the fraction of data points that become support vectors (default 0.2). |
epsilon |
epsilon in the insensitive-loss function used for eps-svr, nu-svr and eps-bsvm (default 0.1). |
prob.model |
if set to TRUE builds a model for calculating class probabilities or in case of regression, calculates the scaling parameter of the Laplacian distribution fitted on the residuals. Fitting is done on output data created by performing a 3-fold cross-validation on the training data. For details see references (default FALSE). |
class.weights |
a named vector of weights for the different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1). All components have to be named. |
fit |
indicates whether the fitted values should be computed and included in the model or not (default TRUE). |
cache |
cache memory in MB (default 40). |
tol |
tolerance of termination criteria (default 0.001). |
shrinking |
option whether to use the shrinking-heuristics (default TRUE). |
num_workers |
the number of cores used for parallel computation, default is one less than the number of cores on the machine. |
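The 'distance_matrices' argument is described as following the rows of 'expand.grid(dim = dim,sigma = sigma)'. A minimal sketch of that row order, using base R only, is given here. The grid values and the names 'dims', 'sigmas' and 'placeholder_matrices' are hypothetical and chosen for illustration; the placeholder list only shows how many entries are expected and in which order, not how the matrices are computed.

```r
# Hypothetical parameter grid (values chosen for illustration only).
dims <- c(0, 1)
sigmas <- c(0.1, 1)

# expand.grid varies its first argument fastest, so the rows are
# (dim = 0, sigma = 0.1), (dim = 1, sigma = 0.1),
# (dim = 0, sigma = 1),   (dim = 1, sigma = 1).
grid <- expand.grid(dim = dims, sigma = sigmas)
print(grid)

# One list slot per grid row, in the same order as the rows above.
# Each slot would hold the precomputed Fisher distance matrix for that row.
placeholder_matrices <- vector("list", nrow(grid))
names(placeholder_matrices) <- paste0("dim=", grid$dim, ",sigma=", grid$sigma)
```

Naming the list entries after the grid rows is optional, but it makes an ordering mistake easier to spot before the list is passed on.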
Details
Cross validation is carried out in parallel, using a trick
noted in doi: 10.1007/s41468-017-0008-7. Since the persistence Fisher kernel can be
written as k_{PF}(D_1,D_2)=exp(-t*d_{FIM}(D_1,D_2))=exp(-d_{FIM}(D_1,D_2))^t,
we can store the Fisher information metric distance matrix for each sigma value in the parameter grid
and reuse it for every value of t, which avoids recomputing distances.
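The reuse of a stored distance matrix across the 't' grid can be sketched in base R. This is an illustration and not the package's internal code. It assumes the kernel takes the form exp(-t*d_{FIM}), and it uses a toy Euclidean distance matrix in place of a real Fisher information metric distance matrix; all variable names are hypothetical.

```r
# Toy symmetric distance matrix standing in for a Fisher information metric
# distance matrix (hypothetical values, not computed from real diagrams).
set.seed(1)
pts <- matrix(runif(10), ncol = 2)
d_fim <- as.matrix(dist(pts))

# Compute exp(-d) once for this (stand-in) sigma value ...
base_kernel <- exp(-d_fim)

# ... then obtain the Gram matrix for every t in the grid by elementwise powers.
t_grid <- c(0.5, 1, 2)
gram_matrices <- lapply(t_grid, function(t) base_kernel^t)
names(gram_matrices) <- paste0("t=", t_grid)

# Compare with evaluating exp(-t * d) directly for each t.
direct <- lapply(t_grid, function(t) exp(-t * d_fim))
matches_direct <- all(mapply(function(a, b) isTRUE(all.equal(a, b)),
                             gram_matrices, direct))
```

Only the elementwise power depends on 't', so the expensive distance computation is done once per sigma value rather than once per (sigma, t) pair.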
Note that the response parameter 'y' must be a factor for classification -
a character vector, for instance, will throw an error. If 't' is NULL then the candidate values of 1/'t' are
the 1, 2, 5, 10, 20 and 50 percentiles of the upper triangle of the distance matrix of the training sample (per fold in the case of cross-validation).
This is the process suggested in the persistence Fisher kernel paper. If
any of these values would divide by 0 (which can happen if the training set is small) then the minimum non-zero element
is taken as the denominator instead (and hence the returned parameters may have duplicate rows except for differing error values). If
cross-validation is performed then the mean error across folds is still recorded, but the 't' value recorded in the cv results table
is the best one across all folds.
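The percentile rule for choosing 't' can be sketched in base R as follows. This is a hedged illustration of the description above, not the package's own implementation. The distance matrix is a toy Euclidean one, the variable names are hypothetical, and the quantiles use R's default interpolation, which may differ from what the package does.

```r
# Toy distance matrix for a training sample (hypothetical values).
set.seed(2)
pts <- matrix(runif(12), ncol = 2)
d_train <- as.matrix(dist(pts))

# Candidate values for 1/t: the 1, 2, 5, 10, 20 and 50 percentiles
# of the upper triangle of the distance matrix.
upper <- d_train[upper.tri(d_train)]
probs <- c(0.01, 0.02, 0.05, 0.10, 0.20, 0.50)
inv_t <- as.numeric(quantile(upper, probs = probs))

# Guard against division by zero: replace any zero percentile
# with the smallest non-zero distance.
min_nonzero <- min(upper[upper > 0])
inv_t[inv_t == 0] <- min_nonzero

# The resulting grid of t values.
t_candidates <- 1 / inv_t
```

When several percentiles fall back to the same minimum non-zero distance, the corresponding 't' values coincide, which is how duplicate parameter rows can arise.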
Value
a list of class 'diagram_ksvm' containing the elements
- cv_results
the cross-validation results - a matrix storing the parameters for each model in the tuning grid and its mean cross-validation error over all splits.
- best_model
a list containing the output of ksvm run on the whole dataset with the optimal model parameters found during cross-validation, as well as the optimal kernel parameters for the model.
- diagrams
the diagrams which were supplied in the function call.
Author(s)
Shael Brown - shaelebrown@gmail.com
References
Murphy, K. "Machine learning: a probabilistic perspective." MIT press (2012).
See Also
predict_diagram_ksvm
for predicting labels of new diagrams.
Examples
if(require("TDAstats"))
{
  # create four diagrams: two from a circle and two from a sphere
  D1 <- TDAstats::calculate_homology(TDAstats::circle2d[sample(1:100,20),],
                                     dim = 1,threshold = 2)
  D2 <- TDAstats::calculate_homology(TDAstats::circle2d[sample(1:100,20),],
                                     dim = 1,threshold = 2)
  D3 <- TDAstats::calculate_homology(TDAstats::sphere3d[sample(1:100,20),],
                                     dim = 1,threshold = 2)
  D4 <- TDAstats::calculate_homology(TDAstats::sphere3d[sample(1:100,20),],
                                     dim = 1,threshold = 2)
  g <- list(D1,D2,D3,D4)

  # create response vector (must be a factor for classification)
  y <- as.factor(c("circle","circle","sphere","sphere"))

  # fit model without cross validation
  model_svm <- diagram_ksvm(diagrams = g,cv = 1,dim = c(0),
                            y = y,sigma = c(1),t = c(1),
                            num_workers = 2)
}