predict.isolation_forest {isotree}    R Documentation

Predict method for Isolation Forest

Description

Predict method for Isolation Forest

Usage

## S3 method for class 'isolation_forest'
predict(
  object,
  newdata,
  type = "score",
  square_mat = ifelse(type == "kernel", TRUE, FALSE),
  refdata = NULL,
  use_reference_points = TRUE,
  nthreads = object$nthreads,
  ...
)

Arguments

object

An Isolation Forest object as returned by isolation.forest.

newdata

A 'data.frame', 'data.table', 'tibble', 'matrix', or sparse matrix (from package 'Matrix' or 'SparseM', CSC/dgCMatrix supported for outlierness, distance, kernels; CSR/dgRMatrix supported for outlierness and imputations) for which to predict outlierness, distance, kernels, or imputations of missing values.

If 'newdata' is sparse and one wants to obtain the outlier score or average depth or tree numbers, it's highly recommended to pass it in CSC ('dgCMatrix') format as it will be much faster when the number of trees or rows is large.

type

Type of prediction to output. Options are:

  • '"score"' for the standardized outlier score. For isolation-based metrics (the default), values closer to 1 indicate more outlierness, values closer to 0.5 indicate average outlierness, and values closer to 0 indicate more averageness (harder to isolate). For all scoring metrics, higher values indicate more outlierness.

  • '"avg_depth"' for the non-standardized average isolation depth or density or log-density. For 'scoring_metric="density"', will output the geometric mean instead. See the documentation for 'scoring_metric' for more details about the calculations for density-based metrics. For all scoring metrics, higher values indicate less outlierness.

  • '"dist"' for approximate pairwise or between-points distances (must pass more than 1 row) - these are standardized in the same way as outlierness, values closer to zero indicate nearer points, closer to one further away points, and closer to 0.5 average distance. To make this computation faster, it is highly recommended to build a node indexer with isotree.build.indexer (with 'with_distances=TRUE') before calling this function.

  • '"avg_sep"' for the non-standardized average separation depth. To make this computation faster, it is highly recommended to build a node indexer with isotree.build.indexer (with 'with_distances=TRUE') before calling this function.

  • '"kernel"' for pairwise or between-points isolation kernel calculations (also known as proximity matrix), which denotes the fraction of trees in which two observations end up in the same terminal node. This is typically not as good quality as the separation distance, but it's much faster to calculate, and has other potential uses - for example, this "kernel" can be used as an estimate of the correlations between residuals for a generalized least-squares regression, for which distance might not be as appropriate. Note that building an indexer will not speed up kernel/proximity calculations unless it has reference points. This calculation can be sped up significantly by setting reference points in the model object through isotree.set.reference.points, and it's highly recommended to do so if this calculation is going to be performed repeatedly.

  • '"kernel_raw"' for the isolation kernel or proximity matrix, but having as output the number of trees instead of the fraction of total trees.

  • '"tree_num"' for the terminal node number for each tree - if choosing this option, will return a list containing both the average isolation depth and the terminal node numbers, under entries 'avg_depth' and 'tree_num', respectively. If this calculation is going to be performed frequently, it's recommended to build node indices through isotree.build.indexer.

  • '"tree_depths"' for the non-standardized isolation depth or expected isolation depth or density or log-density for each tree (note that they will not include range penalties from 'penalize_range=TRUE'). See the documentation for 'scoring_metric' for more details about the calculations for density-based metrics.

  • '"impute"' for imputation of missing values in 'newdata'.
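As a rough illustration of the two most common output types, the sketch below fits a small model and compares '"score"' against '"avg_depth"'. The synthetic data, the appended far-away row, and the parameter values are assumptions made only for this example.

```r
library(isotree)

set.seed(1)
# Hypothetical toy data: 200 Gaussian points plus one far-away row
X <- rbind(matrix(rnorm(400), ncol = 2), c(8, 8))

model <- isolation.forest(X, ntrees = 100, nthreads = 1)

scores <- predict(model, X, type = "score")      # higher means more outlying
depths <- predict(model, X, type = "avg_depth")  # higher means less outlying

# The two outputs should rank observations in opposite order
cor(scores, depths, method = "spearman")
```

The appended row would be expected to receive one of the highest scores, which can be checked with 'which.max(scores)'.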

square_mat

When passing 'type' = '"dist"' or '"avg_sep"' or '"kernel"' or '"kernel_raw"' with no 'refdata', whether to return a full square matrix (returned as a numeric 'matrix' object) or just its upper-triangular part (returned as a 'dist' object and compatible with functions such as 'hclust'), in which the entry for pair (i,j) with 1 <= i < j <= n is located at position p(i, j) = ((i - 1) * (n - i/2) + j - i).

Ignored when not predicting distance/separation/kernels or when passing 'refdata' or 'use_reference_points=TRUE' plus having reference points.
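The position formula can be checked against any 'dist' object in base R. The sketch below uses Euclidean distances from 'stats::dist' purely as a stand-in for the output of 'predict'; the helper name 'dist_index' is hypothetical.

```r
# Position of pair (i, j), with 1 <= i < j <= n, inside a 'dist' vector
dist_index <- function(i, j, n) (i - 1) * (n - i / 2) + j - i

set.seed(1)
pts <- matrix(rnorm(10), ncol = 2)  # 5 hypothetical points
n <- nrow(pts)

d <- dist(pts)          # compact triangular storage
full <- as.matrix(d)    # full square matrix

# Both lookups refer to the same pair of points
compact_val <- d[dist_index(2, 4, n)]
square_val <- full[2, 4]
```

Because the stored values are symmetric, 'full[4, 2]' gives the same number.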

refdata

If passing this and calculating distances or average separation depths or kernels, will calculate distances between each point in 'newdata' and each point in 'refdata', outputting a matrix in which points in 'newdata' correspond to rows and points in 'refdata' correspond to columns. Must be of the same type as 'newdata' (e.g. 'data.frame', 'matrix', 'dgCMatrix', etc.). If this is not passed, and type is '"dist"' or '"avg_sep"' or '"kernel"' or '"kernel_raw"', will calculate pairwise distances/separation between the points in 'newdata'.

Note that, if 'refdata' is passed and the model object has an indexer with reference points added (through isotree.set.reference.points), those reference points will be ignored for the calculation.
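A minimal sketch of the difference between pairwise output and output against 'refdata' might look as follows. The data and the row selections are hypothetical, and the shapes in the comments follow from the description of rows and columns given for this argument.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(600), ncol = 3)  # hypothetical data, 200 rows
model <- isolation.forest(X, ntrees = 50, nthreads = 1)

new_pts <- X[1:10, ]
ref_pts <- X[101:105, ]

# Pairwise distances among the rows of 'new_pts' (10 x 10 matrix)
d_pair <- predict(model, new_pts, type = "dist", square_mat = TRUE)

# Distances from each row of 'new_pts' (rows) to each row of 'ref_pts' (columns)
d_ref <- predict(model, new_pts, type = "dist", refdata = ref_pts)
```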

use_reference_points

When the model object has an indexer with reference points (which can be added through isotree.set.reference.points) and passing 'type="dist"' or '"avg_sep"' or '"kernel"' or '"kernel_raw"', whether to calculate the distances/kernels from 'newdata' to those reference points instead of the pairwise distances between points in 'newdata'.

This is ignored when passing 'refdata' or when the model object does not contain an indexer or the indexer does not contain reference points.
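The following sketch shows one way reference points might be used for kernel calculations. The data is hypothetical, 'missing_action = "fail"' is chosen here only because the toy data contains no missing values, and re-assigning the result of isotree.set.reference.points assumes that the function returns the model object.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(600), ncol = 3)  # hypothetical data without missing values
model <- isolation.forest(
  X, ntrees = 50, ndim = 1,
  missing_action = "fail",  # toy data has no NAs
  nthreads = 1
)

# Store five rows as reference points inside the model object
ref_pts <- X[101:105, ]
model <- isotree.set.reference.points(model, ref_pts)

new_pts <- X[1:10, ]

# Kernel between 'new_pts' (rows) and the stored reference points (columns)
k_ref <- predict(model, new_pts, type = "kernel", use_reference_points = TRUE)

# Pairwise kernel among 'new_pts', ignoring the stored reference points
k_pair <- predict(model, new_pts, type = "kernel", use_reference_points = FALSE)
```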

nthreads

Number of parallel threads to use. Note: for better performance, it's recommended to set the number of threads to the number of physical CPU cores, which in a typical desktop CPU, corresponds to half the number of threads (see details for more information).

Shorthand for best performance: 'nthreads = RhpcBLASctl::get_num_cores()'

...

Not used.

Details

The standardized outlier score for isolation-based metrics is calculated according to the original paper's formula: 2^{ - \frac{\bar{d}}{c(n)} } , where \bar{d} is the average depth under each tree at which an observation becomes isolated (a remainder is extrapolated if the actual terminal node is not isolated), and c(n) is the expected isolation depth if observations were uniformly random (see references under isolation.forest for details). The actual calculation of c(n) however differs from the paper as this package uses more exact procedures for calculation of harmonic numbers.
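As a worked illustration of the formula, the base R sketch below uses the expression for c(n) from the original paper, c(n) = 2 H(n-1) - 2 (n-1)/n with H(k) the k-th harmonic number, computed here by direct summation. This is an assumption-laden sketch of the formula only, not the package's internal code, and the function names are hypothetical.

```r
# k-th harmonic number by direct summation
harmonic <- function(k) sum(1 / seq_len(k))

# Expected isolation depth c(n) as given in the original paper
expected_depth <- function(n) 2 * harmonic(n - 1) - 2 * (n - 1) / n

# Standardized score 2^(-avg_depth / c(n))
std_score <- function(avg_depth, n) 2^(-avg_depth / expected_depth(n))

n <- 256
c_n <- expected_depth(n)

# Depth 0, depth equal to c(n), and depth equal to twice c(n)
std_score(c(0, c_n, 2 * c_n), n)  # 1.00 0.50 0.25
```

An observation whose average depth equals c(n) gets a score of exactly 0.5, which is why scores are described as centered around that value.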

For density-based metrics, see the documentation for 'scoring_metric' in isolation.forest for details about the score calculations.

The distribution of outlier scores for isolation-based metrics should be centered around 0.5, unless using non-random splits (parameters 'prob_pick_avg_gain', 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens') and/or range penalizations, or having distributions which are too skewed. For 'scoring_metric="density"', most of the values should be negative, and while zero can be used as a natural score threshold, the scores are unlikely to be centered around zero.

The more threads that are set for the model, the higher the memory requirement will be, as each thread will allocate an array with one entry per row (outlierness) or combination (distance), with an exception being calculation of distances/kernels to reference points, which does not do this.

For multi-threaded predictions on many rows, it is recommended to set the number of threads to the number of physical cores of the CPU rather than the number of logical cores, as it will typically have better performance that way. Assuming a typical x86-64 desktop CPU, this typically involves dividing the number of threads by 2 - for example: 'model$nthreads <- RhpcBLASctl::get_num_cores()'
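If 'RhpcBLASctl' is not available, a similar number can sometimes be obtained from the 'parallel' package that ships with R. This is a swapped-in alternative rather than the recommendation above, and whether the 'logical' argument is honoured depends on the platform.

```r
# Logical cores (hardware threads) versus physical cores, as reported by base R
n_logical <- parallel::detectCores(logical = TRUE)
n_physical <- parallel::detectCores(logical = FALSE)

# detectCores() can return NA, so fall back to a single thread in that case
n_use <- if (is.na(n_physical)) 1L else n_physical

# The value could then be assigned to a fitted model, e.g.:
# model$nthreads <- n_use
```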

Outlierness predictions for sparse data will be much slower than for dense data. It is not recommended to pass sparse matrices unless the data is too big to fit in memory in dense form.

Note that after loading a serialized object from 'isolation.forest' through 'readRDS' or 'load', if it was constructed with 'lazy_serialization=FALSE' it will only de-serialize the underlying C++ object upon running 'predict', 'print', or 'summary', so the first run will be slower, while subsequent runs will be faster as the C++ object will already be in-memory. This does not apply when using 'lazy_serialization=TRUE'.

In order to save memory when fitting and serializing models, the functionality for outputting terminal node numbers will generate index mappings on the fly for all tree nodes, even if passing only 1 row, so it's only recommended for batch predictions. If this type of prediction is desired, it can be sped up by building an index of terminal nodes through isotree.build.indexer, which will avoid having to recompute these every time.

The outlier scores/depth predict functionality is optimized for making predictions on one or a few rows at a time - for making large batches of predictions, it might be faster to use the option 'output_score=TRUE' in 'isolation.forest'.
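A sketch of the batch alternative is shown below. It assumes that, with 'output_score=TRUE', isolation.forest returns a list holding the fitted model under 'model' and the scores for the training rows under 'scores'; the data is hypothetical.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(2000), ncol = 4)  # hypothetical data, 500 rows

# Fit the model and score the training rows in a single call
res <- isolation.forest(X, ntrees = 100, nthreads = 1, output_score = TRUE)

batch_scores <- res$scores  # one score per row of X
model <- res$model          # fitted model, usable with predict() later

# Later predictions on one or a few rows go through predict() as usual
few_scores <- predict(model, X[1:3, , drop = FALSE])
```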

When making predictions on CSC matrices with many rows using multiple threads, there can be small differences between runs due to roundoff error.

When imputing missing values, the input may contain new columns (i.e. not present when the model was fitted), which will be output as-is.

If passing 'type="dist"' or 'type="avg_sep"', by default, it will do the calculation through a procedure that counts steps as observations are passed down the trees, which is especially slow and not recommended for more than a few thousand observations. If this calculation is going to be called repeatedly and/or it is going to be called for a large number of rows, it's highly recommended to build node distance indexes beforehand through isotree.build.indexer with option 'with_distances=TRUE', as then the computation will be done based on terminal node indices instead, which is a much faster procedure. If distance calculations are all going to be performed with respect to a fixed set of points, it's highly recommended to set those points as references through isotree.set.reference.points.
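The indexer workflow described above might be sketched as follows. The data is hypothetical, 'missing_action = "fail"' is chosen here only because the toy data contains no missing values, and re-assigning the result of isotree.build.indexer assumes that the function returns the model object.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(600), ncol = 3)  # hypothetical data without missing values
model <- isolation.forest(
  X, ntrees = 50, ndim = 1,
  missing_action = "fail",  # toy data has no NAs
  nthreads = 1
)

new_pts <- X[1:20, ]

# Default procedure: counts steps as observations pass down the trees
d_steps <- predict(model, new_pts, type = "dist")

# Build the node indexer with distances, then repeat the same call
model <- isotree.build.indexer(model, with_distances = TRUE)
d_indexed <- predict(model, new_pts, type = "dist")
```

Both calls return a 'dist' object here since 'square_mat' defaults to FALSE for distances; the speed difference only becomes noticeable with many more rows than in this toy case.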

If using 'assume_full_distr=FALSE' (not recommended to use such option), distance predictions with and without an indexer will differ slightly due to differences in what they count towards "additional" observations in the calculation.

Value

The requested prediction type, which can be:

Model serving considerations

If the model is built with 'nthreads>1', the prediction function predict.isolation_forest will use OpenMP for parallelization. In a Linux setup, one usually has GNU's "gomp" as the OpenMP backend, which will hang when used in a forked process - for example, if one tries to call this prediction function from 'RestRserve', which uses process forking for parallelization, it will cause the whole application to freeze. A potential fix in these cases is to pass 'nthreads=1' to 'predict', or to set the number of threads to 1 in the model object (e.g. 'model$nthreads <- 1L' or calling isotree.set.nthreads), or to compile this library without OpenMP (requires manually altering the 'Makevars' file), or to use a non-GNU OpenMP backend (such as LLVM's 'libomp'). This should not be an issue when using this library normally in e.g. an RStudio session.
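The two simplest workarounds mentioned above could look like this in a serving script. The data and the initial thread count are hypothetical.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(600), ncol = 3)  # hypothetical data
model <- isolation.forest(X, ntrees = 50, nthreads = 2)

# Option 1: force a single thread for one call only
scores_call <- predict(model, X, nthreads = 1)

# Option 2: change the thread count stored in the model object,
# so that later calls default to a single thread
model$nthreads <- 1L
scores_model <- predict(model, X)
```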

The R objects that hold the models contain heap-allocated C++ objects which do not map to R types and which thus do not survive serializations the same way R objects do. In order to make model objects serializable (i.e. usable with 'save', 'saveRDS', and similar), the package offers two mechanisms: (a) a 'lazy_serialization' option which uses the ALTREP system as a workaround, by defining classes with serialization methods but without datapointer methods (see the docs for 'lazy_serialization' for more info); (b) a more theoretically correct way in which raw bytes are produced alongside the model and from which the C++ objects can be reconstructed. When using the lazy serialization system, C++ objects are restored automatically on load and the serialized bytes then discarded, but this is not the case when using the serialized bytes approach. For model serving, one would usually want to drop these serialized bytes after having loaded a model through 'readRDS' or similar (note that reconstructing the C++ object will first require calling isotree.restore.handle, which is done automatically when calling 'predict' and similar), as they can increase memory usage by a large amount. These redundant raw bytes can be dropped as follows: 'model$cpp_objects$model$ser <- NULL' (and an additional 'model$cpp_objects$imputer$ser <- NULL' when using 'build_imputer=TRUE', and 'model$cpp_objects$indexer$ser <- NULL' when building a node indexer). After that, one might want to force garbage collection through 'gc()'.
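A sketch of the byte-dropping step for a model saved with 'lazy_serialization=FALSE' is given below. The data and the temporary file are hypothetical; the handle is restored explicitly before the bytes are removed, since the C++ object is reconstructed from them.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(600), ncol = 3)  # hypothetical data
model <- isolation.forest(
  X, ntrees = 50, nthreads = 1,
  lazy_serialization = FALSE  # serialized-bytes mechanism
)
scores_before <- predict(model, X)

# Save and reload, as a serving process would do
path <- tempfile(fileext = ".Rds")
saveRDS(model, path)
served <- readRDS(path)

# Rebuild the C++ object first, then drop the redundant raw bytes
isotree.restore.handle(served)
served$cpp_objects$model$ser <- NULL
invisible(gc())

scores_after <- predict(served, X)
```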

Usually, for serving purposes, one wants a setup as minimalistic as possible (e.g. smaller docker images). This library can be made smaller and faster to compile by disabling some features - particularly, the library will by default build with support for calculation of aggregated metrics (such as standard deviations) in 'long double' precision (an extended precision type), which is a functionality that's unlikely to get used (default is not to use this type as it is slower, and calculations done in the 'predict' function do not use it for anything). Support for 'long double' can be disabled at compile time by setting up an environment variable 'NO_LONG_DOUBLE' before installing the package (e.g. by issuing command 'Sys.setenv("NO_LONG_DOUBLE" = "1")' before 'install.packages').

See Also

isolation.forest isotree.restore.handle isotree.build.indexer isotree.set.reference.points


[Package isotree version 0.6.1-1 Index]