predict.isolation_forest {isotree} | R Documentation |
Predict method for Isolation Forest
Description
Predict method for Isolation Forest
Usage
## S3 method for class 'isolation_forest'
predict(
object,
newdata,
type = "score",
square_mat = ifelse(type == "kernel", TRUE, FALSE),
refdata = NULL,
use_reference_points = TRUE,
nthreads = object$nthreads,
...
)
Arguments
object |
An Isolation Forest object as returned by isolation.forest. |
newdata |
A 'data.frame', 'data.table', 'tibble', 'matrix', or sparse matrix (from package 'Matrix' or 'SparseM', CSC/dgCMatrix supported for outlierness, distance, kernels; CSR/dgRMatrix supported for outlierness and imputations) for which to predict outlierness, distance, kernels, or imputations of missing values. If 'newdata' is sparse and one wants to obtain the outlier score or average depth or tree numbers, it's highly recommended to pass it in CSC ('dgCMatrix') format as it will be much faster when the number of trees or rows is large. |
type |
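Type of prediction to output. Options are: '"score"' and '"avg_depth"' (outlierness, one value per row); '"tree_num"' (terminal node number under each tree); '"tree_depths"' (one value per row and per tree); '"dist"', '"avg_sep"', '"kernel"' and '"kernel_raw"' (distances, average separation depths, or kernels between points); and '"impute"' (imputation of missing values). See the section 'Value' for the output format of each.
|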
square_mat |
When passing 'type' = '"dist"' or '"avg_sep"' or '"kernel"' or '"kernel_raw"' with no 'refdata', whether to return a full square matrix (returned as a numeric 'matrix' object) or just its upper-triangular part (returned as a 'dist' object and compatible with functions such as 'hclust'), in which the entry for pair (i,j) with 1 <= i < j <= n is located at position p(i, j) = ((i - 1) * (n - i/2) + j - i). Ignored when not predicting distance/separation/kernels or when passing 'refdata' or 'use_reference_points=TRUE' plus having reference points. |
refdata |
If passing this and calculating distances or average separation depths or kernels, will calculate distances between each point in 'newdata' and each point in 'refdata', outputting a matrix in which points in 'newdata' correspond to rows and points in 'refdata' correspond to columns. Must be of the same type as 'newdata' (e.g. 'data.frame', 'matrix', 'dgCMatrix', etc.). If this is not passed, and type is '"dist"' or '"avg_sep"' or '"kernel"' or '"kernel_raw"', will calculate pairwise distances/separation between the points in 'newdata'. Note that, if 'refdata' is passed and the model object has an indexer with reference points added (through isotree.set.reference.points), those reference points will be ignored for the calculation. |
use_reference_points |
When the model object has an indexer with reference points (which can be added through isotree.set.reference.points) and passing 'type="dist"' or '"avg_sep"' or '"kernel"' or '"kernel_raw"', whether to calculate the distances/kernels from 'newdata' to those reference points instead of the pairwise distances between points in 'newdata'. This is ignored when passing 'refdata' or when the model object does not contain an indexer or the indexer does not contain reference points. |
nthreads |
Number of parallel threads to use. Note: for better performance, it's recommended to set the number of threads to the number of physical CPU cores, which in a typical desktop CPU, corresponds to half the number of threads (see details for more information). Shorthand for best performance: 'nthreads = RhpcBLASctl::get_num_cores()' |
... |
Not used. |
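The position formula given under 'square_mat' can be checked with a short sketch. It uses a 'dist' object from base R's 'stats::dist' rather than one produced by 'predict', on the assumption that both share the standard 'dist' layout; 'dist_position' is a hypothetical helper name introduced only for illustration.

```r
# Position of pair (i, j), with 1 <= i < j <= n, inside a 'dist' vector,
# following the formula stated for 'square_mat'.
# 'dist_position' is a hypothetical helper, not part of isotree.
dist_position <- function(i, j, n) (i - 1) * (n - i / 2) + j - i

set.seed(123)
X <- matrix(rnorm(5 * 2), nrow = 5)  # toy data with 5 points
d <- stats::dist(X)                  # upper-triangular part stored as a vector
full <- as.matrix(d)                 # equivalent full square matrix
n <- attr(d, "Size")

# The entry for pair (2, 4) is found at the computed position
v <- as.vector(d)
v[dist_position(2, 4, n)] == full[2, 4]
```

Converting with 'as.matrix' is usually simpler when many entries are needed; the formula is mostly useful for looking up single pairs without materializing the square matrix.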
Details
The standardized outlier score for isolation-based metrics is calculated according to the original paper's formula: 2^{ - \frac{\bar{d}}{c(n)} }, where \bar{d} is the average depth under each tree at which an observation becomes isolated (a remainder is extrapolated if the actual terminal node is not isolated), and c(n) is the expected isolation depth if observations were uniformly random (see references under isolation.forest for details). The actual calculation of c(n) however differs from the paper as this package uses more exact procedures for calculation of harmonic numbers.
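As a rough illustration of the formula, the following sketch computes the score from an average depth. It assumes the expected depth from the original paper, c(n) = 2 H(n-1) - 2 (n-1) / n, with n taken to be the number of observations used per tree, and it sums the harmonic number H(n-1) directly. The function names are hypothetical and this is not the package's internal implementation.

```r
# Expected isolation depth c(n), following the original paper's formula
# (assumption: n is the sample size per tree, with n >= 2).
expected_depth <- function(n) {
  stopifnot(n >= 2)
  h <- sum(1 / seq_len(n - 1))  # harmonic number H(n-1), summed exactly
  2 * h - 2 * (n - 1) / n
}

# Standardized outlier score 2^(-avg_depth / c(n))
standardized_score <- function(avg_depth, n) {
  2^(-avg_depth / expected_depth(n))
}

# An observation isolated at exactly the expected depth scores 0.5;
# shallower (easier to isolate) observations score higher.
standardized_score(expected_depth(256), 256)
standardized_score(expected_depth(256) / 2, 256)
```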
For density-based metrics, see the documentation for 'scoring_metric' in isolation.forest for details about the score calculations.
The distribution of outlier scores for isolation-based metrics should be centered around 0.5, unless using non-random splits (parameters 'prob_pick_avg_gain', 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens') and/or range penalizations, or having distributions which are too skewed. For 'scoring_metric="density"', most of the values should be negative, and while zero can be used as a natural score threshold, the scores are unlikely to be centered around zero.
The more threads that are set for the model, the higher the memory requirement will be as each thread will allocate an array with one entry per row (outlierness) or combination (distance), with an exception being calculation of distances/kernels to reference points, which do not do this.
For multi-threaded predictions on many rows, it is recommended to set the number of threads to the number of physical cores of the CPU rather than the number of logical cores, as this typically performs better. On a typical x86-64 desktop CPU, this amounts to half the number of logical threads - for example: 'model$nthreads <- RhpcBLASctl::get_num_cores()'
Outlierness predictions for sparse data will be much slower than for dense data. It is not recommended to pass sparse matrices unless the data would be too big to fit in memory in dense form.
Note that after loading a serialized object from 'isolation.forest' through 'readRDS' or 'load', if it was constructed with 'lazy_serialization=FALSE' it will only de-serialize the underlying C++ object upon running 'predict', 'print', or 'summary', so the first run will be slower, while subsequent runs will be faster as the C++ object will already be in-memory. This does not apply when using 'lazy_serialization=TRUE'.
In order to save memory when fitting and serializing models, the functionality for outputting terminal node numbers will generate index mappings on the fly for all tree nodes, even if passing only 1 row, so it's only recommended for batch predictions. If this type of prediction is desired, it can be sped up by building an index of terminal nodes through isotree.build.indexer, which will avoid having to recompute these every time.
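A minimal sketch of terminal-node predictions with a pre-built indexer is shown below. The toy data, 'ntrees = 10' and 'missing_action = "fail"' are choices made for this sketch only (the toy data has no missing values); the result of isotree.build.indexer is re-assigned so that the code does not depend on whether the model is modified in place.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # toy data, no missing values

# Assumption for this sketch: a small model fitted with
# 'missing_action = "fail"', since the data has no NAs.
model <- isolation.forest(X, ntrees = 10, nthreads = 1,
                          missing_action = "fail")

# Build the terminal-node index once, so that repeated calls
# do not need to regenerate the index mappings
model <- isotree.build.indexer(model)

# One row per observation, one column per tree
nodes <- predict(model, X, type = "tree_num")
dim(nodes)
```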
The outlier scores/depth predict functionality is optimized for making predictions on one or a few rows at a time - for making large batches of predictions, it might be faster to use the option 'output_score=TRUE' in 'isolation.forest'.
When making predictions on CSC matrices with many rows using multiple threads, there can be small differences between runs due to roundoff error.
When imputing missing values, the input may contain new columns (i.e. not present when the model was fitted), which will be output as-is.
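Imputation can be sketched as follows, assuming a model fitted with 'build_imputer=TRUE' (mentioned under 'Model serving considerations'). The toy data and parameter values are illustrative only.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # toy training data

# The imputer has to be requested when fitting the model
model <- isolation.forest(X, ntrees = 10, nthreads = 1,
                          build_imputer = TRUE)

# New data with a missing entry
X_na <- X[1:5, , drop = FALSE]
X_na[1, 2] <- NA

# Output has the same type and shape as the input, with NAs filled in
X_imp <- predict(model, X_na, type = "impute")
anyNA(X_imp)
```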
If passing 'type="dist"' or 'type="avg_sep"', by default, it will do the calculation through a procedure that counts steps as observations are passed down the trees, which is especially slow and not recommended for more than a few thousand observations. If this calculation is going to be called repeatedly and/or it is going to be called for a large number of rows, it's highly recommended to build node distance indexes beforehand through isotree.build.indexer with option 'with_distances=TRUE', as then the computation will be done based on terminal node indices instead, which is a much faster procedure. If distance calculations are all going to be performed with respect to a fixed set of points, it's highly recommended to set those points as references through isotree.set.reference.points.
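The two speed-ups described above might be combined as in the following sketch. The data sizes, 'ntrees = 10' and 'missing_action = "fail"' are assumptions made for illustration (the toy data has no missing values), and return values are re-assigned so that the code does not rely on in-place modification of the model.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # toy data, no missing values
model <- isolation.forest(X, ntrees = 10, nthreads = 1,
                          missing_action = "fail")

# Index terminal nodes together with node distances, so that distance
# calculations work from terminal node indices
model <- isotree.build.indexer(model, with_distances = TRUE)

# Pairwise distances among the first 20 rows ('dist' object by default)
d_pairwise <- predict(model, X[1:20, ], type = "dist")

# Fix the first 5 rows as reference points; later distance predictions
# are then calculated against these points by default
model <- isotree.set.reference.points(model, X[1:5, ],
                                      with_distances = TRUE)
d_to_ref <- predict(model, X[1:20, ], type = "dist")
dim(d_to_ref)  # rows: points in 'newdata', columns: reference points
```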
If using 'assume_full_distr=FALSE' (not recommended to use such option), distance predictions with and without an indexer will differ slightly due to differences in what they count towards "additional" observations in the calculation.
Value
The requested prediction type, which can be:
A numeric vector with one entry per row in 'newdata' (for output types '"score"' and '"avg_depth"').
An integer matrix with one row per row in 'newdata' and one column per tree in the model, indicating the terminal node number under each tree for each observation (for output type '"tree_num"').
A numeric matrix with rows matching to those in 'newdata' and one column per tree in the model, for output type '"tree_depths"'.
A numeric square matrix or 'dist' object which consists of a vector with the upper triangular part of a square matrix (for output types '"dist"', '"avg_sep"', '"kernel"', '"kernel_raw"'; with no 'refdata' and no reference points or 'use_reference_points=FALSE').
A numeric matrix with points in 'newdata' as rows and points in 'refdata' as columns (for output types '"dist"', '"avg_sep"', '"kernel"', '"kernel_raw"'; with 'refdata').
A numeric matrix with points in 'newdata' as rows and reference points set through isotree.set.reference.points as columns (for output types '"dist"', '"avg_sep"', '"kernel"', '"kernel_raw"'; with 'use_reference_points=TRUE' and no 'refdata').
The same type as the input 'newdata' (for output type '"impute"').
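The shapes of several of these outputs can be inspected with a short sketch; the toy data and 'ntrees = 10' are assumptions made only for illustration.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # toy data
model <- isolation.forest(X, ntrees = 10, nthreads = 1)

# Numeric vectors with one entry per row
scores <- predict(model, X)                      # type = "score"
depths <- predict(model, X, type = "avg_depth")

# Full square matrix instead of a 'dist' object
d_sq <- predict(model, X[1:10, ], type = "dist", square_mat = TRUE)

# Rows correspond to 'newdata', columns to 'refdata'
d_ref <- predict(model, X[1:10, ], type = "dist",
                 refdata = X[11:15, ])

length(scores)
dim(d_sq)
dim(d_ref)
```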
Model serving considerations
If the model is built with 'nthreads>1', the prediction function predict.isolation_forest will use OpenMP for parallelization. In a Linux setup, one usually has GNU's "gomp" as the OpenMP backend, which will hang when used in a forked process - for example, if one tries to call this prediction function from 'RestRserve', which uses process forking for parallelization, it will cause the whole application to freeze. A potential fix in these cases is to pass 'nthreads=1' to 'predict', or to set the number of threads to 1 in the model object (e.g. 'model$nthreads <- 1L' or calling isotree.set.nthreads), or to compile this library without OpenMP (requires manually altering the 'Makevars' file), or to use a non-GNU OpenMP backend (such as LLVM's 'libomp'). This should not be an issue when using this library normally in e.g. an RStudio session.
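The first two workarounds can be sketched as follows. The toy data and the use of two threads when fitting are assumptions for illustration; the sketch only shows how the thread count is overridden and does not itself fork any process.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # toy data

# Model fitted with more than one thread
model <- isolation.forest(X, ntrees = 10, nthreads = 2)

# Option 1: force single-threaded prediction for this call only
s1 <- predict(model, X, nthreads = 1)

# Option 2: change the default stored in the model object, so that
# later calls to 'predict' are single-threaded
model$nthreads <- 1L
s2 <- predict(model, X)

all.equal(s1, s2)
```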
The R objects that hold the models contain heap-allocated C++ objects which do not map to R types and which thus do not survive serializations the same way R objects do. In order to make model objects serializable (i.e. usable with 'save', 'saveRDS', and similar), the package offers two mechanisms: (a) a 'lazy_serialization' option which uses the ALTREP system as a workaround, by defining classes with serialization methods but without datapointer methods (see the docs for 'lazy_serialization' for more info); (b) a more theoretically correct way in which raw bytes are produced alongside the model and from which the C++ objects can be reconstructed. When using the lazy serialization system, C++ objects are restored automatically on load and the serialized bytes then discarded, but this is not the case when using the serialized bytes approach. For model serving, one would usually want to drop these serialized bytes after having loaded a model through 'readRDS' or similar (note that reconstructing the C++ object will first require calling isotree.restore.handle, which is done automatically when calling 'predict' and similar), as they can increase memory usage by a large amount. These redundant raw bytes can be dropped as follows: 'model$cpp_objects$model$ser <- NULL' (and an additional 'model$cpp_objects$imputer$ser <- NULL' when using 'build_imputer=TRUE', and 'model$cpp_objects$indexer$ser <- NULL' when building a node indexer). After that, one might want to force garbage collection through 'gc()'.
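For a model fitted with 'lazy_serialization=FALSE', dropping the redundant bytes after loading might look like the sketch below. The temporary file and toy data are assumptions for illustration; the explicit call to isotree.restore.handle is included so that the C++ object is reconstructed before the bytes are removed.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # toy data
model <- isolation.forest(X, ntrees = 10, nthreads = 1,
                          lazy_serialization = FALSE)
p_before <- predict(model, X)

# Round trip through disk, as would happen when deploying a model
f <- tempfile(fileext = ".rds")
saveRDS(model, f)
loaded <- readRDS(f)

# Reconstruct the C++ object from the serialized bytes first...
loaded <- isotree.restore.handle(loaded)

# ...then drop the now-redundant raw bytes and reclaim the memory
loaded$cpp_objects$model$ser <- NULL
invisible(gc())

p_after <- predict(loaded, X)
all.equal(p_before, p_after)
```

Once the bytes are dropped, the in-memory object should not be serialized again with 'saveRDS' or similar, since the C++ object could no longer be reconstructed from it.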
Usually, for serving purposes, one wants a setup as minimalistic as possible (e.g. smaller docker images). This library can be made smaller and faster to compile by disabling some features - particularly, the library will by default build with support for calculation of aggregated metrics (such as standard deviations) in 'long double' precision (an extended precision type), which is a functionality that's unlikely to get used (default is not to use this type as it is slower, and calculations done in the 'predict' function do not use it for anything). Support for 'long double' can be disabled at compile time by setting up an environment variable 'NO_LONG_DOUBLE' before installing the package (e.g. by issuing command 'Sys.setenv("NO_LONG_DOUBLE" = "1")' before 'install.packages').
See Also
isolation.forest isotree.restore.handle isotree.build.indexer isotree.set.reference.points