isolation.forest {isotree} | R Documentation |
Create Isolation Forest Model
Description
Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest and FCF models that aim at isolating outliers faster and/or finding clustered outliers.
This version adds heuristics to handle missing data and categorical variables. It can be used to approximate pairwise distances by checking the depth after which two observations become separated, and to approximate densities by fitting trees beyond the balanced-tree limit. It also offers options to vary between randomized and deterministic splits.
Important: The default parameters in this software do not correspond to the suggested parameters in any of the references (see section "Matching models from references"). In particular, the following default values are likely to cause huge differences when compared to the defaults in other software: 'ndim', 'sample_size', 'ntrees'. The defaults here are nevertheless more likely to result in better models. In order to mimic the Python library "scikit-learn" for example, one would need to pass 'ndim=1', 'sample_size=256', 'ntrees=100', 'missing_action="fail"', 'nthreads=1'.
Note that the default parameters will not scale to large datasets. In particular, if the amount of data is large, it's suggested to set a smaller sample size for each tree (parameter 'sample_size'), and to fit fewer of them (parameter 'ntrees'). As well, the default option for 'missing_action' might slow things down significantly (see below for details). These defaults can also result in very big model sizes in memory and as serialized files (e.g. models that weigh over 10GB) when the number of rows in the data is large. Using fewer trees, smaller sample sizes, and shallower trees can help to reduce model sizes if that becomes a problem.
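For example, a minimal sketch of the scikit-learn-like settings and of a lighter configuration for larger data could look as follows (the random 'X' below is purely illustrative, and the object names are arbitrary):

library(isotree)
set.seed(1)
X <- matrix(rnorm(1000 * 4), nrow = 1000)
### Roughly mimic scikit-learn's defaults
iso_sklearn_like <- isolation.forest(X, ndim = 1, sample_size = 256,
                                     ntrees = 100, missing_action = "fail",
                                     nthreads = 1)
### A lighter configuration than the package defaults for large data:
### smaller samples, fewer and shallower trees
iso_light <- isolation.forest(X, sample_size = 256, ntrees = 100,
                              max_depth = 8, nthreads = 1)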
The model offers many tunable parameters (see reference [11] for a comparison). The most likely candidate to tune is 'prob_pick_pooled_gain', for which higher values tend to result in a better ability to flag outliers in multimodal datasets, at the expense of poorer generalizability to inputs with values outside the variables' ranges to which the model was fit (see plots generated from the examples for a better idea of the difference). The next candidate to tune is 'sample_size' - the default is to use all rows, but in some datasets introducing sub-sampling can help, especially for the single-variable model. In smaller datasets, one might also want to experiment with 'weigh_by_kurtosis' and perhaps lower 'ndim'. If using 'prob_pick_pooled_gain', models are likely to benefit from deeper trees (controlled by 'max_depth'), but using large samples and/or deeper trees can result in significantly slower model fitting and predictions - in such cases, using 'min_gain' (with a value like 0.25) with 'max_depth=NULL' can offer a better speed/performance trade-off than changing 'max_depth'.
If the data has categorical variables and these are more important for determining outlierness compared to numerical columns, one might want to experiment with 'ndim=1', 'categ_split_type="single_categ"', and 'scoring_metric="density"'; while for all-numeric datasets - especially if there are missing values - one might want to experiment with 'ndim=2' or 'ndim=3'.
For small datasets, one might also want to experiment with 'ndim=1', 'scoring_metric="adj_depth"' and 'penalize_range=TRUE'.
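The tuning suggestions above could be tried out along these lines (a rough sketch on illustrative random data; the values shown are the ones mentioned in the preceding paragraphs, not definitive recommendations):

library(isotree)
set.seed(1)
X <- matrix(rnorm(1000 * 4), nrow = 1000)
### Pooled gain, controlling depth through 'min_gain' instead of 'max_depth'
iso_pooled <- isolation.forest(X, prob_pick_pooled_gain = 1, min_gain = 0.25,
                               max_depth = NULL, nthreads = 1)
### Suggested starting point when categorical columns dominate
### (shown here on numeric data only to illustrate the syntax)
iso_categ <- isolation.forest(X, ndim = 1, categ_split_type = "single_categ",
                              scoring_metric = "density", nthreads = 1)
### Suggested starting point for small datasets
iso_small <- isolation.forest(X, ndim = 1, scoring_metric = "adj_depth",
                              penalize_range = TRUE, nthreads = 1)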
Usage
isolation.forest(
data,
sample_size = min(nrow(data), 10000L),
ntrees = 500,
ndim = 1,
ntry = 1,
categ_cols = NULL,
max_depth = ceiling(log2(sample_size)),
ncols_per_tree = ncol(data),
prob_pick_pooled_gain = 0,
prob_pick_avg_gain = 0,
prob_pick_full_gain = 0,
prob_pick_dens = 0,
prob_pick_col_by_range = 0,
prob_pick_col_by_var = 0,
prob_pick_col_by_kurt = 0,
min_gain = 0,
missing_action = ifelse(ndim > 1, "impute", "divide"),
new_categ_action = ifelse(ndim > 1, "impute", "weighted"),
categ_split_type = ifelse(ndim > 1, "subset", "single_categ"),
all_perm = FALSE,
coef_by_prop = FALSE,
recode_categ = FALSE,
weights_as_sample_prob = TRUE,
sample_with_replacement = FALSE,
penalize_range = FALSE,
standardize_data = TRUE,
scoring_metric = "depth",
fast_bratio = TRUE,
weigh_by_kurtosis = FALSE,
coefs = "uniform",
assume_full_distr = TRUE,
build_imputer = FALSE,
output_imputations = FALSE,
min_imp_obs = 3,
depth_imp = "higher",
weigh_imp_rows = "inverse",
output_score = FALSE,
output_dist = FALSE,
square_dist = FALSE,
sample_weights = NULL,
column_weights = NULL,
lazy_serialization = TRUE,
seed = 1,
use_long_double = FALSE,
nthreads = parallel::detectCores()
)
Arguments
data |
Data to which to fit the model. Supported input types are:
If passing a 'data.frame', will assume that columns are:
Other input and column types are not supported. |
sample_size |
Sample size of the data sub-samples with which each binary tree will be built. Recommended value in references [1], [2], [3], [4] is 256, while the default value in the author's code in reference [5] is 'nrow(data)'. If passing 'NULL', will take the full number of rows in the data (no sub-sampling). If passing a number between zero and one, will assume it means taking a sample size that represents that proportion of the rows in the data. Note that sub-sampling is incompatible with 'output_score', 'output_dist', and 'output_imputations', and if any of those options is requested, 'sample_size' will be overridden. Hint: seeing a distribution of scores which is on average too far below 0.5 could mean that the model needs more trees and/or bigger samples to reach convergence (unless using non-random splits, in which case the distribution is likely to be centered around a much lower number), or that the distributions in the data are too skewed for random uniform splits. |
ntrees |
Number of binary trees to build for the model. Recommended value in reference [1] is 100, while the default value in the author's code in reference [5] is 10. In general, the number of trees required for good results is higher when (a) there are many columns, (b) there are categorical variables, (c) categorical variables have many categories, (d) 'ndim' is high, (e) 'prob_pick_pooled_gain' is used, (f) 'scoring_metric="density"' or 'scoring_metric="boxed_density"' are used. Hint: seeing a distribution of scores which is on average too far below 0.5 could mean that the model needs more trees and/or bigger samples to reach convergence (unless using non-random splits, in which case the distribution is likely to be centered around a much lower number), or that the distributions in the data are too skewed for random uniform splits. |
ndim |
Number of columns to combine to produce a split. If passing 1, will produce the single-variable model described in references [1] and [2], while if passing values greater than 1, will produce the extended model described in references [3] and [4]. Recommended value in reference [4] is 2, while [3] recommends a low value such as 2 or 3. Models with values higher than 1 are referred to hereafter as the extended model (as in reference [3]). If passing 'NULL', will assume it means using the full number of columns in the data. Note that, when using 'ndim>1' plus 'standardize_data=TRUE', the variables are standardized at each step as suggested in [4], which makes the models slightly different than in [3]. In general, when the data has categorical variables, models with 'ndim=1' plus 'categ_split_type="single_categ"' tend to produce better results, while models with 'ndim>1' tend to produce better results for numerical-only data, especially in the presence of missing values. |
ntry |
When using any of 'prob_pick_pooled_gain', 'prob_pick_avg_gain', 'prob_pick_full_gain', 'prob_pick_dens', how many variables (with 'ndim=1') or linear combinations (with 'ndim>1') to try for determining the best one according to gain. Recommended value in reference [4] is 10 (with 'prob_pick_avg_gain', for outlier detection), while the recommended value in reference [11] is 1 (with 'prob_pick_pooled_gain', for outlier detection), and the recommended value in reference [9] is 10 to 20 (with 'prob_pick_pooled_gain', for missing value imputations). |
categ_cols |
Columns that hold categorical features, when the data is passed as a matrix (either dense or sparse). Can be passed as an integer vector (numeration starting at 1) denoting the indices of the columns that are categorical, or as a character vector denoting the names of the columns that are categorical, assuming that 'data' has column names. Categorical columns should contain only integer values with a continuous numeration starting at zero (not at one, as is typical in R packages), with negative values and NA/NaN taken as missing. The maximum categorical value should not exceed '.Machine$integer.max' (typically 2147483647). This is ignored when the input is passed as a 'data.frame', as then it will consider columns as categorical depending on their type/class (see the documentation for 'data' for details). |
max_depth |
Maximum depth of the binary trees to grow. By default, will limit it to the corresponding depth of a balanced binary tree with number of terminal nodes corresponding to the sub-sample size (the reason being that, if trying to detect outliers, an outlier will only be so if it turns out to be isolated with shorter average depth than usual, which corresponds to a balanced tree depth). When a terminal node has more than 1 observation, the remaining isolation depth for them is estimated assuming the data and splits are both uniformly random (separation depth follows a similar process with expected value calculated as in reference [6]). Default setting for references [1], [2], [3], [4] is the same as the default here, but it's recommended to pass higher values if using the model for purposes other than outlier detection. If passing 'NULL' or zero, will not limit the depth of the trees (that is, will grow them until each observation is isolated or until no further split is possible). Note that models that use 'prob_pick_pooled_gain' or 'prob_pick_avg_gain' are likely to benefit from deeper trees (larger 'max_depth'), but deeper trees can result in much slower model fitting and predictions. If using pooled gain, one might want to substitute 'max_depth' with 'min_gain'. |
ncols_per_tree |
Number of columns to use (have as potential candidates for splitting at each iteration) in each tree, somewhat similar to the 'mtry' parameter of random forests. In general, this is only relevant when using non-random splits and/or weighted column choices. If passing a number between zero and one, will assume it means taking a sample size that represents that proportion of the columns in the data. Note that, if passing exactly 1, will assume it means taking 100% of the columns, rather than taking a single column. If passing 'NULL', will use the full number of columns in the data. |
prob_pick_pooled_gain |
This parameter indicates the probability of choosing the threshold on which to split a variable (with 'ndim=1') or a linear combination of variables (when using 'ndim>1') as the threshold that maximizes a pooled standard deviation gain criterion (see references [9] and [11]) on the same variable or linear combination, similarly to regression trees such as CART. If using 'ntry>1', will try several variables or linear combinations thereof and choose the one in which the largest standardized gain can be achieved. For categorical variables with 'ndim=1', will use Shannon entropy instead (like in [7]). Compared to a simple averaged gain, this tends to result in more evenly-divided splits and more clustered groups when they are smaller. Recommended to pass higher values when used for imputation of missing values. When used for outlier detection, datasets with multimodal distributions usually see better performance under this type of splits. Note that, since this makes the trees more even and thus it takes more steps to produce isolated nodes, the resulting object will be heavier. When splits are not made according to any of 'prob_pick_avg_gain', 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens', both the column and the split point are decided at random. Note that, if passing value 1 (100%) with no sub-sampling and using the single-variable model, every single tree will have the exact same splits. Be aware that 'penalize_range' can also have a large impact when using 'prob_pick_pooled_gain'. Under this option, models are likely to produce better results when increasing 'max_depth'. Alternatively, one can also control the depth through 'min_gain' (for which one might want to set 'max_depth=NULL'). Important detail: if using any of 'prob_pick_avg_gain' or 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens', the distribution of outlier scores is unlikely to be centered around 0.5. |
prob_pick_avg_gain |
This parameter indicates the probability of choosing the threshold on which to split a variable (with 'ndim=1') or a linear combination of variables (when using 'ndim>1') as the threshold that maximizes an averaged standard deviation gain criterion (see references [4] and [11]) on the same variable or linear combination. If using 'ntry>1', will try several variables or linear combinations thereof and choose the one in which the largest standardized gain can be achieved. For categorical variables with 'ndim=1', will take the expected standard deviation that would be gotten if the column were converted to numerical by assigning to each category a random number '~ Unif(0, 1)' and calculate gain with those assumed standard deviations. Compared to a pooled gain, this tends to result in more cases in which a single observation or very few of them are put into one branch. Typically, datasets with outliers defined by extreme values in some column more or less independently of the rest, usually see better performance under this type of split. Recommended to use sub-samples (parameter 'sample_size') when passing this parameter. Note that, since this will create isolated nodes faster, the resulting object will be lighter (use less memory). When splits are not made according to any of 'prob_pick_avg_gain', 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens', both the column and the split point are decided at random. Default setting for [1], [2], [3] is zero, and default for [4] is 1. This is the randomization parameter that can be passed to the author's original code in [5], but note that the code in [5] suffers from a mathematical error in the calculation of running standard deviations, so the results from it might not match with this library's. Be aware that, if passing a value of 1 (100%) with no sub-sampling and using the single-variable model, every single tree will have the exact same splits. Under this option, models are likely to produce better results when increasing 'max_depth'. Important detail: if using either 'prob_pick_avg_gain', 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens', the distribution of outlier scores is unlikely to be centered around 0.5. |
prob_pick_full_gain |
This parameter indicates the probability of choosing the threshold on which to split a variable (with 'ndim=1') or a linear combination of variables (when using 'ndim>1') as the threshold that minimizes the pooled sums of variances of all columns (or a subset of them if using 'ncols_per_tree'). In general, this is much slower to evaluate than the other gain types, and does not tend to lead to better results. When using this option, one might want to use a different scoring metric (particularly '"density"', '"boxed_density2"' or '"boxed_ratio"'). Note that the calculations are all done through the (exact) sorted-indices approach, which is much slower than the (approximate) histogram approach used by other decision tree software. Be aware that the data is not standardized in any way for the variance calculations, thus the scales of features will make a large difference under this option, which might not make it suitable for all types of data. This option is not compatible with categorical data, and 'min_gain' does not apply to it. When splits are not made according to any of 'prob_pick_avg_gain', 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens', both the column and the split point are decided at random. Default setting for references [1], [2], [3], [4] is zero. |
prob_pick_dens |
This parameter indicates the probability of choosing the threshold on which to split a variable (with 'ndim=1') or a linear combination of variables (when using 'ndim>1') as the threshold that maximizes the pooled densities of the branch distributions. The 'min_gain' option does not apply to this type of splits. When splits are not made according to any of 'prob_pick_avg_gain', 'prob_pick_pooled_gain', 'prob_pick_full_gain', 'prob_pick_dens', both the column and the split point are decided at random. Default setting for [1], [2], [3], [4] is zero. |
prob_pick_col_by_range |
When using 'ndim=1', this denotes the probability of choosing the column to split with a probability proportional to the range spanned by each column within a node as proposed in reference [12]. When using 'ndim>1', this denotes the probability of choosing columns to create a hyperplane with a probability proportional to the range spanned by each column within a node. This option is not compatible with categorical data. If passing column weights, the effect will be multiplicative. Be aware that the data is not standardized in any way for the range calculations, thus the scales of features will make a large difference under this option, which might not make it suitable for all types of data. If there are infinite values, all columns having infinite values will be treated as having the same weight, and will be chosen before every other column with non-infinite values. Note that the proposed RRCF model from [12] uses a different scoring metric for producing anomaly scores, while this library uses isolation depth regardless of how columns are chosen, thus results are likely to be different from those of other software implementations. Nevertheless, as explored in [11], isolation depth as a scoring metric typically provides better results than the "co-displacement" metric from [12] under these split types. |
prob_pick_col_by_var |
When using 'ndim=1', this denotes the probability of choosing the column to split with a probability proportional to the variance of each column within a node. When using 'ndim>1', this denotes the probability of choosing columns to create a hyperplane with a probability proportional to the variance of each column within a node. For categorical data, it will calculate the expected variance if the column were converted to numerical by assigning to each category a random number '~ Unif(0, 1)', which depending on the number of categories and their distribution, produces numbers typically a bit smaller than standardized numerical variables. Note that when using sparse matrices, the calculation of variance will rely on a procedure that uses sums of squares, which has less numerical precision than the calculation used for dense inputs, and as such, the results might differ slightly. Be aware that this calculated variance is not standardized in any way, so the scales of features will make a large difference under this option. If passing column weights, the effect will be multiplicative. If passing a 'missing_action' different than "fail", infinite values will be ignored for the variance calculation. Otherwise, all columns with infinite values will have the same probability and will be chosen before columns with non-infinite values. |
prob_pick_col_by_kurt |
When using 'ndim=1', this denotes the probability of choosing the column to split with a probability proportional to the kurtosis of each column within a node (unlike the option 'weigh_by_kurtosis' which calculates this metric only at the root). When using 'ndim>1', this denotes the probability of choosing columns to create a hyperplane with a probability proportional to the kurtosis of each column within a node. For categorical data, it will calculate the expected kurtosis if the column were converted to numerical by assigning to each category a random number '~ Unif(0, 1)'. Note that when using sparse matrices, the calculation of kurtosis will rely on a procedure that uses sums of squares and higher-power numbers, which has less numerical precision than the calculation used for dense inputs, and as such, the results might differ slightly. If passing column weights, the effect will be multiplicative. This option is not compatible with 'weigh_by_kurtosis'. If passing a 'missing_action' different than "fail", infinite values will be ignored for the kurtosis calculation. Otherwise, all columns with infinite values will have the same probability and will be chosen before columns with non-infinite values. If using 'missing_action="impute"', the calculation of kurtosis will not use imputed values in order not to favor columns with missing values (which would increase kurtosis by all having the same central value). Be aware that kurtosis can be a rather slow metric to calculate. |
min_gain |
Minimum gain that a split threshold needs to produce in order to proceed with a split. Only used when the splits are decided by a variance gain criterion ('prob_pick_pooled_gain' or 'prob_pick_avg_gain', but not 'prob_pick_full_gain' nor 'prob_pick_dens'). If the highest possible gain in the evaluated splits at a node is below this threshold, that node becomes a terminal node. This can be used as a more sophisticated depth control when using pooled gain (note that 'max_depth' still applies on top of this heuristic). |
missing_action |
How to handle missing data at both fitting and prediction time. Options are
In the extended model, infinite values will be treated as missing. Passing '"fail"' will produce faster fitting and prediction times along with decreased model object sizes. Models from references [1], [2], [3], [4] correspond to '"fail"' here. Typically, models with 'ndim>1' are less affected by missing data than models with 'ndim=1'. |
new_categ_action |
What to do after splitting a categorical feature when new data that reaches that split has categories that the sub-sample from which the split was done did not have. Options are
Ignored when passing 'categ_split_type' = '"single_categ"'. |
categ_split_type |
Whether to split categorical features by assigning sub-sets of them to each branch (by passing '"subset"' there), or by assigning a single category to a branch and the rest to the other branch (by passing '"single_categ"' here). For the extended model, whether to give each category a coefficient ('"subset"'), or only one while the rest get zero ('"single_categ"'). |
all_perm |
When doing categorical variable splits by pooled gain with 'ndim=1' (single-variable model), whether to consider all possible permutations of variables to assign to each branch or not. If 'FALSE', will sort the categories by their frequency and make a grouping in this sorted order. Note that the number of combinations evaluated (if 'TRUE') is the factorial of the number of present categories in a given column (minus 2). For averaged gain, the best split is always to put the second most-frequent category in a separate branch, so not evaluating all permutations (passing 'FALSE') will make it possible to select other splits that respect the sorted frequency order. Ignored when not using categorical variables or not doing splits by pooled gain or using 'ndim>1'. |
coef_by_prop |
In the extended model, whether to sort the randomly-generated coefficients for categories according to their relative frequency in the tree node. This might provide better results when using categorical variables with too many categories, but is not recommended, and not reflective of real "categorical-ness". Ignored for the single-variable model ('ndim=1') and/or when not using categorical variables. |
recode_categ |
Whether to re-encode categorical variables even in case they are already passed as factors. This is recommended as it will eliminate potentially redundant categorical levels if they have no observations, but if the categorical variables are already of type 'factor' with only the levels that are present, it can be skipped for slightly faster fitting times. You'll likely want to pass 'FALSE' here if merging several models into one through isotree.append.trees. |
weights_as_sample_prob |
If passing sample (row) weights when fitting the model, whether to consider those weights as row sampling weights (i.e. the higher the weights, the more likely the observation will end up included in each tree sub-sample), or as distribution density weights (i.e. putting a weight of two is the same as if the row appeared twice, thus higher weight makes it less of an outlier, but does not give it a higher chance of being sampled if the data uses sub-sampling). |
sample_with_replacement |
Whether to sample rows with replacement or not (not recommended). Note that distance calculations, if desired, don't work when there are duplicate rows. This option is not compatible with 'output_score', 'output_dist', 'output_imputations'. |
penalize_range |
Whether to penalize (add -1 to the terminal depth) observations at prediction time that have a value of the chosen split variable (linear combination in extended model) that falls outside of a pre-determined reasonable range in the data being split (given by '2 * range' in data and centered around the split point), as proposed in reference [4] and implemented in the authors' original code in reference [5]. Not used in single-variable model when splitting by categorical variables. This option is not supported when using density-based outlier scoring metrics. It's recommended to turn this off for faster predictions on sparse CSC matrices. Note that this can make a very large difference in the results when using 'prob_pick_pooled_gain'. Be aware that this option can make the distribution of outlier scores a bit different (i.e. not centered around 0.5). |
standardize_data |
Whether to standardize the features at each node before creating a linear combination of them, as suggested in [4]. This is ignored when using 'ndim=1'. |
scoring_metric |
Metric to use for determining outlier scores (see reference [13]). Options are:
|
fast_bratio |
When using "boxed" metrics for scoring, whether to calculate them in a fast way through cumulative sum of logarithms of ratios after each split, or in a slower way as sum of logarithms of a single ratio per column for each terminal node. Usually, both methods should give the same results, but in some datasets, particularly when variables have too small or too large ranges, the first method can be prone to numerical inaccuracies due to roundoff close to zero. Note that this does not affect calculations for models with 'ndim>1', since given the split types, the calculation for them is different. |
weigh_by_kurtosis |
Whether to weigh each column according to the kurtosis obtained in the sub-sample that is selected for each tree as briefly proposed in reference [1]. Note that this is only done at the beginning of each tree sample. For categorical columns, will calculate expected kurtosis if the column were converted to numerical by assigning to each category a random number '~ Unif(0, 1)'. Note that when using sparse matrices, the calculation of kurtosis will rely on a procedure that uses sums of squares and higher-power numbers, which has less numerical precision than the calculation used for dense inputs, and as such, the results might differ slightly. Using this option makes the model more likely to pick the columns that have anomalous values when viewed as a 1-d distribution, and can bring a large improvement in some datasets. This is intended as a cheap feature selector, while the parameter 'prob_pick_col_by_kurt' provides the option to do this at each node in the tree for a different overall type of model. If passing column weights or using weighted column choices proportional to some other metric ('prob_pick_col_by_range', 'prob_pick_col_by_var'), the effect will be multiplicative. If passing 'missing_action="fail"' and the data has infinite values, columns with rows having infinite values will get a weight of zero. If passing a different value for missing action, infinite values will be ignored in the kurtosis calculation. If using 'missing_action="impute"', the calculation of kurtosis will not use imputed values in order not to favor columns with missing values (which would increase kurtosis by all having the same central value). |
coefs |
For the extended model, whether to sample random coefficients according to a normal distribution '~ N(0, 1)' (as proposed in reference [4]) or according to a uniform distribution '~ Unif(-1, +1)' as proposed in reference [3]. Ignored for the single-variable model. Note that, for categorical variables, the coefficients will be sampled ~ N (0,1) regardless - in order for both types of variables to have transformations in similar ranges (which will tend to boost the importance of categorical variables), pass '"uniform"' here. |
assume_full_distr |
When calculating pairwise distances (see reference [8]), whether to assume that the fitted model represents a full population distribution (will use a standardizing criterion assuming infinite sample as in reference [6], and the results of the similarity between two points at prediction time will not depend on the presence of any third point that is similar to them, but will differ more compared to the pairwise distances between points from which the model was fit). If passing 'FALSE', will calculate pairwise distances as if the new observations at prediction time were added to the sample to which each tree was fit, which will make the distances between two points potentially vary according to other newly introduced points. This will not be assumed when the distances are calculated as the model is being fit (see documentation for parameter 'output_dist'). This was added for experimentation purposes only and it's not recommended to pass 'FALSE'. Note that when calculating distances using a tree indexer (after calling isotree.build.indexer), there might be slight discrepancies between the numbers produced with or without the indexer due to what are considered "additional" observations in this calculation. |
build_imputer |
Whether to construct missing-value imputers so that later this same model could be used to impute missing values of new (or the same) observations. Be aware that this will significantly increase the memory requirements and serialized object sizes. Note that this is not related to 'missing_action' as missing values inside the model are treated differently and follow their own imputation or division strategy. |
output_imputations |
Whether to output imputed missing values for 'data'. Passing 'TRUE' here will force 'build_imputer' to 'TRUE'. Note that, for sparse matrix inputs, even though the output will be sparse, it will generate a dense representation of each row with missing values. This is not supported when using sub-sampling, and if sub-sampling is specified, will override it using the full number of rows. |
min_imp_obs |
Minimum number of observations with which an imputation value can be produced. Ignored if passing 'build_imputer' = 'FALSE'. |
depth_imp |
How to weight observations according to their depth when used for imputing missing values. Passing '"higher"' will weigh observations higher the further down the tree (away from the root node) the terminal node is, while '"lower"' will do the opposite, and '"same"' will not modify the weights according to node depth in the tree. Implemented for testing purposes and not recommended to change from the default. Ignored when passing 'build_imputer' = 'FALSE'. |
weigh_imp_rows |
How to weight node sizes when used for imputing missing values. Passing '"inverse"' will weigh a node inversely proportional to the number of observations that end up there, while '"prop"' will weight them heavier the more observations there are, and '"flat"' will weigh all nodes the same in this regard regardless of how many observations end up there. Implemented for testing purposes and not recommended to change from the default. Ignored when passing 'build_imputer' = 'FALSE'. |
output_score |
Whether to output outlierness scores for the input data, which will be calculated as the model is being fit and is thus faster. Cannot be done when using sub-samples of the data for each tree (in such case one will later need to call the 'predict' function on the same data). If using 'penalize_range', the results from this might differ a bit from those of 'predict' called afterwards. This is not supported when using sub-sampling, and if sub-sampling is specified, will override it using the full number of rows. |
output_dist |
Whether to output pairwise distances for the input data, which will be calculated as the model is being fit and is thus faster. Cannot be done when using sub-samples of the data for each tree (in such case one will later need to call the 'predict' function on the same data). If using 'penalize_range', the results from this might differ a bit from those of 'predict' called afterwards. This is not supported when using sub-sampling, and if sub-sampling is specified, will override it using the full number of rows. Note that it might be much faster to calculate distances through a fitted model object with isotree.build.indexer instead of calculating them while fitting like this. |
square_dist |
If passing 'output_dist' = 'TRUE', whether to return a full square matrix or just the upper-triangular part, in which the entry for pair (i,j) with 1 <= i < j <= n is located at position p(i, j) = ((i - 1) * (n - i/2) + j - i). |
sample_weights |
Sample observation weights for each row of 'data', with higher weights indicating either higher sampling probability (i.e. the observation has a larger effect on the fitted model, if using sub-samples), or distribution density (i.e. if the weight is two, it has the same effect of including the same data point twice), according to parameter 'weights_as_sample_prob'. Not supported when calculating pairwise distances while the model is being fit (done by passing 'output_dist' = 'TRUE'). If 'data' is a 'data.frame' and the variable passed here matches to the name of a column in 'data' (with or without enclosing 'sample_weights' in quotes), it will assume the weights are to be taken as that column name. |
column_weights |
Sampling weights for each column in 'data'. Ignored when picking columns by deterministic criterion. If passing 'NULL', each column will have a uniform weight. If used along with kurtosis weights, the effect is multiplicative. Note that, if passing a data.frame with both numeric and categorical columns, the column names must not be repeated, otherwise the column weights passed here will not end up matching. If passing a 'data.frame' to 'data', will assume the column order is the same as in there, regardless of whether the entries passed to 'column_weights' are named or not. |
lazy_serialization |
Whether to use a lazy serialization mechanism for the model C++ objects through the ALTREP system, which would only call the serialization and de-serialization methods when needed. Passing 'TRUE' here has the following effects:
If passing 'FALSE', on the other hand:
|
seed |
Seed that will be used for random number generation. |
use_long_double |
Whether to use 'long double' (extended precision) type for more precise calculations of standard deviations, means, ratios, weights, gain, and other potential aggregates. This makes such calculations accurate to a larger number of decimals (provided that the compiler used has wider long doubles than doubles) and it is highly recommended when the input data has a very large number of rows or columns. If 'long double' is not going to be used, the library can be compiled without support for it (making the library size smaller) by defining an environment variable 'NO_LONG_DOUBLE' before installing this package (e.g. through 'Sys.setenv("NO_LONG_DOUBLE" = "1")' before running the 'install.packages' command). If R itself was compiled without 'long double' support, this library will follow suit and disable long double too. This option is not available on Windows, due to lack of support in some compilers (e.g. msvc) and lack of thread-safety in the calculations in others (e.g. mingw). |
nthreads |
Number of parallel threads to use. Note that, the more threads, the more memory will be allocated, even if the thread does not end up being used. Be aware that most of the operations are bound by memory bandwidth, which means that adding more threads will not result in a linear speed-up. For some types of data (e.g. large sparse matrices with small sample sizes), adding more threads might result in only a very modest speed up (e.g. 1.5x faster with 4x more threads), even if all threads look fully utilized. |
Details
If requesting outlier scores or depths or separation/distance while fitting the model and using multiple threads, there can be small differences in the predicted scores/depth/separation/distance between runs due to roundoff error.
Value
If passing 'output_score' = 'FALSE', 'output_dist' = 'FALSE', and 'output_imputations' = 'FALSE' (the defaults), will output an 'isolation_forest' object from which the 'predict' method can then be called on new data.
If passing 'TRUE' to any of the former options, will output a list with entries:
'model': the 'isolation_forest' object from which new predictions can be made.
'scores': a vector with the outlier score for each input observation (if passing 'output_score' = 'TRUE').
'dist': the distances (either a 'dist' object or a square matrix), if passing 'output_dist' = 'TRUE'.
'imputed': the input data with missing values imputed according to the model (if passing 'output_imputations' = 'TRUE').
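A minimal sketch of requesting scores while fitting (the random 'X' below is purely illustrative):

library(isotree)
set.seed(1)
X <- matrix(rnorm(500 * 3), nrow = 500)
res <- isolation.forest(X, ntrees = 10, nthreads = 1, output_score = TRUE)
class(res$model)     # "isolation_forest", usable with 'predict' on new data
summary(res$scores)  # one outlier score per row of 'X'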
Matching models from references
Shorthands for parameter combinations that match some of the references:
'iForest' (reference [1]): 'ndim=1', 'sample_size=256', 'max_depth=8', 'ntrees=100', 'missing_action="fail"' (an example call with this combination is sketched after this list).
'EIF' (reference [3]): 'ndim=2', 'sample_size=256', 'max_depth=8', 'ntrees=100', 'missing_action="fail"', 'coefs="uniform"', 'standardize_data=FALSE' (plus standardizing the data before passing it).
'SCiForest' (reference [4]): 'ndim=2', 'sample_size=256', 'max_depth=8', 'ntrees=100', 'missing_action="fail"', 'coefs="normal"', 'ntry=10', 'prob_pick_avg_gain=1', 'penalize_range=TRUE'. Might provide much better results with 'max_depth=NULL' despite the reference's recommendation.
'FCF' (reference [11]): 'ndim=2', 'sample_size=256', 'max_depth=NULL', 'ntrees=200', 'missing_action="fail"', 'coefs="normal"', 'ntry=1', 'prob_pick_pooled_gain=1'. Might provide similar or better results with 'ndim=1' and/or sample size as low as 32. For the FCF model aimed at imputing missing values, might give better results with 'ntry=10' or higher and much larger sample sizes.
'RRCF' (reference [12]): 'ndim=1', 'prob_pick_col_by_range=1', 'sample_size=256' or more, 'max_depth=NULL', 'ntrees=100' or more, 'missing_action="fail"'. Note however that reference [12] proposed a different method for calculation of anomaly scores, while this library uses isolation depth just like for 'iForest', so results might differ significantly from those of other libraries. Nevertheless, experiments in reference [11] suggest that isolation depth might be a better scoring metric for this model.
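For instance, the 'iForest' shorthand above corresponds to a call along these lines (the random 'X' is purely illustrative):

library(isotree)
set.seed(1)
X <- matrix(rnorm(1000 * 4), nrow = 1000)
iso_iforest <- isolation.forest(X, ndim = 1, sample_size = 256, max_depth = 8,
                                ntrees = 100, missing_action = "fail",
                                nthreads = 1)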
Model serving considerations
If the model is built with 'nthreads>1', the prediction function predict.isolation_forest will use OpenMP for parallelization. In a Linux setup, one usually has GNU's "gomp" as the OpenMP backend, which will hang when used in a forked process - for example, if one tries to call this prediction function from 'RestRserve', which uses process forking for parallelization, it will cause the whole application to freeze. A potential fix in these cases is to pass 'nthreads=1' to 'predict', or to set the number of threads to 1 in the model object (e.g. 'model$nthreads <- 1L' or calling isotree.set.nthreads), or to compile this library without OpenMP (requires manually altering the 'Makevars' file), or to use a non-GNU OpenMP backend (such as LLVM's 'libomp'). This should not be an issue when using this library normally in e.g. an RStudio session.
The R objects that hold the models contain heap-allocated C++ objects which do not map to R types and which thus do not survive serializations the same way R objects do. In order to make model objects serializable (i.e. usable with 'save', 'saveRDS', and similar), the package offers two mechanisms: (a) a 'lazy_serialization' option which uses the ALTREP system as a workaround, by defining classes with serialization methods but without datapointer methods (see the docs for 'lazy_serialization' for more info); (b) a more theoretically correct way in which raw bytes are produced alongside the model and from which the C++ objects can be reconstructed. When using the lazy serialization system, C++ objects are restored automatically on load and the serialized bytes then discarded, but this is not the case when using the serialized bytes approach. For model serving, one would usually want to drop these serialized bytes after having loaded a model through 'readRDS' or similar (note that reconstructing the C++ object will first require calling isotree.restore.handle, which is done automatically when calling 'predict' and similar), as they can increase memory usage by a large amount. These redundant raw bytes can be dropped as follows: 'model$cpp_objects$model$ser <- NULL' (and an additional 'model$cpp_objects$imputer$ser <- NULL' when using 'build_imputer=TRUE', and 'model$cpp_objects$indexer$ser <- NULL' when building a node indexer). After that, one might want to force garbage collection through 'gc()'.
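Putting the above together, a serving-time cleanup might look as follows (a sketch; the file name is hypothetical, and the '$ser' fields are only present when the model was saved through the serialized-bytes mechanism rather than lazy serialization):

library(isotree)
model <- readRDS("isoforest_model.Rds")    # hypothetical file name
model$nthreads <- 1L                       # avoid OpenMP issues in forked processes
model$cpp_objects$model$ser <- NULL        # drop redundant serialized bytes
# model$cpp_objects$imputer$ser <- NULL    # if fitted with 'build_imputer=TRUE'
# model$cpp_objects$indexer$ser <- NULL    # if a node indexer was built
gc()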
Usually, for serving purposes, one wants a setup as minimalistic as possible (e.g. smaller docker images). This library can be made smaller and faster to compile by disabling some features - particularly, the library will by default build with support for calculation of aggregated metrics (such as standard deviations) in 'long double' precision (an extended precision type), which is a functionality that's unlikely to get used (default is not to use this type as it is slower, and calculations done in the 'predict' function do not use it for anything). Support for 'long double' can be disabled at compile time by setting up an environment variable 'NO_LONG_DOUBLE' before installing the package (e.g. by issuing command 'Sys.setenv("NO_LONG_DOUBLE" = "1")' before 'install.packages').
References
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
Hariri, Sahand, Matias Carrasco Kind, and Robert J. Brunner. "Extended Isolation Forest." arXiv preprint arXiv:1811.02141 (2018).
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "On detecting clustered anomalies using SCiForest." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2010.
Quinlan, J. Ross. "C4. 5: programs for machine learning." Elsevier, 2014.
Cortes, David. "Distance approximation using Isolation Forests." arXiv preprint arXiv:1910.12362 (2019).
Cortes, David. "Imputing missing values with unsupervised random trees." arXiv preprint arXiv:1911.06646 (2019).
Cortes, David. "Revisiting randomized choices in isolation forests." arXiv preprint arXiv:2110.13402 (2021).
Guha, Sudipto, et al. "Robust random cut forest based anomaly detection on streams." International conference on machine learning. PMLR, 2016.
Cortes, David. "Isolation forests: looking beyond tree depth." arXiv preprint arXiv:2111.11639 (2021).
Ting, Kai Ming, Yue Zhu, and Zhi-Hua Zhou. "Isolation kernel and its effect on SVM." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
See Also
predict.isolation_forest, isotree.add.tree, isotree.restore.handle
Examples
### Example 1: detect an obvious outlier
### (Random data from a standard normal distribution)
library(isotree)
set.seed(1)
m <- 100
n <- 2
X <- matrix(rnorm(m * n), nrow = m)
### Will now add obvious outlier point (3, 3) to the data
X <- rbind(X, c(3, 3))
### Fit a small isolation forest model
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
### Check which row has the highest outlier score
pred <- predict(iso, X)
cat("Point with highest outlier score: ",
X[which.max(pred), ], "\n")
### Example 2: plotting outlier regions
### This example shows predicted outlier score in a small
### grid, with a model fit to a bi-modal distribution. As can
### be seen, the extended model is able to detect high
### outlierness outside of both regions, without having false
### ghost regions of low-outlierness where there isn't any data
library(isotree)
oldpar <- par(mfrow = c(2, 2), mar = c(2.5,2.2,2,2.5))
### Randomly-generated data from different distributions
set.seed(1)
group1 <- data.frame(x = rnorm(1000, -1, .4),
y = rnorm(1000, -1, .2))
group2 <- data.frame(x = rnorm(1000, +1, .2),
y = rnorm(1000, +1, .4))
X = rbind(group1, group2)
### Add an obvious outlier which is within the 1d ranges
### (As an interesting test, remove the outlier and see what happens,
### or check how its score changes when using sub-sampling or
### changing the scoring metric for 'ndim=1')
X = rbind(X, c(-1, 1))
### Produce heatmaps
pts = seq(-3, 3, .1)
space_d <- expand.grid(x = pts, y = pts)
plot.space <- function(Z, ttl) {
image(pts, pts, matrix(Z, nrow = length(pts)),
col = rev(heat.colors(50)),
main = ttl, cex.main = 1.4,
xlim = c(-3, 3), ylim = c(-3, 3),
xlab = "", ylab = "")
par(new = TRUE)
plot(X, type = "p", xlim = c(-3, 3), ylim = c(-3, 3),
col = "#0000801A",
axes = FALSE, main = "",
xlab = "", ylab = "")
}
### Now try out different variations of the model
### Single-variable model
iso_simple = isolation.forest(
X, ndim=1,
ntrees=100,
nthreads=1,
penalize_range=FALSE,
prob_pick_pooled_gain=0,
prob_pick_avg_gain=0)
Z1 <- predict(iso_simple, space_d)
plot.space(Z1, "Isolation Forest")
### Extended model
iso_ext = isolation.forest(
X, ndim=2,
ntrees=100,
nthreads=1,
penalize_range=FALSE,
prob_pick_pooled_gain=0,
prob_pick_avg_gain=0)
Z2 <- predict(iso_ext, space_d)
plot.space(Z2, "Extended Isolation Forest")
### SCiForest
iso_sci = isolation.forest(
X, ndim=2, ntry=1,
coefs="normal",
ntrees=100,
nthreads=1,
penalize_range=TRUE,
prob_pick_pooled_gain=0,
prob_pick_avg_gain=1)
Z3 <- predict(iso_sci, space_d)
plot.space(Z3, "SCiForest")
### Fair-cut forest
iso_fcf = isolation.forest(
X, ndim=2,
ntrees=100,
nthreads=1,
penalize_range=FALSE,
prob_pick_pooled_gain=1,
prob_pick_avg_gain=0)
Z4 <- predict(iso_fcf, space_d)
plot.space(Z4, "Fair-Cut Forest")
par(oldpar)
### (As another interesting variation, try setting
### 'penalize_range=TRUE' for the last model)
### Example 3: calculating pairwise distances,
### with a short validation against euclidean dist.
library(isotree)
### Generate random data with 3 dimensions
set.seed(1)
m <- 100
n <- 3
X <- matrix(rnorm(m * n), nrow=m, ncol=n)
### Fit isolation forest model
iso <- isolation.forest(X, ndim=2, ntrees=100, nthreads=1)
### Calculate distances with the model
### (this can be accelerated with 'isotree.build.indexer')
D_iso <- predict(iso, X, type = "dist")
### Check that it correlates with euclidean distance
D_euc <- dist(X, method = "euclidean")
cat(sprintf("Correlation with euclidean distance: %f\n",
cor(D_euc, D_iso)))
### (Note that euclidean distance will never take
### any correlations between variables into account,
### which the isolation forest model can do)
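### (Additional illustration) The condensed distances use the same
### upper-triangular layout as base R 'dist' objects: pair (i, j) with
### i < j sits at position (i - 1) * (n - i/2) + j - i, as described
### for the 'square_dist' parameter. A quick check with base R:
i <- 2; j <- 5
p <- (i - 1) * (m - i / 2) + j - i
all.equal(unname(D_euc[p]), as.matrix(D_euc)[i, j])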
### Example 4: imputing missing values
### (requires package MASS)
library(isotree)
### Generate random data, set some values as NA
if (require("MASS")) {
set.seed(1)
S <- crossprod(matrix(rnorm(5 * 5), nrow = 5))
mu <- rnorm(5)
X <- MASS::mvrnorm(1000, mu, S)
X_na <- X
values_NA <- matrix(runif(1000 * 5) < .15, nrow = 1000)
X_na[values_NA] = NA
### Impute missing values with model
iso <- isolation.forest(
X_na,
build_imputer = TRUE,
prob_pick_pooled_gain = 1,
ndim = 2,
ntry = 10,
nthreads = 1
)
X_imputed <- predict(iso, X_na, type = "impute")
cat(sprintf("MSE for imputed values w/model: %f\n",
mean((X[values_NA] - X_imputed[values_NA])^2)))
### Compare against simple mean imputation
X_means <- apply(X, 2, mean)
X_imp_mean <- X_na
for (cl in 1:5)
X_imp_mean[values_NA[,cl], cl] <- X_means[cl]
cat(sprintf("MSE for imputed values w/means: %f\n",
mean((X[values_NA] - X_imp_mean[values_NA])^2)))
}
#### A more interesting example
#### (requires package outliertree)
### Compare outliers returned by these different methods,
### and see why some of the outliers returned by the
### isolation forest could be flagged as outliers
if (require("outliertree")) {
hypothyroid <- outliertree::hypothyroid
iso <- isolation.forest(hypothyroid, nthreads=1)
pred_iso <- predict(iso, hypothyroid)
otree <- outliertree::outlier.tree(
hypothyroid,
z_outlier = 6,
pct_outliers = 0.02,
outliers_print = 20,
nthreads = 1)
### Now compare against the top
### outliers from isolation forest
head(hypothyroid[order(-pred_iso), ], 20)
}
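### (Additional sketch, see the 'categ_cols' documentation) Passing a
### categorical column in a matrix through 'categ_cols': the column must
### be encoded as integers starting at zero
set.seed(1)
categ_raw <- sample(c("red", "green", "blue"), 100, replace = TRUE)
X_mat <- cbind(num_col   = rnorm(100),
               color_col = as.integer(factor(categ_raw)) - 1L)
iso_categ_cols <- isolation.forest(X_mat, categ_cols = "color_col",
                                   ntrees = 10, nthreads = 1)
head(predict(iso_categ_cols, X_mat))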