adversarial_rf {arf} | R Documentation |
Implements an adversarial random forest to learn independence-inducing splits.
adversarial_rf(
x,
num_trees = 10L,
min_node_size = 2L,
delta = 0,
max_iters = 10L,
early_stop = TRUE,
verbose = TRUE,
parallel = TRUE,
...
)
x |
Input data. Integer variables are recoded as ordered factors with a warning. See Details. |
num_trees |
Number of trees to grow in each forest. The default works well for most generative modeling tasks, but should be increased for likelihood estimation. See Details. |
min_node_size |
Minimal number of real data samples in leaf nodes. |
delta |
Tolerance parameter. Algorithm converges when OOB accuracy is
< 0.5 + delta. |
max_iters |
Maximum iterations for the adversarial loop. |
early_stop |
Terminate loop if performance fails to improve from one round to the next? |
verbose |
Print discriminator accuracy after each round? |
parallel |
Compute in parallel? Must register backend beforehand, e.g.
via |
... |
Extra parameters to be passed to |
The adversarial random forest (ARF) algorithm partitions data into fully
factorized leaves where features are jointly independent. ARFs are trained
iteratively, with alternating rounds of generation and discrimination. In
the first instance, synthetic data is generated via independent bootstraps of
each feature, and a RF classifier is trained to distinguish between real and
synthetic samples. In subsequent rounds, synthetic data is generated
separately in each leaf, using splits from the previous forest. This creates
increasingly realistic data that satisfies local independence by construction.
The algorithm converges when a RF cannot reliably distinguish between the two
classes, i.e. when OOB accuracy falls below 0.5 + delta.
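The first round and the stopping rule described above can be sketched in base R. This is only an illustration, not the package's internal code: the helpers sample_marginals and has_converged are hypothetical names introduced here, and the discriminator itself (a random forest trained on the stacked data) is left out.

```r
set.seed(42)

# Hypothetical helper (not part of arf): resample every column independently.
# This preserves each marginal distribution but breaks dependencies
# between features.
sample_marginals <- function(x) {
  n <- nrow(x)
  synth <- lapply(x, function(col) col[sample.int(n, n, replace = TRUE)])
  as.data.frame(synth, stringsAsFactors = FALSE)
}

# Hypothetical helper (not part of arf): the stopping rule stated above.
has_converged <- function(oob_accuracy, delta = 0) {
  oob_accuracy < 0.5 + delta
}

x_real <- iris
x_synth <- sample_marginals(x_real)

# Stack both sets with a class label, ready for a real-vs-synthetic classifier.
lv <- c("real", "synthetic")
dat <- rbind(
  data.frame(y = factor("real", levels = lv), x_real),
  data.frame(y = factor("synthetic", levels = lv), x_synth)
)

has_converged(0.52, delta = 0.05)  # TRUE
has_converged(0.80, delta = 0.05)  # FALSE
```

In later rounds the same resampling would be applied within each leaf of the previous forest rather than over the whole data set.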
ARFs are useful for several unsupervised learning tasks, such as density
estimation (see forde) and data synthesis (see forge). For the former, we
recommend increasing the number of trees for improved performance (typically
on the order of 100-1000 depending on sample size).
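A minimal sketch of this workflow follows. It assumes that forde takes the fitted forest and the training data as its first two arguments, and that forge takes the forde output plus an n_synth argument. Those call shapes are not stated on this page, so check them against the forde and forge help pages before relying on them.

```r
library(arf)

# More trees than the default, as recommended for density estimation.
arf <- adversarial_rf(iris, num_trees = 100L, parallel = FALSE)

# Density estimation: leaf-wise distribution parameters.
# (Argument order is an assumption; see ?forde.)
psi <- forde(arf, iris)

# Data synthesis: draw 50 synthetic rows from the estimated density.
# (The n_synth argument name is an assumption; see ?forge.)
x_synth <- forge(psi, n_synth = 50)
```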
Integer variables are treated as ordered factors by default. If the ARF is
passed to forde, the estimated distribution for these variables will
only have support on observed factor levels (i.e., the output will be a pmf,
not a pdf). To override this behavior and assign nonzero density to
intermediate values, explicitly recode the features as numeric.
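The recoding can be done in base R before the data is passed to adversarial_rf. The data frame below is a made-up example, used only to show the conversion.

```r
# Toy data with one integer column (hypothetical example data).
df <- data.frame(
  count  = c(1L, 2L, 2L, 5L, 8L),
  height = c(1.2, 3.4, 2.2, 5.0, 4.1)
)

# Convert every integer column to double so that it is treated as
# continuous rather than recoded as an ordered factor.
int_cols <- vapply(df, is.integer, logical(1))
df[int_cols] <- lapply(df[int_cols], as.numeric)

str(df)
```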
Note: convergence is not guaranteed in finite samples. The max_iters
argument sets an upper bound on the number of training rounds. Similar
results may be attained by increasing delta. Even a single round can
often give good performance, but data with strong or complex dependencies may
require more iterations. With the default early_stop = TRUE, the
adversarial loop terminates if performance does not improve from one round
to the next, in which case further training may be pointless.
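As a sketch of how these controls combine, the calls below use only arguments listed in the usage section. The specific values are arbitrary illustrations, not recommendations.

```r
library(arf)

# Strict tolerance with a higher cap on rounds; early stopping stays on.
arf_strict <- adversarial_rf(
  iris,
  delta = 0,
  max_iters = 20L,
  parallel = FALSE
)

# Looser tolerance: the loop stops once OOB accuracy drops below 0.6.
# Early stopping is disabled, so only delta and max_iters end the loop.
arf_loose <- adversarial_rf(
  iris,
  delta = 0.1,
  max_iters = 20L,
  early_stop = FALSE,
  verbose = FALSE,
  parallel = FALSE
)
```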
A random forest object of class ranger.
Watson, D., Blesch, K., Kapar, J., & Wright, M. (2022). Adversarial random forests for density estimation and generative modeling. arXiv preprint, 2205.09435.
arf <- adversarial_rf(iris)