R: Likelihood Estimation

lik {arf}

R Documentation

Likelihood Estimation

Description

Compute the likelihood of input data, optionally conditioned on some event(s).

Usage

lik(
  params,
  query,
  evidence = NULL,
  arf = NULL,
  oob = FALSE,
  log = TRUE,
  batch = NULL,
  parallel = TRUE
)

Arguments

`params`	Circuit parameters learned via `forde`.
`query`	Data frame of samples, optionally comprising just a subset of training features. Likelihoods will be computed for each sample. Missing features will be marginalized out. See Details.
`evidence`	Optional set of conditioning events. This can take one of three forms: (1) a partial sample, i.e. a single row of data with some but not all columns; (2) a data frame of conditioning events, which allows for inequalities; or (3) a posterior distribution over leaves. See Details.
`arf`	Pre-trained `adversarial_rf` or other object of class `ranger`. This is not required but speeds up computation considerably for total evidence queries. (Ignored for partial evidence queries.)
`oob`	Only use out-of-bag leaves for likelihood estimation? If `TRUE`, `x` must be the same dataset used to train `arf`. Only applicable for total evidence queries.
`log`	Return likelihoods on log scale? Recommended to prevent underflow.
`batch`	Batch size. The default is to compute densities for all of queries in one round, which is always the fastest option if memory allows. However, with large samples or many trees, it can be more memory efficient to split the data into batches. This has no impact on results.
`parallel`	Compute in parallel? Must register backend beforehand, e.g. via `doParallel`.

Details

This function computes the likelihood of input data, optionally conditioned on some event(s). Queries may be partial, i.e. covering some but not all features, in which case excluded variables will be marginalized out.

There are three methods for (optionally) encoding conditioning events via the evidence argument. The first is to provide a partial sample, where some but not all columns from the training data are present. The second is to provide a data frame with three columns: variable, relation, and value. This supports inequalities via relation. Alternatively, users may directly input a pre-calculated posterior distribution over leaves, with columns f_idx and wt. This may be preferable for complex constraints. See Examples.

Value

A vector of likelihoods, optionally on the log scale.

References

Watson, D., Blesch, K., Kapar, J., & Wright, M. (2023). Adversarial random forests for density estimation and generative modeling. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, pp. 5357-5375.

Examples

# Estimate average log-likelihood
arf <- adversarial_rf(iris)
psi <- forde(arf, iris)
ll <- lik(psi, iris, arf = arf, log = TRUE)
mean(ll)

# Identical but slower
ll <- lik(psi, iris, log = TRUE)
mean(ll)

# Partial evidence query
lik(psi, query = iris[1, 1:3])

# Condition on Species = "setosa"
evi <- data.frame(Species = "setosa")
lik(psi, query = iris[1, 1:3], evidence = evi)

# Condition on Species = "setosa" and Petal.Width > 0.3
evi <- data.frame(variable = c("Species", "Petal.Width"),
                  relation = c("==", ">"), 
                  value = c("setosa", 0.3))
lik(psi, query = iris[1, 1:3], evidence = evi)

[Package arf version 0.2.0 Index]