discretizeMutual {miic}    R Documentation

Iterative dynamic programming for (conditional) mutual information through optimized discretization.

Description

This function chooses cutpoints in the input distributions by maximizing the mutual information minus a complexity cost (computed either as BIC or with the Normalized Maximum Likelihood). The (conditional) mutual information computed on the optimally discretized distributions effectively approaches the mutual information computed on the original continuous variables.

Usage

discretizeMutual(
  X,
  Y,
  matrix_u = NULL,
  maxbins = NULL,
  cplx = "nml",
  n_eff = NULL,
  sample_weights = NULL,
  is_discrete = NULL,
  plot = TRUE
)

Arguments

X

[a vector] A vector that contains the observational data of the first variable.

Y

[a vector] A vector that contains the observational data of the second variable.

matrix_u

[a numeric matrix] The matrix of observations of the conditioning variables, with one column per conditioning variable.

maxbins

[an int] The maximum number of bins allowed in the discretization. A lower number makes the computation faster; a higher number allows a finer discretization (default: 5 * cube root of N).

cplx

[a string] The complexity cost used in the dynamic programming. Either "mdl" for Minimum Description Length or "nml" for Normalized Maximum Likelihood. The latter is less costly in the finite sample case and will allow more bins than "mdl".

n_eff

[an int] The number of effective samples. When there is significant autocorrelation in the samples, you may want to specify a number of effective samples that is lower than the number of points in the distribution.

sample_weights

[a vector of floats] Individual weights for each sample, used for the same reason as the effective sample number but with individual precision.

is_discrete

[a vector of booleans] Specify if each variable is to be treated as discrete (TRUE) or continuous (FALSE) in a logical vector of length ncol(matrix_u) + 2, in the order [X, Y, U1, U2...]. By default, factors and character vectors are treated as discrete, and numerical vectors as continuous.

plot

[a boolean] Specify if the XY joint space with discretization scheme is to be plotted or not (requires ggplot2 and gridExtra).
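
The documented defaults for is_discrete and maxbins can be reproduced by hand in base R. The following is a minimal sketch of those rules as stated above, not the package's internal code: the helper default_flag is hypothetical, N is assumed to be the number of samples, and any rounding the package applies to the default maxbins is not specified here.

```r
N <- 1000
X <- rnorm(N)                                        # numeric: continuous by default
Y <- factor(sample(c("a", "b"), N, replace = TRUE))  # factor: discrete by default
matrix_u <- matrix(runif(2 * N), ncol = 2)           # two numeric conditioning variables

# Hypothetical helper: TRUE for factors and character vectors, FALSE otherwise.
default_flag <- function(v) is.factor(v) || is.character(v)

# One flag per variable, in the order [X, Y, U1, U2, ...].
is_discrete <- c(default_flag(X), default_flag(Y),
                 apply(matrix_u, 2, default_flag))

# Unrounded value of the documented default, 5 * cube root of N (about 50 here).
maxbins_default <- 5 * N^(1/3)
```

The resulting is_discrete vector has length ncol(matrix_u) + 2, which is the length the argument expects.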

Details

For a pair of variables X and Y, the algorithm alternately chooses cutpoints on X and then on Y, maximizing I(X_{d};Y_{d}) - cplx(X_{d};Y_{d}), where cplx(X_{d};Y_{d}) is the complexity cost of the considered discretizations of X and Y (see Affeldt 2016 and Cabeli 2020). Once the value of I(X_{d};Y_{d}) is stable between two iterations, the discretization scheme of X_{d} and Y_{d} is returned, together with I(X_{d};Y_{d}) and I(X_{d};Y_{d}) - cplx(X_{d};Y_{d}).
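
The trade-off between information and complexity can be illustrated in base R. This sketch is a deliberate simplification and not the package's algorithm: it scans equal-frequency discretizations with the same number of bins k on both variables instead of optimizing cutpoints by dynamic programming, and the per-sample BIC-style penalty in bic_cost is an assumed form chosen for illustration. The helper names plugin_mi and bic_cost are hypothetical.

```r
# Plug-in mutual information (in nats) between two discretized variables.
plugin_mi <- function(xd, yd) {
  joint <- table(xd, yd) / length(xd)   # joint frequencies
  px <- rowSums(joint)
  py <- colSums(joint)
  nz <- joint > 0                       # skip empty cells (0 * log 0 = 0)
  sum(joint[nz] * log(joint[nz] / outer(px, py)[nz]))
}

# Illustrative per-sample BIC-style cost (an assumption, not miic's exact term).
bic_cost <- function(xd, yd) {
  n <- length(xd)
  rx <- nlevels(factor(xd))
  ry <- nlevels(factor(yd))
  0.5 * (rx - 1) * (ry - 1) * log(n) / n
}

set.seed(1)
N <- 1000
Z <- runif(N)
X <- Z * 2 + rnorm(N, sd = 0.2)
Y <- Z * 2 + rnorm(N, sd = 0.2)

# Score equal-frequency discretizations with k bins per variable.
ks <- 2:20
score <- sapply(ks, function(k) {
  xd <- cut(X, quantile(X, 0:k / k), include.lowest = TRUE)
  yd <- cut(Y, quantile(Y, 0:k / k), include.lowest = TRUE)
  plugin_mi(xd, yd) - bic_cost(xd, yd)
})
best_k <- ks[which.max(score)]
```

Without the penalty, the plug-in estimate tends to keep growing as bins are added; subtracting a complexity cost is what makes the maximization select a finite number of bins.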

With a set of conditioning variables U, the discretization scheme maximizes each term of the sum I(X;Y|U) ~ 0.5 * (I(X_{d};Y_{d},U_{d}) - I(X_{d};U_{d}) + I(Y_{d};X_{d},U_{d}) - I(Y_{d};U_{d})).
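
The decomposition above rests on the chain rule I(X;Y|U) = I(X;Y,U) - I(X;U), applied once from the side of X and once from the side of Y. The sketch below checks this in base R on variables that are already discrete, where both halves of the sum coincide; with separately optimized discretizations the two halves may differ, which is presumably why they are averaged. The helpers entropy and mi are hypothetical names, and the simulated data are only an illustration.

```r
# Plug-in joint entropy (in nats) of one or more discrete vectors.
entropy <- function(...) {
  p <- table(...) / length(..1)
  p <- p[p > 0]
  -sum(p * log(p))
}

# I(A;B) = H(A) + H(B) - H(A,B); B may encode several variables jointly.
mi <- function(a, b) entropy(a) + entropy(b) - entropy(a, b)

set.seed(42)
N <- 2000
U <- sample(1:3, N, replace = TRUE)
# X and Y both depend on U but are independent given U: X <- U -> Y
X <- (U + sample(0:1, N, replace = TRUE)) %% 3
Y <- (U + sample(0:1, N, replace = TRUE)) %% 3

YU <- interaction(Y, U, drop = TRUE)   # joint variable (Y, U)
XU <- interaction(X, U, drop = TRUE)   # joint variable (X, U)

term_x <- mi(X, YU) - mi(X, U)         # I(X;Y,U) - I(X;U)
term_y <- mi(Y, XU) - mi(Y, U)         # I(Y;X,U) - I(Y;U)
cmi <- 0.5 * (term_x + term_y)         # estimate of I(X;Y|U)
```

Here mi(X, Y) is clearly positive while cmi stays close to zero, as expected for two variables that share a common cause and are independent once it is known.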

Discrete variables can be passed as factors and will be used "as is" to maximize each term.

Value

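A list that contains the results described in Details: the discretization scheme of the variables, the estimated (conditional) mutual information (the info element used in the Examples), and that estimate minus the complexity cost.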

References

Examples

library(miic)
N <- 1000
# Dependence, conditional independence : X <- Z -> Y
Z <- runif(N)
X <- Z * 2 + rnorm(N, sd = 0.2)
Y <- Z * 2 + rnorm(N, sd = 0.2)
res <- discretizeMutual(X, Y, plot = FALSE)
message("I(X;Y) = ", res$info)
res <- discretizeMutual(X, Y, matrix_u = matrix(Z, ncol = 1), plot = FALSE)
message("I(X;Y|Z) = ", res$info)


# Conditional independence with categorical conditioning variable : X <- Z -> Y
Z <- sample(1:3, N, replace = TRUE)
X <- -as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N)
Y <- as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N)
res <- miic::discretizeMutual(X, Y, cplx = "nml")
message("I(X;Y) = ", res$info)
res <- miic::discretizeMutual(X, Y, matrix_u = matrix(Z, ncol = 1),
                              is_discrete = c(FALSE, FALSE, TRUE))
message("I(X;Y|Z) = ", res$info)


# Independence, conditional dependence : X -> Z <- Y
X <- runif(N)
Y <- runif(N)
Z <- X + Y + rnorm(N, sd = 0.1)
res <- discretizeMutual(X, Y, plot = TRUE)
message("I(X;Y) = ", res$info)
res <- discretizeMutual(X, Y, matrix_u = matrix(Z, ncol = 1), plot = TRUE)
message("I(X;Y|Z) = ", res$info)



[Package miic version 1.5.3 Index]