struct_em {gmgm}    R Documentation

Learn the structure and the parameters of a Gaussian mixture graphical model with incomplete data

Description

This function learns the structure and the parameters of a Gaussian mixture graphical model with incomplete data using the structural EM algorithm. At each iteration, the parametric EM algorithm is performed to complete the data and update the parameters (E step). The completed data are then used to update the structure (M step), and so on. Each iteration is guaranteed to increase the scoring function until convergence to a local maximum (Koller and Friedman, 2009). In practice, because of the sampling process inherent in particle-based inference, the score may stop increasing monotonically as it approaches the local maximum, in which case the algorithm terminates early.
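The alternation of E and M steps, together with the early stop when the score no longer increases, can be sketched as a simple control loop. This is a minimal illustration in base R and not the package's implementation: run_sem, e_step, m_step and score_fn are hypothetical names, the E step is reduced to data completion only, and the toy steps at the bottom merely replay a fixed sequence of scores.

```r
# Hypothetical skeleton of a structural EM loop (not gmgm's actual code).
# e_step(model, data)   -> completed data (stands in for the parametric EM)
# m_step(model, data)   -> model with an updated structure
# score_fn(model, data) -> numeric score, higher is better
run_sem <- function(model, data, e_step, m_step, score_fn, max_iter = 5) {
  best <- list(model = model, score = -Inf, n_iter = 0)
  for (i in seq_len(max_iter)) {
    completed <- e_step(model, data)
    candidate <- m_step(model, completed)
    new_score <- score_fn(candidate, completed)
    # Stop as soon as the score fails to increase and keep the best model.
    if (new_score <= best$score) break
    best <- list(model = candidate, score = new_score, n_iter = i)
    model <- candidate
  }
  best
}

# Toy usage: the "model" is just an iteration counter, and the scores are
# replayed from a fixed vector whose fourth value drops, as sampling noise
# might cause near a local maximum.
toy_scores <- c(-120, -100, -95, -96, -90)
res <- run_sem(
  model = 0, data = NULL,
  e_step = function(model, data) data,
  m_step = function(model, data) model + 1,
  score_fn = function(model, data) toy_scores[model],
  max_iter = 5
)
res$n_iter  # 3
res$score   # -95
```

The loop returns the model from the third iteration: the fourth score is lower, so the loop ends before the later, higher score could be reached.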

Usage

struct_em(
  gmgm,
  data,
  nodes = structure(gmgm)$nodes,
  arcs_cand = tibble(lag = 0),
  col_seq = NULL,
  score = "bic",
  n_part = 1000,
  max_part_sim = 1e+06,
  min_ess = 1,
  max_iter_sem = 5,
  max_iter_pem = 5,
  verbose = FALSE,
  ...
)

Arguments

gmgm

An object of class gmbn (non-temporal) or gmdbn.

data

A data frame containing the data used for learning. Its columns must explicitly be named after nodes of gmgm and can contain missing values (columns with no value can be removed).

nodes

A character vector containing the nodes whose local conditional models are learned (by default all the nodes of gmgm). If gmgm is a gmdbn object, the same nodes are learned for each of its gmbn elements. This constraint can be overcome by passing a list of character vectors named after some of these elements (b_1, ...) and containing learned nodes specific to them.

arcs_cand

A data frame containing the candidate arcs for addition or removal (by default all possible non-temporal arcs). The column from describes the start node, the column to the end node and the column lag the time lag between them. Missing values in from or to are interpreted as "all possible nodes", which makes it possible to quickly define large sets of arcs that share common attributes. Missing values in lag are replaced by 0. If gmgm is a gmdbn object, the same candidate arcs are used for each of its gmbn elements. This constraint can be overcome by passing a list of data frames named after some of these elements (b_1, ...) and containing candidate arcs specific to them. Arcs already in gmgm that are not candidates cannot be removed. Therefore, setting arcs_cand to NULL is equivalent to learning only the mixture structure (and the parameters) of the model.
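As an illustration of the missing-value conventions for arcs_cand, the following base R sketch builds a candidate table for a hypothetical model with nodes A, B and C and makes the implied arcs explicit. The expansion code is only meant to show what the NA wildcard stands for. It is not how the package processes the table internally, and dropping lag-0 self-loops is an assumption based on the fact that such arcs cannot exist in a directed acyclic graph.

```r
# Hypothetical node names; NA in `from` means "all possible nodes".
nodes <- c("A", "B", "C")
arcs_cand <- data.frame(
  from = c("A", NA, "B"),
  to   = c("B", "C", "C"),
  lag  = c(0, 0, NA),
  stringsAsFactors = FALSE
)

# Missing lags are documented as being replaced by 0.
arcs_cand$lag[is.na(arcs_cand$lag)] <- 0

# Expand the wildcard in `from` (only `from` is handled in this sketch).
expand_from <- function(row) {
  from <- if (is.na(row$from)) nodes else row$from
  data.frame(from = from, to = row$to, lag = row$lag,
             stringsAsFactors = FALSE)
}
expanded <- do.call(
  rbind,
  lapply(seq_len(nrow(arcs_cand)), function(i) expand_from(arcs_cand[i, ]))
)

# Drop lag-0 self-loops (assumption) and duplicated arcs.
expanded <- expanded[!(expanded$from == expanded$to & expanded$lag == 0), ]
expanded <- unique(expanded)
expanded  # three arcs: A -> B, A -> C, B -> C, all with lag 0
```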

col_seq

A character vector containing the column names of data that describe the observation sequence. If NULL (the default), all the observations belong to a single sequence. If gmgm is a gmdbn object, the observations of the same sequence must be ordered such that the tth one is related to time slice t (note that the sequences can have different lengths). If gmgm is a gmbn object, this argument is ignored.
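The ordering requirement for col_seq can be met by sorting the data before learning. The sketch below uses base R and hypothetical column names: SEQ identifies the sequence, TIME is a helper column giving the position within the sequence, and X is a node of the model.

```r
# Hypothetical long-format data with two sequences of different lengths.
obs <- data.frame(
  SEQ  = c("s2", "s1", "s1", "s2", "s1"),
  TIME = c(2, 3, 1, 1, 2),
  X    = c(0.4, NA, 1.2, 0.9, 1.5),
  stringsAsFactors = FALSE
)

# Sort so that, within each sequence, row t corresponds to time slice t.
obs <- obs[order(obs$SEQ, obs$TIME), ]

# The helper column is not a node, so it is dropped before learning.
obs$TIME <- NULL

table(obs$SEQ)  # s1 has 3 observations, s2 has 2
```

With data prepared this way, col_seq = "SEQ" would identify the sequences.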

score

A character string ("aic", "bic" or "loglik") corresponding to the scoring function.
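For orientation, the three scoring functions can be related to one another through the log-likelihood. The sketch below assumes the common higher-is-better convention in which a complexity penalty is subtracted from the log-likelihood. The helper score_from_loglik is hypothetical, and the package may scale its scores differently.

```r
# Penalized scores in the higher-is-better form (assumed convention).
# loglik: log-likelihood, n_par: number of free parameters,
# n_obs: number of observations.
score_from_loglik <- function(loglik, n_par, n_obs,
                              score = c("bic", "aic", "loglik")) {
  score <- match.arg(score)
  switch(score,
    loglik = loglik,
    aic    = loglik - n_par,
    bic    = loglik - 0.5 * log(n_obs) * n_par
  )
}

score_from_loglik(-1500, n_par = 20, n_obs = 1000, score = "loglik")  # -1500
score_from_loglik(-1500, n_par = 20, n_obs = 1000, score = "aic")     # -1520
score_from_loglik(-1500, n_par = 20, n_obs = 1000, score = "bic")
```

Under this convention, the BIC penalizes each parameter more heavily than the AIC as soon as n_obs exceeds about 7 (log(n_obs) / 2 > 1), which tends to favor sparser structures.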

n_part

A positive integer corresponding to the number of particles generated for each observation (if gmgm is a gmbn object) or observation sequence (if gmgm is a gmdbn object) during inference.

max_part_sim

An integer greater than or equal to n_part corresponding to the maximum number of particles that can be processed simultaneously during inference. This argument is used to prevent memory overflow, dividing data into smaller subsets that are handled sequentially.
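The effect of max_part_sim can be pictured with a little arithmetic. The sketch below assumes that each subset holds as many observations as fit within the particle budget, that is max_part_sim %/% n_part. The exact splitting rule used by the package is not stated here, so this is only an illustration.

```r
n_part       <- 1000   # particles per observation
max_part_sim <- 1e6    # particle budget per subset
n_obs        <- 2148   # number of observations (illustrative)

# Assumed rule: observations per subset = budget %/% particles per observation.
obs_per_batch <- max_part_sim %/% n_part

# Assign each observation to a subset processed in turn.
batch_id <- (seq_len(n_obs) - 1) %/% obs_per_batch + 1
table(batch_id)  # subsets of 1000, 1000 and 148 observations
```

Lowering max_part_sim reduces peak memory use at the cost of more sequential passes.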

min_ess

A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of n_part) under which the renewal step of sequential importance resampling is performed. If 1 (the default), this step is performed at each time slice. If gmgm is a gmbn object, this argument is ignored.
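The effective sample size (ESS) is commonly estimated from the normalized particle weights as 1 / sum(w^2). The sketch below computes it as a proportion of the number of particles, which is the scale on which min_ess is expressed. The helper ess_ratio is hypothetical, and the package's internal computation is assumed to follow this standard estimator.

```r
# ESS as a proportion of the number of particles (standard estimator).
ess_ratio <- function(weights) {
  w <- weights / sum(weights)      # normalize the weights
  (1 / sum(w^2)) / length(w)
}

ess_ratio(rep(1, 1000))            # 1: uniform weights, no degeneracy
ess_ratio(c(1, rep(0, 999)))       # 0.001: one particle carries all the weight
```

A ratio close to 0 indicates that a few particles dominate, which is the situation the renewal step is meant to correct.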

max_iter_sem

A non-negative integer corresponding to the maximum number of iterations of the structural EM algorithm.

max_iter_pem

A non-negative integer corresponding to the maximum number of iterations of the parametric EM algorithm.

verbose

A logical value indicating whether iterations in progress are displayed.

...

Additional arguments passed to the function stepwise.

Value

A list with elements:

gmgm

The final gmbn or gmdbn object (with the highest score).

data

A data frame (tibble) containing the complete data used to learn the final gmbn or gmdbn object.

seq_score

A numeric matrix containing the sequence of scores measured after the E and M steps of each iteration.

References

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.

See Also

param_em, param_learn, struct_learn

Examples


set.seed(0)
data(data_body)
data_1 <- data_body
# Make the data incomplete: set 430 randomly chosen values of each variable
# to NA.
data_1$GENDER[sample.int(2148, 430)] <- NA
data_1$AGE[sample.int(2148, 430)] <- NA
data_1$HEIGHT[sample.int(2148, 430)] <- NA
data_1$WEIGHT[sample.int(2148, 430)] <- NA
data_1$FAT[sample.int(2148, 430)] <- NA
data_1$WAIST[sample.int(2148, 430)] <- NA
data_1$GLYCO[sample.int(2148, 430)] <- NA
gmbn_1 <- add_nodes(NULL,
                    c("AGE", "FAT", "GENDER", "GLYCO", "HEIGHT", "WAIST",
                      "WEIGHT"))
arcs_cand_1 <- data.frame(from = c("AGE", "GENDER", "HEIGHT", "WEIGHT", NA,
                                   "AGE", "GENDER", "AGE", "FAT", "GENDER",
                                   "HEIGHT", "WEIGHT", "AGE", "GENDER",
                                   "HEIGHT"),
                          to = c("FAT", "FAT", "FAT", "FAT", "GLYCO", "HEIGHT",
                                 "HEIGHT", "WAIST", "WAIST", "WAIST", "WAIST",
                                 "WAIST", "WEIGHT", "WEIGHT", "WEIGHT"))
res_learn_1 <- struct_em(gmbn_1, data_1, arcs_cand = arcs_cand_1,
                         verbose = TRUE, max_comp = 3)

set.seed(0)
data(data_air)
data_2 <- data_air
# Make the data incomplete: set 1536 randomly chosen values of each variable
# to NA.
data_2$NO2[sample.int(7680, 1536)] <- NA
data_2$O3[sample.int(7680, 1536)] <- NA
data_2$TEMP[sample.int(7680, 1536)] <- NA
data_2$WIND[sample.int(7680, 1536)] <- NA
gmdbn_1 <- gmdbn(b_2 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")),
                 b_13 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")))
arcs_cand_2 <- data.frame(from = c("NO2", "NO2", "NO2", "O3", "TEMP", "TEMP",
                                   "WIND", "WIND"),
                          to = c("NO2", "O3", "O3", "O3", NA, NA, NA, NA),
                          lag = c(1, 0, 1, 1, 0, 1, 0, 1))
res_learn_2 <- struct_em(gmdbn_1, data_2, arcs_cand = arcs_cand_2,
                         col_seq = "DATE", verbose = TRUE, max_comp = 3)


[Package gmgm version 1.1.2 Index]