param_em {gmgm}

R Documentation

Learn the parameters of a Gaussian mixture graphical model with incomplete data

Description

This function learns the parameters of a Gaussian mixture graphical model with incomplete data using the parametric EM algorithm. At each iteration, inference (smoothing inference for a dynamic Bayesian network) is performed to complete the data given the current estimate of the parameters (E step). The completed data are then used to update the parameters (M step), and the two steps are repeated. In theory, each iteration is guaranteed not to decrease the log-likelihood, which converges to a local maximum (Koller and Friedman, 2009). In practice, because particle-based inference relies on sampling, the log-likelihood may stop increasing monotonically near the local maximum, in which case the algorithm terminates early.
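
The alternation of E and M steps described above can be illustrated on a much simpler model. The following base R sketch runs EM on a two-component univariate Gaussian mixture, where the hidden quantity is the component label of each observation. It is not the implementation used by `param_em`: the E step here is exact rather than particle-based, and all names (`loglik`, `seq_ll`, `max_iter`) are chosen for this illustration only.

```r
# Minimal EM sketch for a two-component univariate Gaussian mixture.
# Illustrative only: exact E step, base R, no dependency on gmgm.
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1.5))

# Observed-data log-likelihood of the mixture
loglik <- function(x, w, mu, sigma) {
  dens <- w[1] * dnorm(x, mu[1], sigma[1]) + w[2] * dnorm(x, mu[2], sigma[2])
  sum(log(dens))
}

# Crude initial parameters
w <- c(0.5, 0.5)
mu <- c(min(x), max(x))
sigma <- c(sd(x), sd(x))

max_iter <- 20
seq_ll <- numeric(max_iter + 1)
seq_ll[1] <- loglik(x, w, mu, sigma)

for (iter in seq_len(max_iter)) {
  # E step: posterior probability that each observation comes from
  # component 1 (r1) or component 2 (r2) under the current parameters
  d1 <- w[1] * dnorm(x, mu[1], sigma[1])
  d2 <- w[2] * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1

  # M step: weighted maximum likelihood updates of the parameters
  w <- c(mean(r1), mean(r2))
  mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))

  seq_ll[iter + 1] <- loglik(x, w, mu, sigma)
}

# With an exact E step, the log-likelihood should never decrease
# (a small tolerance absorbs floating-point error)
all(diff(seq_ll) >= -1e-8)
```

With a sampling-based E step, as in `param_em`, the same check can fail close to the local maximum, which is the situation the description refers to.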

Usage

param_em(
  gmgm,
  data,
  nodes = structure(gmgm)$nodes,
  col_seq = NULL,
  n_part = 1000,
  max_part_sim = 1e+06,
  min_ess = 1,
  max_iter_pem = 5,
  verbose = FALSE,
  ...
)

Arguments

`gmgm`	An object of class `gmbn` (non-temporal) or `gmdbn`.
`data`	A data frame containing the data used for learning. Its columns must explicitly be named after nodes of `gmgm` and can contain missing values (columns with no value can be removed).
`nodes`	A character vector containing the nodes whose local conditional models are learned (by default all the nodes of `gmgm`). If `gmgm` is a `gmdbn` object, the same nodes are learned for each of its `gmbn` elements. This constraint can be overcome by passing a list of character vectors named after some of these elements (`b_1`, ...) and containing learned nodes specific to them.
`col_seq`	A character vector containing the column names of `data` that describe the observation sequence. If `NULL` (the default), all the observations belong to a single sequence. If `gmgm` is a `gmdbn` object, the observations of the same sequence must be ordered such that the `t`th one is related to time slice `t` (note that the sequences can have different lengths). If `gmgm` is a `gmbn` object, this argument is ignored.
`n_part`	A positive integer corresponding to the number of particles generated for each observation (if `gmgm` is a `gmbn` object) or observation sequence (if `gmgm` is a `gmdbn` object) during inference.
`max_part_sim`	An integer greater than or equal to `n_part` corresponding to the maximum number of particles that can be processed simultaneously during inference. This argument is used to prevent memory overflow, dividing `data` into smaller subsets that are handled sequentially.
`min_ess`	A numeric value in [0, 1] corresponding to the minimum effective sample size (ESS, expressed as a proportion of `n_part`) under which the renewal step of sequential importance resampling is performed. If `1` (the default), this step is performed at each time slice. If `gmgm` is a `gmbn` object, this argument is ignored.
`max_iter_pem`	A non-negative integer corresponding to the maximum number of iterations.
`verbose`	A logical value indicating whether iterations in progress are displayed.
`...`	Additional arguments passed to function `em`.
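
To make the role of `min_ess` more concrete, the sketch below computes the effective sample size of a set of importance weights with the standard estimator (the inverse of the sum of squared normalized weights) and compares it with the threshold `min_ess * n_part`. The functions `ess` and `needs_resampling` are hypothetical helpers written for this illustration and are not part of gmgm. The exact computation and comparison performed internally by the package are assumptions here.

```r
# Effective sample size of importance weights (standard estimator).
# Hypothetical helper, not a gmgm function.
ess <- function(weights) {
  w <- weights / sum(weights)  # normalize the weights to sum to 1
  1 / sum(w^2)
}

# Hypothetical helper: TRUE when the ESS has fallen to or below the
# threshold min_ess * n_part (assumed rule, n_part = number of particles)
needs_resampling <- function(weights, min_ess) {
  ess(weights) <= min_ess * length(weights)
}

n_part <- 1000

# Uniform weights: every particle contributes equally, ESS is n_part
uniform_w <- rep(1, n_part)

# Degenerate weights: one particle dominates, ESS is close to 1
skewed_w <- c(1, rep(1e-6, n_part - 1))

ess(uniform_w)
ess(skewed_w)

# With a threshold of half the particles, only the degenerate set
# would trigger the renewal step
needs_resampling(uniform_w, min_ess = 0.5)
needs_resampling(skewed_w, min_ess = 0.5)
```

Since the ESS lies between 1 and `n_part`, a threshold of `min_ess = 1` is reached by any set of weights, which is consistent with the renewal step being performed at each time slice by default, whereas `min_ess = 0` would never trigger it under this rule.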

Value

A list with elements:

`gmgm`	The final `gmbn` or `gmdbn` object (with the highest log-likelihood).
`data`	A data frame (tibble) containing the complete data used to learn the final `gmbn` or `gmdbn` object.
`seq_loglik`	A numeric matrix containing the sequence of log-likelihoods measured after the E and M steps of each iteration.

References

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.

Examples


set.seed(0)
data(data_body)
data_1 <- data_body
data_1$GENDER[sample.int(2148, 430)] <- NA
data_1$AGE[sample.int(2148, 430)] <- NA
data_1$HEIGHT[sample.int(2148, 430)] <- NA
data_1$WEIGHT[sample.int(2148, 430)] <- NA
data_1$FAT[sample.int(2148, 430)] <- NA
data_1$WAIST[sample.int(2148, 430)] <- NA
data_1$GLYCO[sample.int(2148, 430)] <- NA
gmbn_1 <- gmbn(
  AGE = split_comp(add_var(NULL, data_1[, "AGE"]), n_sub = 3),
  FAT = split_comp(add_var(NULL,
                           data_1[, c("FAT", "GENDER", "HEIGHT", "WEIGHT")]),
                   n_sub = 2),
  GENDER = split_comp(add_var(NULL, data_1[, "GENDER"]), n_sub = 2),
  GLYCO = split_comp(add_var(NULL, data_1[, c("GLYCO", "AGE", "WAIST")]),
                     n_sub = 2),
  HEIGHT = split_comp(add_var(NULL, data_1[, c("HEIGHT", "GENDER")])),
  WAIST = split_comp(add_var(NULL,
                             data_1[, c("WAIST", "AGE", "FAT", "HEIGHT",
                                        "WEIGHT")]),
                     n_sub = 3),
  WEIGHT = split_comp(add_var(NULL, data_1[, c("WEIGHT", "HEIGHT")]), n_sub = 2)
)
res_learn_1 <- param_em(gmbn_1, data_1, verbose = TRUE)

library(dplyr)
set.seed(0)
data(data_air)
data_2 <- data_air
data_2$NO2[sample.int(7680, 1536)] <- NA
data_2$O3[sample.int(7680, 1536)] <- NA
data_2$TEMP[sample.int(7680, 1536)] <- NA
data_2$WIND[sample.int(7680, 1536)] <- NA
data_3 <- data_2 %>%
  group_by(DATE) %>%
  mutate(NO2.1 = lag(NO2), O3.1 = lag(O3), TEMP.1 = lag(TEMP),
         WIND.1 = lag(WIND)) %>%
  ungroup()
gmdbn_1 <- gmdbn(
  b_2 = gmbn(
    NO2 = split_comp(add_var(NULL, data_3[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data_3[, c("O3", "NO2", "NO2.1", "O3.1", "TEMP",
                                       "TEMP.1")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data_3[, c("TEMP", "TEMP.1")]), n_sub = 3),
    WIND = split_comp(add_var(NULL, data_3[, c("WIND", "WIND.1")]), n_sub = 3)
  ),
  b_13 = gmbn(
    NO2 = split_comp(add_var(NULL, data_3[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data_3[, c("O3", "O3.1", "TEMP", "TEMP.1",
                                       "WIND")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data_3[, c("TEMP", "TEMP.1")]), n_sub = 3),
    WIND = split_comp(add_var(NULL, data_3[, c("WIND", "WIND.1")]), n_sub = 3)
  )
)
res_learn_2 <- param_em(gmdbn_1, data_2, col_seq = "DATE", verbose = TRUE)

[Package gmgm version 1.1.2]