R: Improve MrP through ensemble learning.

auto_MrP {autoMrP}

R Documentation

Improve MrP through ensemble learning.

Description

This package improves the prediction performance of multilevel regression with post-stratification (MrP) by combining a number of machine learning methods through ensemble Bayesian model averaging (EBMA).

Usage

auto_MrP(
  y,
  L1.x,
  L2.x,
  L2.unit,
  L2.reg = NULL,
  L2.x.scale = TRUE,
  pcs = NULL,
  folds = NULL,
  bin.proportion = NULL,
  bin.size = NULL,
  survey,
  census,
  ebma.size = 1/3,
  cores = 1,
  k.folds = 5,
  cv.sampling = "L2 units",
  loss.unit = c("individuals", "L2 units"),
  loss.fun = c("msfe", "cross-entropy", "f1", "MSE"),
  best.subset = TRUE,
  lasso = TRUE,
  pca = TRUE,
  gb = TRUE,
  svm = TRUE,
  mrp = FALSE,
  deep.mrp = FALSE,
  oversampling = FALSE,
  best.subset.L2.x = NULL,
  lasso.L2.x = NULL,
  pca.L2.x = NULL,
  gb.L2.x = NULL,
  svm.L2.x = NULL,
  mrp.L2.x = NULL,
  gb.L2.unit = TRUE,
  gb.L2.reg = FALSE,
  svm.L2.unit = TRUE,
  svm.L2.reg = FALSE,
  deep.L2.x = NULL,
  deep.L2.reg = TRUE,
  deep.splines = TRUE,
  lasso.lambda = NULL,
  lasso.n.iter = 100,
  gb.interaction.depth = c(1, 2, 3),
  gb.shrinkage = c(0.04, 0.01, 0.008, 0.005, 0.001),
  gb.n.trees.init = 50,
  gb.n.trees.increase = 50,
  gb.n.trees.max = 1000,
  gb.n.minobsinnode = 20,
  svm.kernel = c("radial"),
  svm.gamma = NULL,
  svm.cost = NULL,
  ebma.n.draws = 100,
  ebma.tol = c(0.01, 0.005, 0.001, 5e-04, 1e-04, 5e-05, 1e-05),
  verbose = FALSE,
  uncertainty = FALSE,
  boot.iter = NULL
)

Arguments

`y`	Outcome variable. A character scalar containing the column name of the outcome variable in `survey`.
`L1.x`	Individual-level covariates. A character vector containing the column names of the individual-level variables in `survey` and `census` used to predict outcome `y`. Note that geographic unit is specified in argument `L2.unit`.
`L2.x`	Context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` used to predict outcome `y`. To exclude context-level variables, set `L2.x = NULL`.
`L2.unit`	Geographic unit. A character scalar containing the column name of the geographic unit in `survey` and `census` at which outcomes should be aggregated.
`L2.reg`	Geographic region. A character scalar containing the column name of the geographic region in `survey` and `census` by which geographic units are grouped (`L2.unit` must be nested within `L2.reg`). Default is `NULL`.
`L2.x.scale`	Scale context-level covariates. A logical argument indicating whether the context-level covariates should be normalized. Default is `TRUE`. Note that if set to `FALSE`, then the context-level covariates should be normalized prior to calling `auto_MrP()`.
`pcs`	Principal components. A character vector containing the column names of the principal components of the context-level variables in `survey` and `census`. Default is `NULL`.
`folds`	EBMA and cross-validation folds. A character scalar containing the column name of the variable in `survey` that specifies the fold to which an observation is allocated. The variable should contain integers running from `1` to `k + 1`, where `k` is the number of cross-validation folds. Value `k + 1` refers to the EBMA fold. Default is `NULL`. Note: if `folds` is `NULL`, then `ebma.size`, `k.folds`, and `cv.sampling` must be specified.
`bin.proportion`	Proportion of ideal types. A character scalar containing the column name of the variable in `census` that indicates the proportion of individuals by ideal type and geographic unit. Default is `NULL`. Note: if `bin.proportion` is `NULL`, then `bin.size` must be specified.
`bin.size`	Bin size of ideal types. A character scalar containing the column name of the variable in `census` that indicates the bin size of ideal types by geographic unit. Default is `NULL`. Note: ignored if `bin.proportion` is provided, but must be specified otherwise.
`survey`	Survey data. A `data.frame` whose column names include `y`, `L1.x`, `L2.x`, `L2.unit`, and, if specified, `L2.reg`, `pcs`, and `folds`.
`census`	Census data. A `data.frame` whose column names include `L1.x`, `L2.x`, `L2.unit`, either `bin.proportion` or `bin.size`, and, if specified, `L2.reg` and `pcs`.
`ebma.size`	EBMA fold size. A number in the open unit interval indicating the proportion of respondents to be allocated to the EBMA fold. Default is `1/3`. Note: ignored if `folds` is provided, but must be specified otherwise.
`cores`	The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is `1`.
`k.folds`	Number of cross-validation folds. An integer-valued scalar indicating the number of folds to be used in cross-validation. Default is `5`. Note: ignored if `folds` is provided, but must be specified otherwise.
`cv.sampling`	Cross-validation sampling method. A character-valued scalar indicating whether cross-validation folds should be created by sampling individual respondents (`individuals`) or geographic units (`L2 units`). Default is `L2 units`. Note: ignored if `folds` is provided, but must be specified otherwise.
`loss.unit`	Loss function unit. A character-valued scalar or vector indicating whether performance loss should be evaluated at the level of individual respondents (`individuals`), geographic units (`L2 units`), or both. Default is `c("individuals", "L2 units")`. With multiple loss units, parameters are ranked for each loss unit and the parameter combination with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.
`loss.fun`	Loss function. A character-valued scalar or vector indicating whether prediction loss should be measured by the mean squared error (`MSE`), the mean absolute error (`MAE`), binary cross-entropy (`cross-entropy`), mean squared false error (`msfe`), the F1 score (`f1`), or a combination thereof. Default is `c("msfe", "cross-entropy", "f1", "MSE")`. With multiple loss functions, parameters are ranked for each loss function and the parameter combination with the lowest rank sum is chosen. Ties are broken according to the order in the search grid.
`best.subset`	Best subset classifier. A logical argument indicating whether the best subset classifier should be used for predicting outcome `y`. Default is `TRUE`.
`lasso`	Lasso classifier. A logical argument indicating whether the lasso classifier should be used for predicting outcome `y`. Default is `TRUE`.
`pca`	PCA classifier. A logical argument indicating whether the PCA classifier should be used for predicting outcome `y`. Default is `TRUE`.
`gb`	GB classifier. A logical argument indicating whether the GB classifier should be used for predicting outcome `y`. Default is `TRUE`.
`svm`	SVM classifier. A logical argument indicating whether the SVM classifier should be used for predicting outcome `y`. Default is `TRUE`.
`mrp`	MRP classifier. A logical argument indicating whether the standard MRP classifier should be used for predicting outcome `y`. Default is `FALSE`.
`deep.mrp`	Deep MRP classifier. A logical argument indicating whether the deep MRP classifier should be used for predicting outcome `y`. Default is `FALSE`.
`oversampling`	Oversampling. A logical argument indicating whether to oversample in order to create balance on the dependent variable. Default is `FALSE`.
`best.subset.L2.x`	Best subset context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` to be used by the best subset classifier. If `NULL` and `best.subset` is set to `TRUE`, then best subset uses the variables specified in `L2.x`. Default is `NULL`.
`lasso.L2.x`	Lasso context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` to be used by the lasso classifier. If `NULL` and `lasso` is set to `TRUE`, then lasso uses the variables specified in `L2.x`. Default is `NULL`.
`pca.L2.x`	PCA context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` whose principal components are to be used by the PCA classifier. If `NULL` and `pca` is set to `TRUE`, then PCA uses the principal components of the variables specified in `L2.x`. Default is `NULL`.
`gb.L2.x`	GB context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` to be used by the GB classifier. If `NULL` and `gb` is set to `TRUE`, then GB uses the variables specified in `L2.x`. Default is `NULL`.
`svm.L2.x`	SVM context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` to be used by the SVM classifier. If `NULL` and `svm` is set to `TRUE`, then SVM uses the variables specified in `L2.x`. Default is `NULL`.
`mrp.L2.x`	MRP context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` to be used by the MRP classifier. The character vector should be empty if no context-level variables should be used by the MRP classifier. If `NULL` and `mrp` is set to `TRUE`, then MRP uses the variables specified in `L2.x`. Default is `NULL`. Note: For the empty MrP model, set `L2.x = NULL` and `mrp.L2.x = ""`.
`gb.L2.unit`	GB L2.unit. A logical argument indicating whether `L2.unit` should be included in the GB classifier. Default is `TRUE`.
`gb.L2.reg`	GB L2.reg. A logical argument indicating whether `L2.reg` should be included in the GB classifier. Default is `FALSE`.
`svm.L2.unit`	SVM L2.unit. A logical argument indicating whether `L2.unit` should be included in the SVM classifier. Default is `TRUE`.
`svm.L2.reg`	SVM L2.reg. A logical argument indicating whether `L2.reg` should be included in the SVM classifier. Default is `FALSE`.
`deep.L2.x`	Deep MRP context-level covariates. A character vector containing the column names of the context-level variables in `survey` and `census` to be used by the deep MRP classifier. If `NULL` and `deep.mrp` is set to `TRUE`, then deep MRP uses the variables specified in `L2.x`. Default is `NULL`.
`deep.L2.reg`	Deep MRP L2.reg. A logical argument indicating whether `L2.reg` should be included in the deep MRP classifier. Default is `TRUE`.
`deep.splines`	Deep MRP splines. A logical argument indicating whether splines should be used in the deep MRP classifier. Default is `TRUE`.
`lasso.lambda`	Lasso penalty parameter. A numeric `vector` of non-negative values. The penalty parameter controls the shrinkage of the context-level variables in the lasso model. Default is a sequence with minimum 0.1 and maximum 250 that is equally spaced on the log-scale. The number of values is controlled by the `lasso.n.iter` parameter.
`lasso.n.iter`	Lasso number of lambda values. An integer-valued scalar specifying the number of lambda values to search over. Default is `100`. Note: Is ignored if a vector of `lasso.lambda` values is provided.
`gb.interaction.depth`	GB interaction depth. An integer-valued vector whose values specify the interaction depth of GB. The interaction depth defines the maximum depth of each tree grown (i.e., the maximum level of variable interactions). Default is `c(1, 2, 3)`.
`gb.shrinkage`	GB learning rate. A numeric vector whose values specify the learning rate or step-size reduction of GB. Values between `0.001` and `0.1` usually work, but a smaller learning rate typically requires more trees. Default is `c(0.04, 0.01, 0.008, 0.005, 0.001)`.
`gb.n.trees.init`	GB initial total number of trees. An integer-valued scalar specifying the initial number of total trees to fit by GB. Default is `50`.
`gb.n.trees.increase`	GB increase in total number of trees. An integer-valued scalar specifying by how many trees the total number of trees to fit should be increased (until `gb.n.trees.max` is reached). Default is `50`.
`gb.n.trees.max`	GB maximum number of trees. An integer-valued scalar specifying the maximum number of trees to fit by GB. Default is `1000`.
`gb.n.minobsinnode`	GB minimum number of observations in the terminal nodes. An integer-valued scalar specifying the minimum number of observations that each terminal node of the trees must contain. Default is `20`.
`svm.kernel`	SVM kernel. A character-valued scalar specifying the kernel to be used by SVM. The possible values are `linear`, `polynomial`, `radial`, and `sigmoid`. Default is `radial`.
`svm.gamma`	SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log-scale.
`svm.cost`	SVM cost parameter. A numeric vector whose values specify the cost of constraints violation in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log-scale.
`ebma.n.draws`	EBMA number of samples. An integer-valued scalar specifying the number of bootstrapped samples to be drawn from the EBMA fold and used for tuning EBMA. Default is `100`.
`ebma.tol`	EBMA tolerance. A numeric vector containing the tolerance values for improvements in the log-likelihood before the EM algorithm stops optimization. Values should range at least from `0.01` to `0.001`. Default is `c(0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001)`.
`verbose`	Verbose output. A logical argument indicating whether or not verbose output should be printed. Default is `FALSE`.
`uncertainty`	Uncertainty estimates. A logical argument indicating whether uncertainty estimates should be computed. Default is `FALSE`.
`boot.iter`	Number of bootstrap iterations. An integer argument indicating the number of bootstrap iterations to be computed. Will be ignored unless `uncertainty = TRUE`. Default is `200` if `uncertainty = TRUE` and `NULL` if `uncertainty = FALSE`.
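
The rank-sum rule described under `loss.unit` and `loss.fun` can be illustrated with a minimal base R sketch. The loss values, the column names, and the choice of `ties.method = "min"` are assumptions made for illustration only; they are not taken from the package internals.

```r
# Hypothetical cross-validation losses: one row per candidate parameter
# combination (in search-grid order), one column per loss function.
losses <- data.frame(
  MSE           = c(0.210, 0.205, 0.205, 0.220),
  cross_entropy = c(0.640, 0.610, 0.615, 0.650),
  msfe          = c(0.120, 0.110, 0.105, 0.130)
)

# Rank the candidates within each loss function (1 = lowest loss).
# Assumption: tied losses within a column share the smallest rank.
ranks <- sapply(losses, rank, ties.method = "min")

# Sum the ranks across loss functions.
rank_sum <- rowSums(ranks)

# which.min() returns the first minimum, so candidates with equal rank
# sums are resolved in favour of the one earliest in the search grid.
best <- which.min(rank_sum)
```

In this sketch the second and third candidates have the same rank sum, and the second one is selected because it comes first in the grid.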

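The default tuning grids described for `lasso.lambda`, `svm.gamma`, `svm.cost`, and the GB tree arguments can be approximated in base R as follows. The helper `log_seq()` is a hypothetical name introduced here; the sketch follows the verbal description of the defaults and may not match the package's internal construction exactly.

```r
# Hypothetical helper: n values between lo and hi, equally spaced on the
# log scale.
log_seq <- function(lo, hi, n) exp(seq(log(lo), log(hi), length.out = n))

lasso_lambda <- log_seq(0.1, 250, 100)   # length mirrors lasso.n.iter = 100
svm_gamma    <- log_seq(1e-5, 1e-1, 20)
svm_cost     <- log_seq(0.5, 10, 5)

# Tree counts implied by the GB defaults: start at gb.n.trees.init and add
# gb.n.trees.increase until gb.n.trees.max is reached.
gb_n_trees <- seq(from = 50, to = 1000, by = 50)
```

Vectors built this way can be passed to the corresponding arguments to widen or narrow the search.
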
Details

Bootstrapping resamples the level-two units, a procedure sometimes referred to as the cluster bootstrap. For the multilevel model (for example, when running MrP only), the bootstrapped median level-two predictions will differ from the level-two predictions obtained without bootstrapping. We recommend assessing the difference by running autoMrP both with and without bootstrapping and then comparing the level-two predictions from the run without bootstrapping to the median level-two predictions from the run with bootstrapping.
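
As a rough illustration of what a cluster bootstrap draw looks like, the sketch below resamples whole geographic units with replacement. The data, the column names, and the function `cluster_draw()` are hypothetical, and the package's own resampling may differ in detail.

```r
set.seed(1)

# Hypothetical survey: 5 states with varying numbers of respondents.
survey <- data.frame(
  state = rep(c("A", "B", "C", "D", "E"), times = c(4, 3, 5, 2, 6)),
  y     = rbinom(20, size = 1, prob = 0.5)
)

# One cluster-bootstrap draw: sample states with replacement and keep
# every respondent from each sampled state. A state drawn twice
# contributes all of its respondents twice.
cluster_draw <- function(data, unit) {
  units   <- unique(data[[unit]])
  sampled <- sample(units, size = length(units), replace = TRUE)
  pieces  <- lapply(sampled, function(u) data[data[[unit]] == u, , drop = FALSE])
  do.call(rbind, pieces)
}

boot_sample <- cluster_draw(survey, "state")
```

Because entire states are drawn, some states are absent from any given draw, which is one reason the median over draws need not equal the estimate from the full sample.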

To ensure reproducibility of the results, use the `set.seed()` function to specify a seed.

Value

The context-level predictions. A list with two elements. The first element, EBMA, contains the post-stratified ensemble Bayesian model averaging (EBMA) predictions. The second element, classifiers, contains the post-stratified predictions from all estimated classifiers.
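
For readers unfamiliar with the term, post-stratification can be sketched in a few lines of base R: each ideal-type prediction is weighted by the share of that ideal type in its geographic unit, and the weighted predictions are summed within the unit. The census rows, the column names `n`, `pred`, and `weighted`, and the prediction values are invented for illustration.

```r
# Hypothetical census: one row per ideal type within each state, with a
# bin size (count) and a model prediction for that ideal type.
census <- data.frame(
  state = c("A", "A", "A", "B", "B"),
  n     = c(200, 300, 500, 100, 300),
  pred  = c(0.20, 0.50, 0.40, 0.60, 0.20)
)

# Convert bin sizes to within-state proportions (the quantity that a
# bin.proportion column would hold).
census$proportion <- census$n / ave(census$n, census$state, FUN = sum)

# Post-stratify: weight each prediction by its proportion and sum by state.
census$weighted <- census$pred * census$proportion
state_pred <- aggregate(weighted ~ state, data = census, FUN = sum)
```

Here state A receives 0.39 and state B receives 0.30, each a population-weighted average of its ideal-type predictions.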

Examples

# An MrP model without machine learning
set.seed(123)
m <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1"),
  L2.x = c("L2.x1", "L2.x2"),
  L2.unit = "state",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  ebma.size = 0,
  cores = 2,
  best.subset = FALSE,
  lasso = FALSE,
  pca = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)

# summarize and plot results
summary(m)
plot(m)

# An MrP model without context-level predictors
m <- auto_MrP(
  y = "YES",
  L1.x = "L1x1",
  L2.x = NULL,
  mrp.L2.x = "",
  L2.unit = "state",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  ebma.size = 0,
  cores = 1,
  best.subset = FALSE,
  lasso = FALSE,
  pca = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)


# Predictions with machine learning

# detect number of available cores
max_cores <- parallel::detectCores()

# autoMrP with machine learning
ml_out <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
  L2.unit = "state",
  L2.reg = "region",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  gb.L2.reg = TRUE,
  svm.L2.reg = TRUE,
  cores = min(2, max_cores)
)
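
The `folds` argument expects a column with integers from `1` to `k + 1`, where `k + 1` marks the EBMA fold. One possible way to build such a column is sketched below. The data frame, the column name `fold`, and the sampling of individual respondents (rather than geographic units) are assumptions for illustration.

```r
set.seed(123)

k          <- 5L    # number of cross-validation folds (cf. k.folds)
ebma_share <- 1/3   # share of respondents in the EBMA fold (cf. ebma.size)

# Hypothetical survey with 300 respondents.
survey <- data.frame(id = seq_len(300))
n <- nrow(survey)

# Draw the EBMA fold first and label it k + 1.
ebma_rows <- sample(n, size = round(ebma_share * n))
survey$fold <- NA_integer_
survey$fold[ebma_rows] <- k + 1L

# Spread the remaining respondents evenly over folds 1..k in random order.
cv_rows <- setdiff(seq_len(n), ebma_rows)
survey$fold[cv_rows] <- sample(rep_len(seq_len(k), length(cv_rows)))
```

The resulting column name would then be supplied as `folds = "fold"`, in which case `ebma.size`, `k.folds`, and `cv.sampling` are ignored.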

[Package autoMrP version 1.0.6 Index]