auto_MrP {autoMrP}  R Documentation 
This package improves the prediction performance of multilevel regression with poststratification (MrP) by combining a number of machine learning methods through ensemble Bayesian model averaging (EBMA).
auto_MrP(
  y,
  L1.x,
  L2.x,
  L2.unit,
  L2.reg = NULL,
  L2.x.scale = TRUE,
  pcs = NULL,
  folds = NULL,
  bin.proportion = NULL,
  bin.size = NULL,
  survey,
  census,
  ebma.size = 1/3,
  cores = 1,
  k.folds = 5,
  cv.sampling = "L2 units",
  loss.unit = c("individuals", "L2 units"),
  loss.fun = c("msfe", "cross-entropy", "f1", "MSE"),
  best.subset = TRUE,
  lasso = TRUE,
  pca = TRUE,
  gb = TRUE,
  svm = TRUE,
  mrp = FALSE,
  oversampling = FALSE,
  forward.select = FALSE,
  best.subset.L2.x = NULL,
  lasso.L2.x = NULL,
  pca.L2.x = NULL,
  gb.L2.x = NULL,
  svm.L2.x = NULL,
  mrp.L2.x = NULL,
  gb.L2.unit = TRUE,
  gb.L2.reg = FALSE,
  svm.L2.unit = TRUE,
  svm.L2.reg = FALSE,
  lasso.lambda = NULL,
  lasso.n.iter = 100,
  gb.interaction.depth = c(1, 2, 3),
  gb.shrinkage = c(0.04, 0.01, 0.008, 0.005, 0.001),
  gb.n.trees.init = 50,
  gb.n.trees.increase = 50,
  gb.n.trees.max = 1000,
  gb.n.minobsinnode = 20,
  svm.kernel = c("radial"),
  svm.gamma = NULL,
  svm.cost = NULL,
  ebma.n.draws = 100,
  ebma.tol = c(0.01, 0.005, 0.001, 5e-04, 1e-04, 5e-05, 1e-05),
  seed = NULL,
  verbose = FALSE,
  uncertainty = FALSE,
  boot.iter = NULL
)
y 
Outcome variable. A character scalar containing the column name of
the outcome variable in 
L1.x 
Individual-level covariates. A character vector containing the
column names of the individual-level variables in 
L2.x 
Context-level covariates. A character vector containing the
column names of the context-level variables in 
L2.unit 
Geographic unit. A character scalar containing the column
name of the geographic unit in 
L2.reg 
Geographic region. A character scalar containing the column
name of the geographic region in 
L2.x.scale 
Scale context-level covariates. A logical argument
indicating whether the context-level covariates should be normalized.
Default is 
pcs 
Principal components. A character vector containing the column
names of the principal components of the context-level variables in

folds 
EBMA and cross-validation folds. A character scalar containing
the column name of the variable in 
bin.proportion 
Proportion of ideal types. A character scalar
containing the column name of the variable in 
bin.size 
Bin size of ideal types. A character scalar containing the
column name of the variable in 
survey 
Survey data. A 
census 
Census data. A 
ebma.size 
EBMA fold size. A number in the open unit interval
indicating the proportion of respondents to be allocated to the EBMA fold.
Default is 1/3. Note: ignored if 
cores 
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. 
k.folds 
Number of cross-validation folds. An integer-valued scalar
indicating the number of folds to be used in cross-validation. Default is
5. Note: ignored if 
cv.sampling 
Cross-validation sampling method. A character-valued
scalar indicating whether cross-validation folds should be created by
sampling individual respondents ( 
loss.unit 
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( 
loss.fun 
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( 
best.subset 
Best subset classifier. A logical argument indicating
whether the best subset classifier should be used for predicting outcome

lasso 
Lasso classifier. A logical argument indicating whether the
lasso classifier should be used for predicting outcome 
pca 
PCA classifier. A logical argument indicating whether the PCA
classifier should be used for predicting outcome 
gb 
GB classifier. A logical argument indicating whether the GB
classifier should be used for predicting outcome 
svm 
SVM classifier. A logical argument indicating whether the SVM
classifier should be used for predicting outcome 
mrp 
MRP classifier. A logical argument indicating whether the standard
MRP classifier should be used for predicting outcome 
oversampling 
Oversample to create balance on the dependent variable.
A logical argument. Default is 
forward.select 
Forward selection classifier. A logical argument
indicating whether to use forward selection rather than best subset
selection. Default is 
best.subset.L2.x 
Best subset context-level covariates. A character
vector containing the column names of the context-level variables in

lasso.L2.x 
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in

pca.L2.x 
PCA context-level covariates. A character vector containing
the column names of the context-level variables in 
gb.L2.x 
GB context-level covariates. A character vector containing the
column names of the context-level variables in 
svm.L2.x 
SVM context-level covariates. A character vector containing
the column names of the context-level variables in 
mrp.L2.x 
MRP context-level covariates. A character vector containing
the column names of the context-level variables in 
gb.L2.unit 
GB L2.unit. A logical argument indicating whether

gb.L2.reg 
GB L2.reg. A logical argument indicating whether

svm.L2.unit 
SVM L2.unit. A logical argument indicating whether

svm.L2.reg 
SVM L2.reg. A logical argument indicating whether

lasso.lambda 
Lasso penalty parameter. A numeric 
lasso.n.iter 
Lasso number of lambda values. An integer-valued scalar
specifying the number of lambda values to search over. Default is 100.
Note: ignored if a vector of 
gb.interaction.depth 
GB interaction depth. An integer-valued vector
whose values specify the interaction depth of GB. The interaction depth
defines the maximum depth of each tree grown (i.e., the maximum level of
variable interactions). Default is 
gb.shrinkage 
GB learning rate. A numeric vector whose values specify
the learning rate or step-size reduction of GB. Values between 0.001
and 0.1 usually work, but a smaller learning rate typically requires
more trees. Default is 
gb.n.trees.init 
GB initial total number of trees. An integer-valued scalar specifying the initial number of total trees to fit by GB. Default is 50. 
gb.n.trees.increase 
GB increase in total number of trees. An
integer-valued scalar specifying by how many trees the total number of
trees to fit should be increased (until 
gb.n.trees.max 
GB maximum number of trees. An integer-valued scalar specifying the maximum number of trees to fit by GB. Default is 1000. 
gb.n.minobsinnode 
GB minimum number of observations in the terminal nodes. An integer-valued scalar specifying the minimum number of observations that each terminal node of the trees must contain. Default is 20. 
svm.kernel 
SVM kernel. A character-valued scalar specifying the kernel
to be used by SVM. The possible values are 
svm.gamma 
SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log scale. 
svm.cost 
SVM cost parameter. A numeric vector whose values specify the cost of constraint violations in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log scale. 
ebma.n.draws 
EBMA number of samples. An integer-valued scalar specifying the number of bootstrapped samples to be drawn from the EBMA fold and used for tuning EBMA. Default is 100. 
ebma.tol 
EBMA tolerance. A numeric vector containing the
tolerance values for improvements in the log-likelihood before the EM
algorithm stops optimization. Values should range at least from 0.01
to 0.001. Default is

seed 
Seed. Either 
verbose 
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is 
uncertainty 
Uncertainty estimates. A logical argument indicating
whether uncertainty estimates should be computed. Default is 
boot.iter 
Number of bootstrap iterations. An integer argument
indicating the number of bootstrap iterations to be computed. Will be
ignored unless 
The context-level predictions. A list with two elements. The first
element, EBMA, contains the poststratified ensemble Bayesian model
averaging (EBMA) predictions. The second element, classifiers,
contains the poststratified predictions from all estimated classifiers.
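The fold-related arguments (ebma.size, k.folds, cv.sampling) can be pictured with a small base R sketch. It only illustrates the idea of holding out an EBMA fold and then assigning whole geographic units to cross-validation folds; it is not the package's internal code. The toy data frame, the "state" column, and all object names are hypothetical.

```r
# Conceptual sketch only: toy data and names are hypothetical,
# and this is not how autoMrP is guaranteed to build its folds.
set.seed(1)
toy_survey <- data.frame(
  id = seq_len(300),
  state = sample(paste0("S", 1:30), size = 300, replace = TRUE),
  stringsAsFactors = FALSE
)

ebma_size <- 1 / 3  # share of respondents held out for EBMA tuning
k_folds <- 5        # number of cross-validation folds

# 1) Hold out a random share of respondents as the EBMA fold
n_ebma <- round(nrow(toy_survey) * ebma_size)
ebma_rows <- sample(nrow(toy_survey), size = n_ebma)
cv_rows <- setdiff(seq_len(nrow(toy_survey)), ebma_rows)
ebma_fold <- toy_survey[ebma_rows, ]
cv_data <- toy_survey[cv_rows, ]

# 2) Assign whole L2 units (states) to folds, so that all respondents
#    from one state end up in the same fold
units <- unique(cv_data$state)
unit_fold <- sample(rep(seq_len(k_folds), length.out = length(units)))
names(unit_fold) <- units
cv_data$fold <- unname(unit_fold[as.character(cv_data$state)])

# Number of respondents per fold
table(cv_data$fold)
```

Sampling by unit rather than by respondent means that each fold is evaluated on states the model has not seen during training, which is closer to the prediction task MrP faces for sparsely sampled units.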
library(autoMrP)

# Detect the number of available cores (used by the calls below)
max_cores <- parallel::detectCores()

# Minimal example without machine learning
m <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1"),
  L2.x = c("L2.x1", "L2.x2"),
  L2.unit = "state",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  ebma.size = 0,
  cores = max_cores,
  best.subset = FALSE,
  lasso = FALSE,
  pca = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)

# Summarize and plot results
summary(m)
plot(m)

# MrP model only
mrp_out <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
  L2.unit = "state",
  L2.reg = "region",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  ebma.size = 0,
  best.subset = FALSE,
  lasso = FALSE,
  pca = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)

# Predictions through machine learning
ml_out <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
  L2.unit = "state",
  L2.reg = "region",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  gb.L2.reg = TRUE,
  svm.L2.reg = TRUE,
  cores = max_cores
)
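
The svm.gamma and svm.cost arguments accept numeric vectors, and their documented defaults are sequences equally spaced on the log scale. The following base R sketch shows one way such grids could be built by hand, reading the documented bounds as 1e-5 to 1e-1 (20 values) for gamma and 0.5 to 10 (5 values) for cost. The object names are hypothetical, and the package may construct its defaults differently.

```r
# Sketch under stated assumptions: hand-built log-spaced tuning grids.
# gamma_grid and cost_grid are hypothetical names, not package objects.

# 20 values from 1e-5 to 1e-1, equally spaced on the log scale
gamma_grid <- exp(seq(log(1e-5), log(1e-1), length.out = 20))

# 5 values from 0.5 to 10, equally spaced on the log scale
cost_grid <- exp(seq(log(0.5), log(10), length.out = 5))

# Consecutive values differ by a constant ratio, not a constant step
gamma_grid[2] / gamma_grid[1]
cost_grid[2] / cost_grid[1]
```

A narrower or coarser grid built the same way could then be supplied as svm.gamma = gamma_grid and svm.cost = cost_grid in a call to auto_MrP(), trading tuning precision for run time.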