ebp {povmap} | R Documentation |
Empirical Best Prediction for Disaggregated Indicators
Description
Function ebp
estimates indicators using the Empirical Best Prediction
approach by Molina and Rao (2010). Point predictions of indicators
are obtained by Monte-Carlo approximations. Additionally, mean squared error
(MSE) estimation can be conducted by using a parametric bootstrap approach
(see also Gonzalez-Manteiga et al. (2008)). The unit-level model of
Battese, Harter and Fuller (1988) is fitted by the restricted maximum
likelihood (REML) method and one of
five different transformation types for the dependent variable can be chosen.
This approach can be extended to data under informative sampling using
weights and is based on Guadarrama et al. (2018). Model estimation
combines the uni-level model of Battese, Harter and Fuller (1988) and
the approach of You and Rao (2002) using survey weights. At the
moment, only the log-transformation is supported for this method.
Usage
ebp(
fixed,
pop_data,
pop_domains,
smp_data,
smp_domains,
L = 50,
threshold = NULL,
transformation = "box.cox",
interval = "default",
MSE = FALSE,
B = 50,
seed = 123,
boot_type = "parametric",
parallel_mode = ifelse(grepl("windows", .Platform$OS.type), "socket", "multicore"),
cpus = 1,
custom_indicator = NULL,
na.rm = FALSE,
weights = NULL,
pop_weights = NULL,
aggregate_to = NULL,
weights_type = "Guadarrama",
benchmark = NULL,
benchmark_type = "ratio",
benchmark_level = NULL,
benchmark_weights = NULL,
nlme_maxiter = 1000,
nlme_tolerance = 1e-06,
nlme_opt = "nlminb",
nlme_optimmethod = "BFGS",
nlme_method = "REML",
nlme_mstol = 1e-07,
nlme_msmaxiter = 1000,
nlme_returnobject = FALSE,
rescale_weights = FALSE,
Ydump = NULL
)
Arguments
fixed |
a two-sided linear formula object describing the
fixed-effects part of the nested error linear regression model with the
dependent variable on the left of a ~ operator and the explanatory
variables on the right, separated by + operators. The argument corresponds
to the argument |
pop_data |
a data frame that needs to comprise the variables
named on the right of the ~ operator in |
pop_domains |
a character string containing the name of a variable that
indicates domains in the population data. The variable can be numeric or
a factor but needs to be of the same class as the variable named in
|
smp_data |
a data frame that needs to comprise all variables named in
|
smp_domains |
a character string containing the name of a variable
that indicates domains in the sample data. The variable can be numeric or a
factor but needs to be of the same class as the variable named in
|
L |
a number determining the number of Monte-Carlo simulations that must be at least 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)). |
threshold |
a number defining a threshold. Alternatively, a threshold
may be defined as a |
transformation |
a character string. Five different transformation
types for the dependent variable can be chosen (i) no transformation ("no");
(ii) log transformation ("log"); (iii) Box-Cox transformation ("box.cox");
(iv) Dual transformation ("dual"); (v) Log-Shift transformation
("log.shift"); (Vi) rank-order transformation ("ordernorm"). Defaults to
|
interval |
a string equal to 'default' or a numeric vector containing a
lower and upper limit determining an interval for the estimation of the
optimal parameter. The interval is passed to function
|
MSE |
if |
B |
a number determining the number of bootstrap populations in the parametric bootstrap approach (see also Gonzalez-Manteiga et al. (2008)) used in the MSE estimation. The number must be greater than 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)). |
seed |
an integer to set the seed for the random number generator. For
the usage of random number generation, see Details. If seed is set to
|
boot_type |
character string to choose between different MSE estimation
procedures,currently a |
parallel_mode |
modus of parallelization, defaults to an automatic
selection of a suitable mode, depending on the operating system, if the
number of |
cpus |
number determining the kernels that are used for the
parallelization. Defaults to 1. For details, see
|
custom_indicator |
a list of functions containing the indicators to be
calculated additionally. Such functions must depend on the target variable
|
na.rm |
if |
weights |
a character string containing the name of a variable that
indicates weights in the sample data. If a character string is provided
a weighted version of the ebp will be used. The variable has to be numeric.
Defaults to |
pop_weights |
a character string containing the name of a variable that
indicates population weights in the populatation data. If a character string
is provided weighted indicators are estimated using population weights.
The variable has to be numeric. Defaults to |
aggregate_to |
a character string containing the name of a variable from
population data that indicates the target domain level for which the
results are to be displayed. The variable can be numeric or a factor.
Defaults to |
weights_type |
a character string. Two different methods for survey
weights are available (i) EBP under informative sampling from
Guadarrama et al. (2018) ("Guadarrama"); (ii) considering survey
weights by using the weighting options of |
benchmark |
The input depends on the type of benchmarking to be
performed.
(i) Benchmarking with a fixed value:
(a) with one value for each indicator: a named vector containing the numeric
benchmark value(s). The names of the vector matchs to the chosen indicators.
Benchmarking is available for |
benchmark_type |
a character indicating the type of benchmarking. Types
that can be chosen (i) Raking (" |
benchmark_level |
a character indicating the level at which the benchmarking is performed. This name must be represented in the sample and population data as variable name. |
benchmark_weights |
the name of variable containing benchmark weights. This is only possible for internal benchmarking and enable users to benchmark with weights differing from the survey weights (Default for weighting for internal benchmarking). |
nlme_maxiter |
an integer indicating the maximum number of iterations
the |
nlme_tolerance |
a real number indicating the tolerance criterion for the
the |
nlme_opt |
a string indicating the optimizer to be used by the |
nlme_optimmethod |
a string indicating the optimization method to be used
with the optim optimizer the |
nlme_method |
a string indicating the method to be used by the |
nlme_mstol |
a real number indicating the tolerance criterion for the
the optimization step of the |
nlme_msmaxiter |
an integer indicating the maximum number of iterations
for the optimization step of the |
nlme_returnobject |
a logical indicating whether the fitted object should be returned with a warning (instead of an error via stop()) when the maximum number of iterations is reached without convergence of the algorithm. Defaults to FALSE |
rescale_weights |
a logical indicating if the sample weights are scaled.
If |
Ydump |
a string specifying the name of a .csv file to save all simulated values of the dependent value, model predictions, and error terms used for point estimation. |
Details
For Monte-Carlo approximations and in the parametric bootstrap
approach random number generation is used. Thus, a seed is set by the
argument seed
.
The set of predefined indicators includes the mean, median, four further
quantiles (10%, 25%, 75% and 90%), head count ratio, poverty gap, Gini
coefficient and the quintile share ratio.
Since the sample observations often cannot be identified in practical
applications, a modified approach by Guadarrama et al. (2016) called census
EBP is implemented for the point estimation. For the MSE estimation, the
bootstrap sample is not extracted from the superpopulation, but generated by
the estimated model parameters. The lower the ratio between the sample and
the population size, the closer are the results to the proposed approach by
Molina and Rao (2010).
Value
An object of class "ebp", "emdi" that provides estimators for
regional disaggregated indicators and optionally corresponding MSE estimates.
Several generic functions have methods for the returned object. For a full
list and descriptions of the components of objects of class "emdi",
see emdiObject
.
References
Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An Error-Components
Model for Predictions of County Crop Areas Using Survey and Satellite Data.
Journal of the American Statistical Association, Vol.83, No. 401,
28-36.
Gonzalez-Manteiga, W. et al. (2008). Bootstrap mean squared error of
a small-area EBLUP. Journal of Statistical Computation and Simulation,
78:5, 443-462.
Guadarrama, M., Molina, I. and Rao, J.N.K. (2016). A comparison of small area
estimation methods for poverty mapping. Joint Issue: Statistics in Transition
New Series Survey Methodology, Vol.17, No. 1, 41–66.
Guadarrama, M., Molina, I. and Rao, J.N.K. (2018). Small area estimation of
general parameters under complex sampling designs. Computational Statistics &
Data Analysis, Vol. 121, 20-40.
Kreutzmann, A., Pannier, S., Rojas-Perilla, N., Schmid, T., Templ, M.
and Tzavidis, N. (2019). The R Package emdi for Estimating and
Mapping Regionally Disaggregated Indicators, Journal of Statistical Software,
Vol. 91, No. 7, 1–33, <doi:10.18637/jss.v091.i07>
Molina, I. and Rao, J.N.K. (2010). Small area estimation of poverty
indicators. The Canadian Journal of Statistics, Vol. 38, No.3,
369-385.
Social Protection Committee (2001). Report on indicators in the field of
poverty and social exclusions, Technical Report, European Union.
You, Y., Rao, J.N.K. (2002). A pseudo-empirical best linear unbiased
prediction approach to small area estimation using survey weights. The
Canadian Journal of Statistics. Vol. 30, No. 3, 431–439.
See Also
emdiObject
, lme
,
estimators.emdi
, plot.emdi
,
emdi_summaries
Examples
# Loading data - population and sample data
data("eusilcA_pop")
data("eusilcA_smp")
# Example 1: With default setting but na.rm=TRUE
emdi_model <- ebp(
fixed = eqIncome ~ gender + eqsize + cash + self_empl +
unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
na.rm = TRUE
)
# Example 2: With MSE, two additional indicators and function as threshold -
# Please note that the example runs for several minutes. For a short check
# change L and B to lower values.
emdi_model <- ebp(
fixed = eqIncome ~ gender + eqsize + cash +
self_empl + unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent +
fam_allow + house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
threshold = function(y) {
0.6 * median(y)
}, transformation = "log",
L = 50, MSE = TRUE, boot_type = "wild", B = 50, custom_indicator =
list(
my_max = function(y) {
max(y)
},
my_min = function(y) {
min(y)
}
), na.rm = TRUE, cpus = 1, nlme_opt="optim"
)
# Example 3: With default setting but na.rm=TRUE under informative sampling.
emdi_model <- ebp(
fixed = eqIncome ~ gender + eqsize + cash + self_empl +
unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
weights = "weight", transformation = "log", na.rm = TRUE
)
# Example 4: With default setting and random effect on the district level
# while the output is at state level
emdi_model <- ebp(
fixed = eqIncome ~ gender + eqsize + cash + self_empl +
unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
na.rm = TRUE, aggregate_to = "state"
)
# Example 5: With default setting using pop_weights to get weighted
# indicators according to equivalized household size and an using an
# custom_indicator using pop_weights
emdi_model <- ebp(
fixed = eqIncome ~ gender + eqsize + cash + self_empl +
unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
custom_indicator =
list(HCR_singleHH = function(y, pop_weights, threshold) {
mean(y[pop_weights == 1] < threshold)
}
), na.rm = TRUE, pop_weights = "hhsize"
)