R: Empirical Best Prediction for Disaggregated Indicators

ebp {emdi}

R Documentation

Empirical Best Prediction for Disaggregated Indicators

Description

Function ebp estimates indicators using the Empirical Best Prediction approach by Molina and Rao (2010). Point predictions of indicators are obtained by Monte-Carlo approximations. Additionally, mean squared error (MSE) estimation can be conducted by using a parametric bootstrap approach (see also Gonzalez-Manteiga et al. (2008)). The unit-level model of Battese, Harter and Fuller (1988) is fitted by the restricted maximum likelihood (REML) method and one of five different transformation types for the dependent variable can be chosen. This approach can be extended to data under informative sampling using weights and is based on Guadarrama et al. (2018). Model estimation combines the uni-level model of Battese, Harter and Fuller (1988) and the approach of You and Rao (2002) using survey weights. At the moment, only the log-transformation is supported for this method.

Usage

ebp(
  fixed,
  pop_data,
  pop_domains,
  smp_data,
  smp_domains,
  L = 50,
  threshold = NULL,
  transformation = "box.cox",
  interval = "default",
  MSE = FALSE,
  B = 50,
  seed = 123,
  boot_type = "parametric",
  parallel_mode = ifelse(grepl("windows", .Platform$OS.type), "socket", "multicore"),
  cpus = 1,
  custom_indicator = NULL,
  na.rm = FALSE,
  weights = NULL,
  pop_weights = NULL,
  aggregate_to = NULL
)

Arguments

`fixed`	a two-sided linear formula object describing the fixed-effects part of the nested error linear regression model with the dependent variable on the left of a ~ operator and the explanatory variables on the right, separated by + operators. The argument corresponds to the argument `fixed` in function `lme`.
`pop_data`	a data frame that needs to comprise the variables named on the right of the ~ operator in `fixed`, i.e. the explanatory variables, and `pop_domains`.
`pop_domains`	a character string containing the name of a variable that indicates domains in the population data. The variable can be numeric or a factor but needs to be of the same class as the variable named in `smp_domains`.
`smp_data`	a data frame that needs to comprise all variables named in `fixed` and `smp_domains`.
`smp_domains`	a character string containing the name of a variable that indicates domains in the sample data. The variable can be numeric or a factor but needs to be of the same class as the variable named in `pop_domains`.
`L`	a number determining the number of Monte-Carlo simulations that must be at least 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)).
`threshold`	a number defining a threshold. Alternatively, a threshold may be defined as a `function` of `y` returning a numeric value. Such a function will be evaluated once for the point estimation and in each iteration of the parametric bootstrap. A threshold is needed for calculation e.g. of head count ratios and poverty gaps. The argument defaults to `NULL`. In this case, the threshold is set to 60% of the median of the variable that is selected as dependent variable similary to the at-risk-of-poverty rate used in the EU (see also Social Protection Committee 2001). However, any desired threshold can be chosen.
`transformation`	a character string. Five different transformation types for the dependent variable can be chosen (i) no transformation ("no"); (ii) log transformation ("log"); (iii) Box-Cox transformation ("box.cox"); (iv) Dual transformation ("dual"); (v) Log-Shift transformation ("log.shift"). Defaults to `"box.cox"`.
`interval`	a string equal to 'default' or a numeric vector containing a lower and upper limit determining an interval for the estimation of the optimal parameter. The interval is passed to function `optimize` for the optimization. Defaults to 'default' which equals c(-1,2) for Box-Cox, c(0,2) for Dual and an interval based on the range of y for Log-Shift transformation. If the convergence fails, it is often advisable to choose a smaller more suitable interval. For right skewed distributions, the negative values may be excluded, also values larger than 1 are seldom observed.
`MSE`	if `TRUE`, MSE estimates using a parametric bootstrap approach are calculated (see also Gonzalez-Manteiga et al. (2008)). Defaults to `FALSE`.
`B`	a number determining the number of bootstrap populations in the parametric bootstrap approach (see also Gonzalez-Manteiga et al. (2008)) used in the MSE estimation. The number must be greater than 1. Defaults to 50. For practical applications, values larger than 200 are recommended (see also Molina, I. and Rao, J.N.K. (2010)).
`seed`	an integer to set the seed for the random number generator. For the usage of random number generation, see Details. If seed is set to `NULL`, seed is chosen randomly. Defaults to `123`.
`boot_type`	character string to choose between different MSE estimation procedures,currently a `"parametric"` and a semi-parametric `"wild"` bootstrap are possible. Defaults to `"parametric"`.
`parallel_mode`	modus of parallelization, defaults to an automatic selection of a suitable mode, depending on the operating system, if the number of `cpus` is chosen higher than 1. For details, see `parallelStart`.
`cpus`	number determining the kernels that are used for the parallelization. Defaults to 1. For details, see `parallelStart`.
`custom_indicator`	a list of functions containing the indicators to be calculated additionally. Such functions must depend on the target variable `y`, and optional can depend on `pop_weights` and the `threshold`. Defaults to `NULL`.
`na.rm`	if `TRUE`, observations with `NA` values are deleted from the population and sample data. For the EBP procedure complete observations are required. Defaults to `FALSE`.
`weights`	a character string containing the name of a variable that indicates weights in the sample data. If a character string is provided a weighted version of the ebp will be used. The variable has to be numeric. Defaults to `NULL`.
`pop_weights`	a character string containing the name of a variable that indicates population weights in the populatation data. If a character string is provided weighted indicators are estimated using population weights. The variable has to be numeric. Defaults to `NULL`.
`aggregate_to`	a character string containing the name of a variable from population data that indicates the target domain level for which the results are to be displayed. The variable can be numeric or a factor. Defaults to `NULL`.

Details

For Monte-Carlo approximations and in the parametric bootstrap approach random number generation is used. Thus, a seed is set by the argument seed.

The set of predefined indicators includes the mean, median, four further quantiles (10%, 25%, 75% and 90%), head count ratio, poverty gap, Gini coefficient and the quintile share ratio.

Since the sample observations often cannot be identified in practical applications, a modified approach by Guadarrama et al. (2016) called census EBP is implemented for the point estimation. For the MSE estimation, the bootstrap sample is not extracted from the superpopulation, but generated by the estimated model parameters. The lower the ratio between the sample and the population size, the closer are the results to the proposed approach by Molina and Rao (2010).

Value

An object of class "ebp", "emdi" that provides estimators for regional disaggregated indicators and optionally corresponding MSE estimates. Several generic functions have methods for the returned object. For a full list and descriptions of the components of objects of class "emdi", see emdiObject.

References

Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An Error-Components Model for Predictions of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association, Vol.83, No. 401, 28-36.

Gonzalez-Manteiga, W. et al. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78:5, 443-462.

Guadarrama, M., Molina, I. and Rao, J.N.K. (2016). A comparison of small area estimation methods for poverty mapping. Joint Issue: Statistics in Transition New Series Survey Methodology, Vol.17, No. 1, 41–66.

Guadarrama, M., Molina, I. and Rao, J.N.K. (2018). Small area estimation of general parameters under complex sampling designs. Computational Statistics & Data Analysis, Vol. 121, 20-40.

Kreutzmann, A., Pannier, S., Rojas-Perilla, N., Schmid, T., Templ, M. and Tzavidis, N. (2019). The R Package emdi for Estimating and Mapping Regionally Disaggregated Indicators, Journal of Statistical Software, Vol. 91, No. 7, 1–33, <doi:10.18637/jss.v091.i07>

Molina, I. and Rao, J.N.K. (2010). Small area estimation of poverty indicators. The Canadian Journal of Statistics, Vol. 38, No.3, 369-385.

Social Protection Committee (2001). Report on indicators in the field of poverty and social exclusions, Technical Report, European Union. You, Y., Rao, J.N.K. (2002). A pseudo-empirical best linear unbiased prediction approach to small area estimation using survey weights. The Canadian Journal of Statistics. Vol. 30, No. 3, 431–439.

Examples


# Loading data - population and sample data
data("eusilcA_pop")
data("eusilcA_smp")

# Example 1: With default setting but na.rm=TRUE
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  na.rm = TRUE
)


# Example 2: With MSE, two additional indicators and function as threshold -
# Please note that the example runs for several minutes. For a short check
# change L and B to lower values.
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash +
    self_empl + unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent +
    fam_allow + house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  threshold = function(y) {
    0.6 * median(y)
  }, transformation = "log",
  L = 50, MSE = TRUE, boot_type = "wild", B = 50, custom_indicator =
    list(
      my_max = function(y) {
        max(y)
      },
      my_min = function(y) {
        min(y)
      }
    ), na.rm = TRUE, cpus = 1
)

# Example 3: With default setting but na.rm=TRUE under informative sampling.
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  weights = "weight", transformation = "log", na.rm = TRUE
)

# Example 4: With default setting and random effect on the district level
# while the output is at state level
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  na.rm = TRUE, aggregate_to = "state"
)

# Example 5: With default setting using pop_weights to get weighted
# indicators according to equivalized household size and an using an
# custom_indicator using pop_weights
emdi_model <- ebp(
  fixed = eqIncome ~ gender + eqsize + cash + self_empl +
    unempl_ben + age_ben + surv_ben + sick_ben + dis_ben + rent + fam_allow +
    house_allow + cap_inv + tax_adj, pop_data = eusilcA_pop,
  pop_domains = "district", smp_data = eusilcA_smp, smp_domains = "district",
  custom_indicator =
    list(HCR_singleHH = function(y, pop_weights, threshold) {
                              mean(y[pop_weights == 1] < threshold)
                        }
    ), na.rm = TRUE, pop_weights = "eqsize"
)

[Package emdi version 2.2.2 Index]