R: Nested error regression Model under transformations

NER_Trafo {saeTrafo}

R Documentation

Nested error regression Model under transformations

Description

Function NER_Trafo estimates small area means based on the (transformed) nested error regression (NER) model (Battese et al., 1988). In contrast to the empirical best predictor of Molina and Rao (2010), which is implemented in the package emdi (ebp), no unit-level population data are required.

NER_Trafo supports the log as well as the data-driven log-shift transformation. Especially for skewed variables, (data-driven) transformations are useful to meet the model assumptions for the error terms. If a transformation is chosen and aggregates (means and covariance) are simultaneously provided for the population, point estimates are produced by the method of Wuerz et al. (2022), which uses kernel density estimation to resolve the issue of not having access to population micro-data. In the case that population data are available at unit-level and the log or log-shift transformation is selected, the bias-correction of Berg and Chandra (2014) and Molina and Martín (2018) is applied. For this data situation, more methods and options are provided in the package emdi. If only population means are available and the log or log-shift transformation is selected, a bias-correction due to the transformation is added but for the lack of access to population data no correction is available. Therefore, a part of the bias is disregarded.

Additionally, analytically mean squared errors (MSE) are calculated in the case of no transformation following Prasad and Rao (1990). For the log and log-shift transformation, a parametric bootstrap procedure proposed by Wuerz et al. (2022) following Gonzalez-Manteiga et al. (2008) is applied. Please note that this can only be determined if covariance data are also provided. If population data is available on unit-level a bootstrap procedure as described in Molina and Martín (2018) is applied.

Usage

NER_Trafo(
  fixed,
  pop_area_size = NULL,
  pop_mean = NULL,
  pop_cov = NULL,
  pop_data = NULL,
  pop_domains = NULL,
  smp_data,
  smp_domains,
  threshold = 30,
  B = 50,
  transformation = "log.shift",
  interval = "default",
  MSE = FALSE,
  parallel_mode = ifelse(grepl("windows", .Platform$OS.type), "socket", "multicore"),
  cpus = 1,
  seed = 123
)

Arguments

`fixed`	a two-sided linear formula object describing the fixed-effects part of the nested error linear regression model with the dependent variable on the left of a ~ operator and the explanatory variables on the right, separated by + operators. The argument corresponds to the argument `fixed` in function `lme`.
`pop_area_size`	a named numeric vector containing the number of individuals within each domain. This numeric vector is named with the domain names.
`pop_mean`	a named list. Each element of the list contains the population means for the p covariates for a specicfic domain. The list is named with the respective domain name. The numeric vector within the list is named with the covariate names. The covariates right of the ~ operator in `fixed` need to comprise.
`pop_cov`	a named list. Each element of the list contains the domain-specific covariance matrice for p covariates for a specicfic domain. The list is named with the respective domain name. The matrix within the list has row and column names with the respective covariate names. The covariates right of the ~ operator in `fixed` need to comprise. If `pop_cov` is not available only a bias-correction due to the transformation is added but for the lack of access to population data a correction is not possible. Additionally, no MSE could be determined.
`pop_data`	a data frame that needs to comprise the variables named on the right of the ~ operator in `fixed`, i.e. the explanatory variables, and `pop_domains`. Please note, if population data is available other methods using unit-level population data, like `ebp`, could be applied.
`pop_domains`	a character string containing the name of a variable that indicates domains in the population data. The variable can be numeric or a factor but needs to be of the same class as the variable named in `smp_domains`. Only needed if population data are given.
`smp_data`	a data frame that needs to comprise all variables named in `fixed` and `smp_domains`.
`smp_domains`	a character string containing the name of a variable that indicates domains in the sample data. The variable can be numeric or a factor but needs to be of the same class as the variable named in `pop_domains`.
`threshold`	a numeric value indicating the threshold for using pooled domain data (for domains with sample sizes below the threshold) or non pooled domain data (for domains with sample sizes above the threshold) for the density estimation within the approach of Wuerz et al. (2022). Defaults to 30.
`B`	a number determining the number of bootstrap replications in the parametric bootstrap approach. The number must be greater than 1. Defaults to 50. For practical applications, values larger than 200 are recommended.
`transformation`	a character string. Three different transformation types for the dependent variable can be chosen (i) no transformation ("no"); (ii) log transformation ("log"); (iii) Log-Shift transformation ("log.shift"). Defaults to `"log.shift"`.
`interval`	a string equal to 'default' or a numeric vector containing a lower and upper limit determining an interval for the estimation of the optimal parameter for the log-shift transformation. The interval is passed to function `optimize` for the optimization. Defaults to an interval based on the range of y. If the convergence fails, it is often advisable to choose a smaller more suitable interval. For right skewed distributions, the negative values may be excluded.
`MSE`	optional logical. If `TRUE`, MSE estimates are provided. Defaults to `FALSE`.
`parallel_mode`	modus of parallelization, defaults to an automatic selection of a suitable mode, depending on the operating system, if the number of `cpus` is chosen higher than 1. For details, see `parallelStart`.
`cpus`	number determining the kernels that are used for the parallelization. Defaults to 1. For details, see `parallelStart`.
`seed`	an integer to set the seed for the random number generator. For the usage of random number generation, see Details. If seed is set to `NULL`, seed is chosen randomly. Defaults to `123`.

Details

For the parametric bootstrap and the density estimation approach random number generation is used. Thus, a seed is set by the argument seed.

Value

An object of class "NER", "saeTrafo" that provides estimators for regional means optionally corresponding MSE estimates. Several generic functions have methods for the returned object. For a full list and descriptions of the components of objects of class "saeTrafo", see saeTrafoObject.

References

Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An Error-Components Model for Predictions of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association, Vol.83, No. 401, 28-36.

Berg, E. and Chandra, H. (2014). Small area prediction for a unit-level lognormal model. Computational Statistics & Data Analysis, Vol.78, 159–175.

González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D. and Santamaría, L. (2008). Analytic and bootstrap approximations of prediction errors under a multivariate Fay–Herriot model. Computational Statistics & Data Analysis, Vol. 52, No. 12, 5242-5252.

Molina, I. and Martín, N. (2018). Empirical best prediction under a nested error model with log transformation. The Annals of Statistics, Vol.46, No. 5, 1961–1993.

Molina, I. and Rao, J.N.K. (2010). Small area estimation of poverty indicators. The Canadian Journal of Statistics, Vol. 38, No.3, 369-385.

Prasad, N.N., Rao, J.N.K. (1990). The estimation of the mean squared error of small-area estimators. Journal of the American statistical association, Vol. 85, No. 409, 163-171.

Wuerz, N., Schmid, T., and Tzavidis, N. (2022) Estimating regional income indicators under transformations and access to limited population auxiliary information. Journal of the Royal Statistical Society: Series A (Statistics in Society), Vol. 185, No. 4, 1679-1706.

Examples


# Examples for (transformed) nested error regression model

# Load Data
data("eusilcA_pop")
data("eusilcA_smp")
data("pop_area_size")
data("pop_mean")
data("pop_cov")

# formula object for all examples
formula <- eqIncome ~ gender + eqsize + cash + self_empl + unempl_ben +
                      age_ben + surv_ben + sick_ben + dis_ben + rent +
                      fam_allow + house_allow + cap_inv + tax_adj

# For all four examples, no MSEs/variances are determined in order to avoid
# long run times. These can be obtained with MSE = TRUE.

# Example 1: No transformation - classical NER
NER_model_1 <- NER_Trafo(fixed = formula, transformation = "no",
                         smp_domains = "district", smp_data = eusilcA_smp,
                         pop_area_size = pop_area_size, pop_mean = pop_mean)

# Example 2: Log-shift transformation and population aggregates
# (means and covariances) with changed threshold
NER_model_2 <- NER_Trafo(fixed = formula,
                         smp_domains = "district", smp_data = eusilcA_smp,
                         pop_area_size = pop_area_size, pop_mean = pop_mean,
                         pop_cov = pop_cov, threshold = 50)

# Example 3: Log-shift transformation and population data
# A bias-corrections which need unit-level population data are applied
NER_model_3 <- NER_Trafo(fixed = formula,
                         smp_domains = "district", smp_data = eusilcA_smp,
                         pop_data = eusilcA_pop, pop_domains = "district")


# Example 4: Log-shift transformation and population aggregates
# (only means (!) - Therefore, no MSE estimation is available, bias is
# disregarded)
NER_model_4 <- NER_Trafo(fixed = formula,
                         smp_domains = "district", smp_data = eusilcA_smp,
                         pop_area_size = pop_area_size, pop_mean = pop_mean)

[Package saeTrafo version 1.0.4 Index]