fcompareCI {giniVarCI}    R Documentation

Comparisons of variance estimates and confidence intervals for the Gini index in finite populations

Description

Compares variance estimates and confidence intervals for the Gini index in finite populations.

Usage

fcompareCI(
  y,
  w,
  Pi = NULL,
  Pij = NULL,
  PiU,
  alpha = 0.05,
  B = 1000L,
  digitsgini = 2L,
  digitsvar = 4L,
  na.rm = TRUE,
  plotCI = TRUE,
  line.types = c(1L, 2L, 4L),
  colors = c("red", "green", "blue"),
  shapes = c(8L, 4L, 3L),
  save.plot = FALSE,
  large.sample = FALSE)

Arguments

y

A vector with the non-negative real numbers to be used for estimating the Gini index.

w

A numeric vector with the survey weights to be used for estimating the Gini index, the variance estimation and the confidence interval. This argument can be missing if argument Pi is provided.

Pi

A numeric vector with the (sample) first inclusion probabilities to be used for estimating the Gini index, the variance estimation and the confidence interval. This argument can be NULL if argument w is provided. The default value is Pi = NULL.

Pij

A numeric square matrix with the (sample) second (joint) inclusion probabilities to be used for the variance estimation and the confidence interval. The Hajek approximation is used when Pij = NULL. This argument is used by the intervals "zjackknife", "zalinearization" and "zblinearization". The default value is Pij = NULL.

PiU

A numeric vector with the (population) first inclusion probabilities. The Hartley-Rao (HR) expression for the variance estimation is also computed if this argument is provided.

alpha

A single numeric value between 0 and 1 specifying the confidence level 1-alpha to be used for computing the confidence interval for the Gini index. Some authors call alpha the significance level. The default value is alpha = 0.05.

B

A single integer specifying the number of bootstrap replicates. The default value is B = 1000L.

digitsgini

A single integer specifying the number of decimals used in the estimation of the Gini index and confidence intervals. The default value is digitsgini = 2L.

digitsvar

A single integer specifying the number of decimals used in the variance estimation of the Gini index. The default value is digitsvar = 4L.

na.rm

A 'TRUE/FALSE' logical value indicating whether NA values should be removed before the computation proceeds. The default value is na.rm = TRUE.

plotCI

A 'TRUE/FALSE' logical value indicating whether confidence intervals are compared using a plot. The default value is plotCI = TRUE.

line.types

A numeric vector of length 3 specifying the line types. See the function plot for the different line types. The default value is line.types = c(1L, 2L, 4L).

colors

A vector of length 3 specifying the colors for lines of the plot. The default value is colors = c("red", "green", "blue").

shapes

A numeric vector specifying the point shapes for the limits of intervals. If PiU is missing, the function uses the first two components of shapes, so shapes must have at least length 2. If PiU is provided, shapes must have at least length 3. See the function plot for the different point shapes. The default value is shapes = c(8L, 4L, 3L).

save.plot

A 'TRUE/FALSE' logical value indicating whether the ggplot object of the plot comparing the confidence intervals should be saved in the output. The default value is save.plot = FALSE.

large.sample

A 'TRUE/FALSE' logical value indicating whether the sample is large, in which case a faster algorithm is used to sort the sample values in the computation of the Gini index. The default value is large.sample = FALSE.

Details

For a sample S, with size n and inclusion probabilities \pi_i=P(i\in S) (argument Pi), derived from a finite population U, with size N, different formulations of the Gini index have been proposed in the literature. This function estimates the Gini index, variances and confidence intervals using various formulations. The different methods for estimating the Gini index are (see also Muñoz et al., 2023):

Gini Index formulae.

Method 1 (Langel and Tillé, 2013)

\widehat{G}_{w1}= \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|,

where \widehat{N}=\sum_{i \in S}w_i, \overline{y}_{w}=\widehat{N}^{-1}\sum_{i \in S}w_{i}y_{i}, and w_i are the survey weights. For example, the survey weights can be w_i=\pi_{i}^{-1}. At least one of w and Pi must be provided. When both w and Pi are provided, it is required that w_i = \pi_i^{-1}, for i \in S.
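The double sum of Method 1 can be sketched in base R as follows. This is only an illustration of the formula above, not the code used by the package; the name gini_w1 is hypothetical.

```r
# Illustrative sketch of Method 1; `gini_w1` is not a giniVarCI function.
gini_w1 <- function(y, w) {
  N_hat <- sum(w)                      # estimated population size
  y_bar <- sum(w * y) / N_hat          # weighted mean of y
  abs_diff <- abs(outer(y, y, "-"))    # |y_i - y_j| for every pair (i, j)
  sum(outer(w, w) * abs_diff) / (2 * N_hat^2 * y_bar)
}

y <- c(1, 2, 3, 4)
w <- rep(1, 4)
gini_w1(y, w)   # 0.25
```

The matrices built by outer() have n^2 entries, so this direct form is only practical for small samples.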

Method 2 (Alfons and Templ, 2012; Langel and Tillé, 2013)

\widehat{G}_{w2} =\displaystyle \frac{2\sum_{i \in S}w_{(i)}^{+}\widehat{N}_{(i)}y_{(i)} - \sum_{i \in S}w_{i}^{2}y_{i} }{\widehat{N}^{2}\overline{y}_{w}}-1,

where y_{(i)} are the values y_i sorted in increasing order, w_{(i)}^{+} are the values w_i sorted according to the increasing order of the values y_i, and \widehat{N}_{(i)}=\sum_{j=1}^{i}w_{(j)}^{+}. Langel and Tillé (2013) show that \widehat{G}_{w1} = \widehat{G}_{w2}, so the computation of \widehat{G}_{w1} is omitted in results.
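A minimal base R sketch of the sorted form of Method 2 is given below. The name gini_w2 is hypothetical and the code only mirrors the formula above; the final lines compare it numerically with the double sum of Method 1 on a small weighted sample with a tie.

```r
# Illustrative sketch of Method 2; `gini_w2` is not a giniVarCI function.
gini_w2 <- function(y, w) {
  ord <- order(y)
  ys <- y[ord]                         # y_(i): values sorted increasingly
  ws <- w[ord]                         # w_(i)^+: weights in the same order
  N_hat <- sum(ws)
  y_bar <- sum(ws * ys) / N_hat
  N_cum <- cumsum(ws)                  # N_hat_(i): cumulative weights
  (2 * sum(ws * N_cum * ys) - sum(ws^2 * ys)) / (N_hat^2 * y_bar) - 1
}

y <- c(3, 1, 3)
w <- c(2, 1, 1)
g2 <- gini_w2(y, w)
# Double sum of Method 1, written inline for comparison.
g1 <- sum(outer(w, w) * abs(outer(y, y, "-"))) /
  (2 * sum(w)^2 * (sum(w * y) / sum(w)))
all.equal(g1, g2)
```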

Method 3 (Berger, 2008)

\widehat{G}_{w3} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}^{\ast}(y_{i})-1,

where

\widehat{F}_{w}^{\ast}(t) = \displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}[\delta(y_i < t) + 0.5\delta(y_i = t)]

is the smooth (mid-point) distribution function, and \delta(\cdot) is the indicator variable that takes the value 1 when its argument is true, and 0 otherwise. It can be seen that \widehat{G}_{w2} = \widehat{G}_{w3}, so the computation of \widehat{G}_{w3} is omitted in results.
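The mid-point distribution function and the resulting estimator of Method 3 can be sketched as follows. The names F_mid and gini_w3 are hypothetical; the code is a direct transcription of the two formulas above rather than the package implementation.

```r
# Illustrative sketch of Method 3; the names are not giniVarCI functions.
# Mid-point distribution function evaluated at a single point t.
F_mid <- function(t, y, w) {
  sum(w * ((y < t) + 0.5 * (y == t))) / sum(w)
}

gini_w3 <- function(y, w) {
  N_hat <- sum(w)
  y_bar <- sum(w * y) / N_hat
  F_vals <- vapply(y, F_mid, numeric(1), y = y, w = w)  # F*(y_i) for each i
  2 * sum(w * y * F_vals) / (N_hat * y_bar) - 1
}

gini_w3(c(3, 1, 3), c(2, 1, 1))   # 0.15
```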

Method 4 (Berger and Gedik-Balay, 2020)

\widehat{G}_{w4} = 1 - \displaystyle \frac{\overline{v}_{w}}{\overline{y}_{w}},

where \overline{v}_{w}=\widehat{N}^{-1}\sum_{i \in S}w_{i}v_{i} and

v_{i} = \displaystyle \frac{1}{\widehat{N} - w_{i}}\sum_{ \substack{j \in S\\ j\neq i}}\min(y_{i},y_{j}).

Method 5 (Lerman and Yitzhaki, 1989)

\widehat{G}_{w5} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}} \sum_{i \in S} w_{(i)}^{+}[y_{(i)} - \overline{y}_{w}]\left[ \widehat{F}_{w}^{LY}(y_{(i)}) - \overline{F}_{w}^{LY} \right],

where

\widehat{F}_{w}^{LY}(y_{(i)}) = \displaystyle \frac{1}{\widehat{N}}\left(\widehat{N}_{(i-1)} + \frac{w_{(i)}^{+}}{2} \right)

and \overline{F}_{w}^{LY}=\widehat{N}^{-1}\sum_{i \in S}w_{(i)}^{+}\widehat{F}_{w}^{LY}(y_{(i)}).
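A base R sketch of Method 5 follows. The name gini_w5 is hypothetical, and the sketch assumes \widehat{N}_{(0)} = 0, which the text above does not state explicitly.

```r
# Illustrative sketch of Method 5; `gini_w5` is not a giniVarCI function.
gini_w5 <- function(y, w) {
  ord <- order(y)
  ys <- y[ord]
  ws <- w[ord]
  N_hat <- sum(ws)
  y_bar <- sum(ws * ys) / N_hat
  N_prev <- cumsum(ws) - ws            # N_hat_(i-1), assuming N_hat_(0) = 0
  F_ly <- (N_prev + ws / 2) / N_hat    # F^LY evaluated at y_(i)
  F_bar <- sum(ws * F_ly) / N_hat      # weighted mean of F^LY
  2 * sum(ws * (ys - y_bar) * (F_ly - F_bar)) / (N_hat * y_bar)
}

gini_w5(c(1, 2, 3, 4), rep(1, 4))   # 0.25
```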

Variances and confidence intervals.

For a given estimator \widehat{G}_{w} and variable z, the Horvitz-Thompson type variance estimator (Horvitz and Thompson, 1952) is given by

\widehat{V}_{HT}(\widehat{G}_{w}) = \displaystyle \sum_{i\in S}\sum_{j\in S}\breve{\Delta}_{ij}w_{i}w_{j}z_{i}z_{j},

where

\breve{\Delta}_{ij}=\displaystyle \frac{\pi_{ij}-\pi_{i}\pi_{j}}{\pi_{ij}}

and \pi_{ij} is the second (joint) inclusion probability of the individuals i and j, i.e., \pi_{ij}=P\{(i,j)\in S\} (argument Pij).

The Sen-Yates-Grundy type variance estimator (Sen, 1953; Yates and Grundy, 1953) is defined as

\widehat{V}_{SYG}(\widehat{G}_{w}) = - \displaystyle \frac{1}{2}\sum_{i\in S}\sum_{j\in S}\breve{\Delta}_{ij}(w_{i}z_i-w_{j}z_{j})^{2}.

The Hartley-Rao type variance estimator (Hartley and Rao, 1962) is given by

\widehat{V}_{HR}(\widehat{G}_{w}) = \displaystyle \frac{1}{n-1}\sum_{i\in S}\sum_{\substack{j \in S\\ j < i}}\left(1-\pi_i-\pi_j + \frac{1}{n}\sum_{k\in U}\pi_{k}^{2} \right)(w_{i}z_i-w_{j}z_{j})^{2}.
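The three variance formulas can be sketched in base R as below. The function names are hypothetical, z is taken as a given numeric vector, w_i = 1/\pi_i is used for the weights, and the diagonal of Pij is assumed to hold the first inclusion probabilities (\pi_{ii} = \pi_i). The example uses simple random sampling without replacement, for which all three expressions are expected to reduce to N^2 (1 - n/N) s_z^2 / n.

```r
# Illustrative sketches; these names are not giniVarCI functions.
var_ht <- function(z, Pi, Pij) {
  wz <- z / Pi                                   # w_i * z_i with w_i = 1 / pi_i
  Delta <- (Pij - outer(Pi, Pi)) / Pij           # assumes diag(Pij) equals Pi
  sum(Delta * outer(wz, wz))
}

var_syg <- function(z, Pi, Pij) {
  wz <- z / Pi
  Delta <- (Pij - outer(Pi, Pi)) / Pij
  -0.5 * sum(Delta * outer(wz, wz, "-")^2)
}

var_hr <- function(z, Pi, PiU) {
  n <- length(z)
  wz <- z / Pi
  a_ij <- 1 - outer(Pi, Pi, "+") + sum(PiU^2) / n
  sq <- outer(wz, wz, "-")^2
  keep <- lower.tri(sq)                          # pairs with j < i
  sum(a_ij[keep] * sq[keep]) / (n - 1)
}

# Simple random sampling without replacement: n = 4 units out of N = 10.
n <- 4; N <- 10
z <- c(2, 5, 3, 8)
Pi <- rep(n / N, n)
Pij <- matrix(n * (n - 1) / (N * (N - 1)), n, n)
diag(Pij) <- Pi
PiU <- rep(n / N, N)
c(HT = var_ht(z, Pi, Pij), SYG = var_syg(z, Pi, Pij), HR = var_hr(z, Pi, PiU))
```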

Note that the Horvitz-Thompson variance estimator can give negative values. We observe that both Horvitz-Thompson and Sen-Yates-Grundy variance estimators depend on second (joint) inclusion probabilities (argument Pij). The Hajek (1964) approximation

\pi_{ij}\cong \pi_{i}\pi_{j}\left[1- \displaystyle \frac{(1-\pi_{i})(1-\pi_{j})}{\sum_{k \in S}(1-\pi_{k})} \right]

is used when the second (joint) inclusion probabilities are not available (Pij = NULL). Note that the Hajek approximation is suggested for large-entropy sampling designs, large samples, and large populations (see Tillé, 2006; Berger and Tillé, 2009; Haziza et al., 2008; Berger, 2011). For instance, this approximation is not recommended for highly-stratified samples (Berger, 2005). The Hartley-Rao variance estimator requires the first inclusion probabilities at the population level (argument PiU).

zjackknife computes the confidence interval based on the jackknife technique with critical values based on the Normal approximation. zalinearization and zblinearization compute the confidence intervals based on the linearization technique applied to the estimators

\widehat{G}_{w}^{a} = \widehat{G}_{w1}

and

\widehat{G}_{w}^{b} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}(y_{i})-1,

respectively, where

\widehat{F}_{w}(t)=\frac{1}{\widehat{N}}\sum_{i \in S}w_i\delta(y_i \leq t).

Critical values are also based on the Normal approximation. pbootstrap computes the variance using the rescaled bootstrap, and the confidence interval is constructed using the percentile method. The vignette vignette("GiniVarInterval") contains a detailed description of the various methods for variance estimation and confidence intervals for the Gini index.
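The Hajek approximation described above can be sketched in base R as follows. The name hajek_pij is hypothetical, and setting the diagonal to the first inclusion probabilities is an assumption of this sketch, not something stated in the text.

```r
# Illustrative sketch of the Hajek approximation; not a giniVarCI function.
hajek_pij <- function(Pi) {
  d <- sum(1 - Pi)                               # sum over the sample of (1 - pi_k)
  Pij <- outer(Pi, Pi) * (1 - outer(1 - Pi, 1 - Pi) / d)
  diag(Pij) <- Pi                                # assumption: pi_ii = pi_i
  Pij
}

Pi <- c(0.2, 0.5, 0.8)
hajek_pij(Pi)
```

The result is a symmetric matrix that could be passed where a matrix of joint inclusion probabilities is expected.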

The following table summarises the various types of variances and confidence intervals that the function fcompareCI computes.

Interval         Variance            Critical values       References
_______________  __________________  ____________________  _____________________________
zjackknife       Jackknife           Normal                Berger (2008)
zalinearization  Linearization       Normal                Langel and Tillé (2013)
zblinearization  Linearization       Normal                Berger (2008)
pbootstrap       Rescaled bootstrap  Percentile bootstrap  Berger and Gedik-Balay (2020)
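For the intervals with Normal critical values, the limits can be sketched with the usual two-sided Normal approximation. The name normal_ci is hypothetical and the input values are made up for illustration; any rounding or truncation that fcompareCI may apply to the limits is not reproduced here.

```r
# Illustrative sketch of a Normal-approximation interval; not a giniVarCI function.
normal_ci <- function(gini, var.gini, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)                 # Normal critical value
  half_width <- z_crit * sqrt(var.gini)
  c(lowerlimit = gini - half_width, upperlimit = gini + half_width)
}

# Hypothetical estimate and variance, chosen only to show the call.
normal_ci(gini = 0.30, var.gini = 0.0004)
```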

Value

If save.plot = FALSE, a data frame with columns:

  1. interval. The method used to construct the confidence interval.

  2. method. The method used to estimate the Gini index.

  3. varformula. The type of formula for the variance estimator. Possible values are HT and SYG if argument PiU is missing, and HT, SYG and HR if argument PiU is provided.

  4. gini. The estimation of the Gini index.

  5. lowerlimit. The lower limit of the confidence interval.

  6. upperlimit. The upper limit of the confidence interval.

  7. var.gini. The variance estimation for the estimator of the Gini index.

If save.plot = TRUE, a list with two components: (i) 'base.CI', a data frame with the seven columns just described, and (ii) 'plot', a (ggplot) description of the plot, which is a list with components that contain the plot itself, the data, information about the scales, panels, etc. As a side-effect, a plot that compares the various methods for constructing confidence intervals for the Gini index is displayed. The package ggplot2 must be installed for this option to work.

If plotCI = TRUE, as a side-effect, a plot that compares the various methods for constructing confidence intervals for the Gini index is displayed. The package ggplot2 must be installed for this option to work.

Author(s)

Juan F Munoz jfmunoz@ugr.es

Jose M Pavia pavia@uv.es

Encarnacion Alvarez encarniav@ugr.es

References

Alfons, A., and Templ, M. (2012). Estimation of social exclusion indicators from complex surveys: The R package laeken. KU Leuven, Faculty of Business and Economics Working Paper.

Berger, Y. G. (2005). Variance estimation with highly stratified sampling designs with unequal probabilities. Australian & New Zealand Journal of Statistics, 47, 365–373.

Berger, Y. G. (2008). A note on the asymptotic equivalence of jackknife and linearization variance estimation for the Gini Coefficient. Journal of Official Statistics, 24(4), 541-555.

Berger, Y. G. (2011). Asymptotic consistency under large entropy sampling designs with unequal probabilities. Pakistan Journal of Statistics, 27, 407–426.

Berger, Y., and Gedik-Balay, İ. (2020). Confidence intervals of Gini coefficient under unequal probability sampling. Journal of Official Statistics, 36(2), 237-249.

Berger, Y. G. and Tillé, Y. (2009). Sampling with unequal probabilities. In Sample Surveys: Design, Methods and Applications (eds. D. Pfeffermann and C. R. Rao), 39–54. Elsevier, Amsterdam.

Hajek, J. (1964). Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35, 4, 1491–1523.

Hartley, H. O., and Rao, J. N. K. (1962). Sampling with unequal probabilities and without replacement. The Annals of Mathematical Statistics, 350-374.

Haziza, D., Mecatti, F. and Rao, J. N. K. (2008). Evaluation of some approximate variance estimators under the Rao-Sampford unequal probability sampling design. Metron, LXVI, 91–108.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.

Langel, M., and Tillé, Y. (2013). Variance estimation of the Gini index: revisiting a result several times published. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(2), 521-540.

Lerman, R. I., and Yitzhaki, S. (1989). Improving the accuracy of estimates of Gini coefficients. Journal of Econometrics, 42(1), 43-47.

Muñoz, J. F., Moya-Fernández, P. J., and Álvarez-Verdejo, E. (2023). Exploring and Correcting the Bias in the Estimation of the Gini Measure of Inequality. Sociological Methods & Research. https://doi.org/10.1177/00491241231176847

Sen, A. R. (1953). On the estimate of the variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5, 119–127.

Tillé, Y. (2006). Sampling Algorithms. Springer, New York.

Yates, F., and Grundy, P. M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society B, 15, 253–261.

See Also

fgini, fginindex

Examples

# Income and weights (region 'Burgenland') from the 2006 Austrian EU-SILC (Package 'laeken').
data(eusilc, package="laeken")
y <- eusilc$eqIncome[eusilc$db040 == "Burgenland"]
w <- eusilc$rb050[eusilc$db040 == "Burgenland"]

# Estimation of the Gini index and confidence intervals using different methods.
fcompareCI(y, w)

# A smaller sample of ten incomes and weights, computed without the comparison plot.
y <- c(30428.83, 14976.54, 18094.09, 29476.79, 20381.93, 6876.17,
       10360.96, 8239.82, 29476.79, 32230.71)
w <- c(357.86, 480.99, 480.99, 476.01, 498.58, 498.58, 476, 498.58, 476.01, 476.01)
fcompareCI(y, w, plotCI = FALSE)

[Package giniVarCI version 0.0.1-3 Index]