stratification-package {stratification} | R Documentation |
Collection of Functions for Univariate Stratification of Survey Populations
Description
This package contains various functions for univariate stratification of survey populations. The well-known cumulative root frequency rule by Dalenius and Hodges (1959) and the geometric rule by Gunning and Horgan (2004) are implemented. However, the main function implements a generalized Lavallee-Hidiroglou (1988) method of strata construction. It can be used with Sethi's (1963) or Kozak's (2004) algorithm. The generalized method takes into account a discrepancy between the stratification variable X and the survey variable Y. The method can consider a loglinear model with mortality between the variables (Baillargeon, Rivest and Ferland, 2007). When Kozak's algorithm is used, two additional models are available: a heteroscedastic linear model and a random replacement model as in Rivest (2002). The determination of the optimal boundaries can also incorporate, if desired, an anticipated non-response, a take-all stratum for the large units and a take-none stratum for the small units. Moreover, units can be forced to be part of the sample by specifying a certainty stratum.
Details
Package: | stratification |
Type: | Package |
Version: | 2.2-7 |
Date: | 2022-04-06 |
License: | GPL-2 |
OVERVIEW OF THE FUNCTIONS
To determine the stratum sample sizes given a set of stratum boundaries:
strata.bh
To determine the stratum boundaries first and, in a second step, the stratum sample sizes:
strata.cumrootf: cumulative root frequency method by Dalenius and Hodges (1959)
strata.geo: geometric method by Gunning and Horgan (2004)
To determine the optimal stratum boundaries and sample sizes in a single step:
strata.LH: generalized Lavallee-Hidiroglou method with Sethi's (1963) or Kozak's (2004) algorithm
All these functions create an object of class "strata", which can be visualized with the S3 methods print.strata and plot.strata. One can also apply, with the function var.strata, a stratified design to a survey variable Y different from the one used for the construction of the stratified design.
INFORMATION RELATIVE TO MANY FUNCTIONS
The functions strata.bh, strata.cumrootf, strata.geo and strata.LH need to be given:
x, the values of the stratification variable X,
Ls, the desired number of sampled strata,
alloc, an allocation rule, and
a target sample size n or a target level of precision CV for the survey estimator.
However, for Sethi's (1963) algorithm, only a target CV can be given. To reach a target n using the generalized Lavallee-Hidiroglou method, Kozak's (2004) algorithm has to be used with the strata.LH function.
TYPE OF STRATUM
In this package, four types of stratum exist: take-some, take-none, take-all and certainty. A take-some stratum is a stratum in which some units are sampled. A take-none stratum is a stratum for the smallest units in which no units are sampled. Its purpose is to ignore very small units. On the other hand, a take-all stratum is a stratum for the largest units in which every unit is sampled. It ensures that the biggest units are in the sample. The following paragraph explains the special stratum type called “certainty”.
DEFINITION OF THE CERTAINTY STRATUM
It is possible to ensure that some specific units are included in the sample with the argument certain. This argument is a vector containing the positions in the vector x of the units to be included with certainty in the sample. We say that these units form the certainty stratum. They are excluded from the population prior to the determination of the stratum boundaries, but they are accounted for in the calculation of the anticipated mean, the RRMSE, the total sample size and the optimization criteria. Essentially, these units form their own separate take-all stratum that is not subject to stratification. They do not have to be consecutive units according to the stratification variable, therefore their variance is meaningless. Non-response is not possible in the certainty stratum. The functions return a value named certain.info containing the number of units in the certainty stratum and their anticipated mean.
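The separation of the certainty units from the rest of the population can be illustrated with a small sketch. It is written in Python for illustration only and does not call the package; the function name split_certainty and the keys "count" and "mean" are hypothetical, and the anticipated mean is computed under the assumption Y=X (no model).

```python
def split_certainty(x, certain):
    """Split x into certainty units and the units left for stratification.

    `certain` holds 1-based positions in x, as an R user would supply them.
    """
    certain_idx = {pos - 1 for pos in certain}  # convert to 0-based indices
    certainty_units = [v for i, v in enumerate(x) if i in certain_idx]
    remaining = [v for i, v in enumerate(x) if i not in certain_idx]
    # Summary comparable in spirit to certain.info (key names are assumptions).
    info = {
        "count": len(certainty_units),
        "mean": sum(certainty_units) / len(certainty_units) if certainty_units else None,
    }
    return remaining, info

x = [12, 5000, 40, 7, 950, 88]
remaining, info = split_certainty(x, certain=[2, 5])
print(remaining)  # units that would still be stratified
print(info)       # size and mean of the certainty stratum
```

The certainty units need not be adjacent in the ordering of x, which is why only their count and mean are summarized here.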
NUMBER OF STRATA
The Ls
argument represents to the number of sampled strata. The term “sampled strata” refers to take-some and take-all strata only. Therefore, take-none and certain strata are not counted in Ls
. If the stratified design does not have a take-none stratum then Ls
=L
is the total number of strata, otherwise Ls
=L-1
. In the total number of strata L
, the certainty stratum, if any, is not counted since we do not need to find its boundaries.
STRATUM NUMBERING
Throughout the package, stratum number 1 contains the smallest units and stratum number L the biggest ones. So every vector of boundaries contains numbers in ascending order. The function strata.bh must be given boundaries bh fulfilling this condition. This remark also applies to the argument initbh of strata.LH used to give initial boundaries for the optimization algorithm. If a take-none stratum is requested, it is always the first one. On the other hand, if a take-all stratum is requested, it is always the last one.
DEFINITION OF STRATUM BOUNDARIES
Let b_0, b_1,\ldots, b_L denote the stratum boundaries. Stratum h contains all the units with an X-value in the interval [b_{h-1},b_h) for h=1,\ldots,L such that b_0=min(X) and b_L=max(X)+1, where min(X) and max(X) are respectively the minimum and the maximum values of the stratification variable. The argument bh of strata.bh, the argument initbh of strata.LH and the output value bh of any function of the package stratification with the prefix "strata" are length L-1 vectors of the boundaries b_1, b_2,\ldots, b_{L-1}.
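To make the half-open intervals concrete, here is a minimal sketch of how units could be assigned to strata from a vector of L-1 inner boundaries. It is written in Python for illustration and is not the package's code; the function name assign_strata and the data are hypothetical.

```python
from bisect import bisect_right

def assign_strata(x, bh):
    """Return the 1-based stratum number of each value in x.

    bh holds the L-1 inner boundaries b_1 < ... < b_{L-1}. Stratum h is the
    interval [b_{h-1}, b_h), with b_0 = min(x) and b_L = max(x) + 1 implied.
    """
    if list(bh) != sorted(bh):
        raise ValueError("boundaries must be in ascending order")
    # bisect_right counts the boundaries <= value, so a value equal to b_h
    # lands in stratum h + 1, matching the closed-left, open-right intervals.
    return [bisect_right(bh, value) + 1 for value in x]

x = [2, 5, 10, 10, 37, 80, 250]
bh = [10, 80]  # two inner boundaries, hence L = 3 strata
strata = assign_strata(x, bh)
print(strata)
```

A unit whose X-value equals a boundary b_h belongs to the stratum above it, which is why the values 10 and 80 in this example fall in strata 2 and 3 respectively.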
DETAILS ABOUT THE TAKE-NONE STRATUM
A non-empty take-none stratum induces a bias in the estimator of the mean of Y, and the precision is measured by the relative root mean squared error (RRMSE), not by the coefficient of variation (CV). Regardless, in the functions the argument given to specify a target precision for the survey estimator is always named CV. However, in the output, the anticipated level of precision is named RRMSE for the functions accepting a takenone argument (strata.bh and strata.LH), and it is named CV for the other functions (strata.cumrootf and strata.geo).
When a takenone stratum is requested, one can specify a bias.penalty argument. We define the mean squared error for the estimator of the mean of Y by MSE = (bias.penalty \times bias)^2 + variance. It is sometimes possible to estimate the bias using the sum of the Y values in the take-none stratum from administrative data. In this situation, it might be appropriate to set bias.penalty to a value lower than 1. This will typically enlarge the take-none stratum. The value given to bias.penalty depends on the confidence level we have in the bias estimate. By default, it is assumed that no bias estimate is available and the whole bias contributes to the MSE (bias.penalty=1).
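The effect of bias.penalty on the precision measure can be checked with a short numerical sketch. It is written in Python for illustration, with hypothetical input values; dividing the root MSE by the anticipated mean of Y to obtain the RRMSE is an assumption based on the usual definition of a relative root mean squared error.

```python
import math

def anticipated_rrmse(bias, variance, mean_y, bias_penalty=1.0):
    """Relative root MSE with MSE = (bias_penalty * bias)^2 + variance."""
    mse = (bias_penalty * bias) ** 2 + variance
    # Assumption: "relative" means relative to the anticipated mean of Y.
    return math.sqrt(mse) / mean_y

# Hypothetical values: bias of 3, variance of 16, anticipated mean of 100.
full_penalty = anticipated_rrmse(bias=3.0, variance=16.0, mean_y=100.0)
no_penalty = anticipated_rrmse(bias=3.0, variance=16.0, mean_y=100.0, bias_penalty=0.0)
print(full_penalty, no_penalty)
```

With bias_penalty=0 the bias is ignored entirely and the measure reduces to the CV, so lowering the penalty makes a given take-none stratum look less costly.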
SPECIFICATION OF THE ALLOCATION RULE
The alloc argument must be a list containing the numeric objects q1, q2 and q3 which specify the allocation rule according to the general allocation scheme presented in Hidiroglou and Srinath (1993):
a_h = \frac{\gamma_h}{\sum_{\mbox{take-some}} \gamma_h} \qquad \mbox{where} \qquad
\gamma_h=N_h^{2q_1}\bar{Y}_h^{2q_2}S_{yh}^{2q_3}.
Stratum sample sizes are calculated as:
{n_h}_{\mbox{nonint}} = \left\{ \begin{array}{ll}
0 & \mbox{for take-none strata}\\
n \times a_h & \mbox{for take-some strata}\\
N_h & \mbox{for take-all strata}\end{array} \right.
A proportional allocation is obtained when q1=0.5 and q2=q3=0,
a power allocation is obtained when q1=q2=p/2 and q3=0, and
a Neyman allocation (the default) is obtained when q1=q3=0.5 and q2=0.
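The allocation scheme can be illustrated numerically. The sketch below is written in Python for illustration and is not the package's implementation; the function name and the stratum summaries are hypothetical, and the design is assumed to contain take-some strata only.

```python
def allocation_fractions(Nh, ybar, Sy, q1=0.5, q2=0.0, q3=0.5):
    """Fractions a_h = gamma_h / sum(gamma_h) over the take-some strata,

    with gamma_h = Nh^(2*q1) * ybar_h^(2*q2) * Sy_h^(2*q3).
    The defaults correspond to the Neyman allocation.
    """
    gamma = [N ** (2 * q1) * m ** (2 * q2) * s ** (2 * q3)
             for N, m, s in zip(Nh, ybar, Sy)]
    total = sum(gamma)
    return [g / total for g in gamma]

# Hypothetical stratum sizes, means and standard deviations.
Nh = [600, 300, 100]
ybar = [10.0, 50.0, 200.0]
Sy = [4.0, 16.0, 64.0]

proportional = allocation_fractions(Nh, ybar, Sy, q1=0.5, q2=0.0, q3=0.0)
neyman = allocation_fractions(Nh, ybar, Sy)
power = allocation_fractions(Nh, ybar, Sy, q1=0.25, q2=0.25, q3=0.0)  # p = 0.5

n = 50
nhnonint = [n * a for a in neyman]  # real-valued sizes, before rounding
print(proportional, neyman, power, nhnonint)
```

In this example the Neyman rule shifts sample toward the small but highly variable third stratum, whereas the proportional rule follows the stratum sizes alone.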
ROUNDING OF THE STRATUM SAMPLE SIZES
Applying the allocation rule above gives real (non-integer) values for the sample sizes. These are named nhnonint in the package. The nhnonint values have to be rounded to get the integer sample sizes, named nh in the package. The rounding is done as follows. If a target CV is requested, the values are simply rounded up to the next integer. However, if a target n is requested, the rounding is a little more complicated because the nh should sum to the target n and we do not want positive nh smaller than 1 to be rounded to zero. Therefore, we first round to 1 the positive nh smaller than 1. Then we calculate how many values (say nup) must be rounded up and how many must be rounded down in order to fulfill the condition sum(nh)=n. We choose the nup values with the largest decimal part for the ceiling rounding; the other nh are rounded down.
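The rounding steps described above can be sketched as follows. This is a Python illustration of the described procedure, not the package's code; the function names are hypothetical, and the handling of the case where the forced values of 1 already push the total above n (nup clamped at zero) is an assumption.

```python
import math

def round_for_target_cv(nhnonint):
    """Target CV: every value is rounded up."""
    return [math.ceil(v) for v in nhnonint]

def round_for_target_n(nhnonint, n):
    """Target n: round so that the integer sizes sum to n when possible."""
    # Step 1: positive values below 1 are set to 1 so they are not lost.
    adjusted = [1.0 if 0 < v < 1 else float(v) for v in nhnonint]
    floors = [math.floor(v) for v in adjusted]
    decimals = [v - f for v, f in zip(adjusted, floors)]
    # Step 2: number of values that must be rounded up to reach n.
    # Assumption: if step 1 already overshoots n, nothing is rounded up.
    nup = max(0, n - sum(floors))
    # Step 3: round up the nup values with the largest decimal parts.
    order = sorted(range(len(adjusted)), key=lambda i: decimals[i], reverse=True)
    nh = list(floors)
    for i in order[:nup]:
        nh[i] += 1
    return nh

nhnonint = [0.4, 3.7, 5.9]
nh_cv = round_for_target_cv(nhnonint)
nh_n = round_for_target_n(nhnonint, n=10)
print(nh_cv, nh_n)
```

In the target n case, the first value is forced to 1, and only the value with the largest decimal part (5.9) is rounded up, so the total stays at 10.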
ADJUSTMENT FOR A TAKE-ALL STRATUM
If, after applying the allocation rule, the stratified design contains at least one take-some stratum with {n_h}_{\mbox{nonint}} > N_h, the allocation is done again setting the take-some stratum with the largest units as a take-all stratum. This is repeated until {n_h}_{\mbox{nonint}} \leq N_h for all the take-some strata or until there is only one take-some stratum left. This adjustment is done automatically throughout the package because the target n or CV might not be reached without it. Only the function strata.bh allows the adjustment to be skipped (argument takeall.adjust).
Note: In special circumstances, the algorithm might result in more than one take-all stratum. If the non-response rate does not vary among the take-all strata, we can see them as forming one big take-all stratum. Otherwise, their boundaries influence the value of the optimization criteria (n or CV). So in the case of a varying non-response rate among the take-all strata, we cannot see them as forming one big take-all stratum.
SPECIFICATION OF A MODEL BETWEEN Y AND X
Every function can take into account a discrepancy between the stratification variable X and the survey variable Y. The functions strata.bh, strata.cumrootf and strata.geo perform allocation on the basis of anticipated moments whereas the strata.LH function goes further; it determines the optimal boundaries considering the anticipated moments. The following models for the relationship between Y and X can be specified through the model and model.control arguments:
- loglinear model with mortality (model="loglinear"):
Y = \left\{ \begin{array}{ll}
\exp(\alpha + \texttt{beta} \, \log(X) + \texttt{epsilon}) & \mbox{with probability } p_h\\
0 & \mbox{with probability } 1-p_h \end{array} \right.
where \texttt{epsilon} \sim N(0,\texttt{sig2}) is independent of X. The parameter p_h is specified through ph, ptakenone and pcertain (elements of model.control). Note: The \alpha parameter does not have to be specified because \exp(\alpha) is a multiplicative factor that has no impact on the outcome.
- heteroscedastic linear model (model="linear"):
Y = \texttt{beta} \, X + \texttt{epsilon}
where \texttt{epsilon} \sim N(0,\texttt{sig2} \, X^{\texttt{gamma}}).
- random replacement model (model="random"):
Y = \left\{ \begin{array}{ll}
X & \mbox{with probability } 1-\texttt{epsilon} \\
Xnew & \mbox{with probability } \texttt{epsilon} \end{array} \right.
where Xnew is a random variable independent of X having the same distribution as X.
The model.control argument is a list that can supply any of the following model parameters:
beta
A numeric: the slope of the "loglinear" or "linear" model. The default is 1.
sig2
A numeric: the variance parameter of the "loglinear" or "linear" model. The default is 0.
ph
A vector giving the survival rate in each of the Ls sampled strata for the "loglinear" model. A single number can be given if the rate doesn't vary among strata. The default is 1 in each stratum.
ptakenone
A numeric: the survival rate in the take-none stratum, if a take-none stratum is added to the stratified design. The default is 1.
pcertain
A numeric: the survival rate in the certainty stratum, if a certainty stratum is added to the stratified design. The default is 1.
gamma
A numeric: the exponent of X in the residual variance of the "linear" model. The default is 0.
epsilon
A numeric: the probability that the Y-value for a unit is equal to the X-value for a randomly selected unit in the population. It concerns the "random" model only. The default is 0.
Note: The default values of the parameters simplify any model to Y=X. Therefore, the default is always to consider that there is no discrepancy between the stratification and the survey variables. The model argument even has the default value "none", which also means Y=X.
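The three models of this section, and the fact that their default parameters reduce to Y=X, can be illustrated by simulation. The sketch below is written in Python for illustration and is not how the package works internally (the package relies on anticipated moments rather than simulated values). The function simulate_y is hypothetical, \alpha is set to 0, a single survival rate ph is used for all strata, and Xnew is drawn from the observed x values; all of these are simplifying assumptions.

```python
import math
import random

def simulate_y(x, model="none", rng=None, beta=1.0, sig2=0.0,
               ph=1.0, gamma=0.0, epsilon=0.0):
    """Draw one Y-value per unit from the chosen model (x must be positive)."""
    rng = rng or random.Random()
    if model == "none":
        return list(x)
    if model == "loglinear":
        y = []
        for v in x:
            if rng.random() < ph:  # the unit survives with probability ph
                noise = rng.gauss(0.0, math.sqrt(sig2))
                y.append(math.exp(beta * math.log(v) + noise))  # alpha = 0 assumed
            else:
                y.append(0.0)  # mortality: the unit contributes nothing
        return y
    if model == "linear":
        # Residual variance sig2 * X^gamma grows with X when gamma > 0.
        return [beta * v + rng.gauss(0.0, math.sqrt(sig2 * v ** gamma)) for v in x]
    if model == "random":
        # With probability epsilon, Y is the X-value of a randomly chosen unit.
        return [rng.choice(x) if rng.random() < epsilon else v for v in x]
    raise ValueError("unknown model: " + model)

x_demo = [1.0, 10.0, 250.0]
rng = random.Random(1)
defaults = {m: simulate_y(x_demo, model=m, rng=rng)
            for m in ("none", "loglinear", "linear", "random")}
print(defaults)
```

With the default parameter values every model returns the x values themselves, up to floating-point error in the loglinear case.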
Author(s)
Sophie Baillargeon Sophie.Baillargeon@mat.ulaval.ca and
Louis-Paul Rivest Louis-Paul.Rivest@mat.ulaval.ca
References
Baillargeon, S., Rivest, L.-P., Ferland, M. (2007). Stratification en enquetes entreprises : Une revue et quelques avancees. Proceedings of the Survey Methods Section, 2007 SSC Annual Meeting.
Baillargeon, S. and Rivest, L.-P. (2009). A general algorithm for univariate stratification. International Statistical Review, 77(3), 331-344.
Baillargeon, S. and Rivest, L.-P. (2011). The construction of stratified designs in R with the package stratification. Survey Methodology, 37(1), 53-65.
Dalenius, T. and Hodges, J.L., Jr. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88-101.
Gunning, P. and Horgan, J.M. (2004). A new algorithm for the construction of stratum boundaries in skewed populations. Survey Methodology, 30(2), 159-166.
Hidiroglou, M.A. and Srinath, K.P. (1993). Problems associated with designing subannual business surveys. Journal of Business & Economic Statistics, 11, 397-405.
Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. Statistics in Transition, 6(5), 797-806.
Lavallee, P. and Hidiroglou, M.A. (1988). On the stratification of skewed populations. Survey Methodology, 14, 33-43.
Rivest, L.-P. (2002). A generalization of the Lavallee and Hidiroglou algorithm for stratification in business surveys. Survey Methodology, 28(2), 191-198.
Sethi, V. K. (1963). A note on optimum stratification of populations for estimating the population means. The Australian Journal of Statistics, 5, 20-33.