strata.LH {stratification} | R Documentation |
Generalized Lavallee-Hidiroglou Method of Strata Construction
Description
This function aims at constructing optimal strata with a generalized Lavallee-Hidiroglou (1998) method.
The function uses Kozak's (2004) algorithm by default, but it can also apply Sethi's (1963) algorithm (argument algo="Sethi"
). The function can take into account a discrepancy between the stratification variable X
and the survey variable Y
. It can consider a loglinear model with mortality between the variables (Baillargeon and Rivest, 2009). With Kozak's algorithm, two additional models are implemented: an heteroscedastic linear model and a random replacement model as in Rivest (2002). The determination of the optimal boundaries also incorporates, if desired, an anticipated non-response, a take-all stratum for the large units, a take-none stratum for the small units, and a certainty stratum to ensure that some specific units are in the sample.
Sethi's algorithm is not used by default because it can be numerically unstable, especially with a take-none stratum. Better results were obtained with Kozak's algorithm in our numerical experiments.
Usage
strata.LH(x, n = NULL, CV = NULL, Ls = 3, certain = NULL,
alloc = list(q1 = 0.5, q2 = 0, q3 = 0.5), takenone = 0,
bias.penalty = 1, takeall = 0, rh = rep(1, Ls),
model = c("none", "loglinear", "linear", "random"),
model.control = list(), initbh = NULL,
algo = c("Kozak", "Sethi"), algo.control = list())
Arguments
x |
A vector containing the values of the stratification variable |
n |
A numeric: the target sample size. It has no default value. The argument |
CV |
A numeric: the target coefficient of variation or relative root mean squared error if |
Ls |
A numeric: the number of sampled strata (take-none and certain strata are not counted in |
certain |
A vector giving the position, in the vector |
alloc |
A list specifying the allocation scheme. The list must contain 3 numerics for the 3 exponents |
takenone |
A numeric: the number of take-none strata (0 or 1). The default is 0, i.e. no take-none stratum is included. |
bias.penalty |
A numeric between 0 and 1 giving the penalty for the bias in the anticipated mean squared error (MSE) of the survey estimator (see |
takeall |
A numeric: the number of take-all strata (one of {0, 1, ..., |
rh |
A vector giving the anticipated response rates in each of the |
model |
A character string identifying the model used to describe the discrepancy between the stratification variable |
model.control |
A list of model parameters (see |
initbh |
A vector of initial stratum boundaries (see Details). |
algo |
A character string identifying which optimization algorithm is to be used. It can be "Kozak" for Kozak's (2004) algorithm or "Sethi" for Sethi's (1963) algorithm (see Details). The default is "Kozak" because it performs better and offers more options than Sethi's algorithm. |
algo.control |
A list of parameters to control the optimization algorithm (see Details). |
Details
CALCULATION OF THE OPTIMIZATION CRITERIA
Kozak's and Sethi's algorithms aim at finding optimal stratum boundaries. This optimality refers to a criteria :
the total sample size n
if a target CV
was given
the RRMSE if a target n
was given (not available for Sethi's algorithm).
This criteria, no matter if it is n
or the RRMSE, is a function of the stratum sample sizes, among others. But these stratum sample sizes can be non-integer (nhnonint
obtained directly from applying the allocation rule) or integer (nh
obtained from rounding the nhnonint
), as mentioned in stratification-package
. The optimization criteria does not vary much depending on the sample size used, but it is not exactly the same. Let's call opti.nh
the criteria calculated with nh
and opti.nhnonint
the criteria calculated with nhnonint
. Optimizing based on opti.nh
vs opti.nhnonint
might not give the same result. In fact, according to our numerical tests, it is a little better to optimize based on opti.nh
. This is logical since the sample sizes used in practice, to apply a stratified design, are the integer one. Therefore, as of version 2.1-0 of the package, Kozak's algorithm uses by default the integer sample sizes nh
to calculate the optimization criteria (but it can also use opti.nhnonint
by setting idopti="nhnonint"
in algo.control
). However, Sethi's algorithm works with derivatives to perform optimization. Therefore, it does not treat the optimization as a discrete problem as Kozak's algorithm does. It can only work with a real optimization criteria, i.e. the criteria opti.nhnonint
calculated with non-integer sample size. This is one more reason to favour Kozak's algorithm over Sethi's one.
INITIAL STRATUM BOUNDARIES
Both Kozak's and Sethi's algorithm are iterative and they need starting values. It can be given by the user with the initbh
argument. If not given, the default initial boundaries are obtained with the function strata.cumrootf
(cumulative root frequency method) for Kozak's algorithm. For Sethi's algorithm, the default value of initbh
is the set of Ls
-1 equidistant points along the range of the X
-values (arithmetic starting points of Gunning and Horgan (2007)). Let's define d=(max(X)-min(X))/Ls
, the arithmetic boundaries are b_h = min(X) + h \times d
for h=1,\ldots,Ls-1
.
The default initial boundaries are not the same for the two algorithms because Kozak's algorithm uses by default the criteria opti.nh
whereas Sethi's algorithm can only use opti.nhnonint
. In our numerical experiments with the criteria opti.nh
, the cumulative root frequency boundaries performed a little better than geometric, arithmetic or robust boundaries (for a definition of theses boundaries, see details about the trymany
algorithm parameter). On the other hand, the arithmetic boundaries had good performances (Baillargeon and Rivest 2009) with the criteria opti.nhnonint
. Let's note however that we never found, in any of our numerical experiments, initial boundaries clearly better than others.
The length of the vector initbh
can be Ls
-1 or L
-1, where L
is the total number of strata. When takenone=0
, L
=Ls
and it does not make any difference. But for a stratified design with a take-none stratum, it means that one can give as initbh
argument the vector of initial boundaries (b_1, b_2, \ldots, b_{L-1})
or (b_2, \ldots, b_{L-1})
. In the second option, the upper boundary of the take-none stratum is not given. In that case, it is set by default to
the first percentile of x
. If this first percentile is equal to the minimum value of x
, this initial
boundary would lead to an empty take-none stratum. In that case, the initial boundary of the take-none stratum is
rather set to the second smallest value of x
.
Sometimes, the specified initial boundaries are not suitable for Kozak's algorithm. As written below, this algorithm
verifies at each iteration that the sampled strata contains at least minNh
units and that they have positive
sample sizes nh
. If the initial boundaries do not meet these conditions, the initial optimization criteria is not comparable to the criteria for boundaries respecting the conditions. Therefore, in such a situation, a warning is produced and the algorithm is not run.
ALGORITHMS
Sethi: The formulas implemented for Sethi's algorithm are presented in Baillargeon, Rivest and Ferland (2007). As mentioned previously, this algorithm is available for a target CV only. Moreover, it works with derivatives to perform optimization. Therefore, it can only work with a real optimization criteria (opti.nhnonint
).
Kozak: Kozak's algorithm is described in Kozak (2004). It starts at the initial boundaries initbh
and at each iteration, it chooses a stratum boundary at random and a random modification for this boundary among the 2*maxstep
possible alternatives. If this modification reduces the optimization criteria, creates sampled strata containing at least minNh
units and leads to positive nh
, it is accepted. Otherwise, the boundaries are not changed. If the boundaries remain unchanged for maxstill
consecutive iterations, the algorithm has reached convergence and it stops.
Kozak's (2004) used the condition nh
>= 2 instead of nh
> 0. We chose to modify this requirement because we noticed that more severe conditions sometimes prevented the algorithm from selecting a path leading to the optimal solution. According to our numerical experiments, if one sets a posteriori to 2 the nh
's that are equal to 1 at the end of the algorithm, the new sample size is smaller than or equal to the one obtained with the condition nh
>= 2.
If no initbh
argument is given, the algorithm is run with three different sets of initial boundaries (see details of the algorithm parameter trymany
). If initbh
is given, it uses these initial boundaries. Bot no matter how many initial boundaries are tried, the algorithm is run rep
times with each. Running the algorithm many times helps having more stable results. Since the algorithm is random, two runs of the algorithm, event from the same starting point, can lead to different results. Setting rep
to a large value and trying more than one set of initial boundaries increases the calculation time, but also increases the chance of finding the global minimum. also helps in finding the global minimum.
The optimization of stratum boundaries given a target CV or a target n
is a discrete problem. The number of possible sets of boundaries is choose(Nu-1,L-1)
when takenone=0
, and choose(Nu,L-1)
when takenone=1
, where Nu
stands for the number of unique values in the x
-vector from which units in the certainty stratum, if any, heve been removed. When this number is not too large, trying them all is numerically possible. That's what the function strata.LH
do, for Kozak's algorithm, when the number of possible solutions is lower than the minsol
parameter in algo.control
. This complete enumeration of all possible cases ensure that the global minimum is reached.
Note: Since version 2.2-0 of the package stratification, the modified Kozak's algorithm, which is a non-random version of Kozak's algorithm, is no more available because it never performed as well as the original Kozak's algorithm in our numerical experiments.
ALGORITHM PARAMETERS
The algo.control
argument is a list to supply any of the following parameters which control the algorithm. Sethi's algorithm only uses the first argument maxiter
.
maxiter
A numeric: the maximal number of iterations. The default is 500 for Sethi's algorithm and 10 000 for Kozak's. It is only used to prevent from infinite loops in case of non-convergence.
minsol
A numeric for Kozak's algorithm only: if the number of possible sets of boundaries is lower than
minsol
, a complete enumeration of the solutions is performed instead of running the algorithm. Every set of boundaries is tried, which ensures that the global minimum is reached. The default value ofminsol
is 10 000. This parameter has to take a value between 2 and 2 000 000.idopti
A character string for Kozak's algorithm only: this argument determines which stratum sample sizes are going to be used to calculate the optimization criteria. It can take the value
"nh"
(criteria calculated with the integer sample sizesnh
) or"nhnonint"
(criteria calculated with the non-integer sample sizesnhnonint
). The default value is"nh"
since it gives slightly better results thanidopti="nhnonint"
and also because the integer sample sizes are the ones used in practice. When a complete enumeration is performed, the criteria is automatically calculated with integer stratum sample sizes. Note: Prior to version 2.1-0 of the package stratification, only the optionidopti="nhnonint"
was available (in fact the algorithm parameteridopti
did not exist).minNh
A numeric for Kozak's algorithm only: the minimum number of units required in each sampled stratum (no restriction is put on the take-none stratum, if included).
minNh
must be greater or equal to 2, which is the default.maxstep
A numeric for Kozak's algorithm only: the maximal step for boundary modification (see the algorithm description above). The default is
Nu/10
, rounded up and truncated to 100 (Nu
is the number of unique values in thex
-vector from which units in the certainty stratum, if any, heve been removed). This default value is, for most populations, much larger than the initial suggestion by Kozak (2004) of a integer no bigger than 5. Our numerical experiments showed us that amaxstep
value of about 3 (package stratification initial default value) is adequate when the optimization criteria is calculated with non-integer stratum sample sizes (opti.nhnonint
). However, when this criteria is calculated with integer stratum sample sizes (opti.nh
), the algorithm never do worst and sometimes reaches an even lower optimization criteria with a largermaxstep
. The parametermaxstill
must be increased consequently (see below). The downside of largemaxstep
andmaxstill
values is that many iterations are needed for the algorithm to converge, making it longer to run. We implemented a solution to this problem. This solution is based on the fact that when the algorithm is close to the solution, it can only accept boundaries modifications with small steps. The large steps are useful during the first iterations of the algorithm to let the boundaries move freely, but it becomes useless when convergence is close. Therefore, the values ofmaxstep
andmaxstill
are brought down to 3 and 50, respectively, when new boundaries are accepted under these circumstances : the step in the boundary modification is lower than 3 in absolute value, more than 50 iterations without change happened before this change, the relative change in the optimization criteria is small. This rule is arbitrary, but it seems to work well so far. It gives results as good as without changing the values ofmaxstep
andmaxstill
, but it runs much faster because it needs less iterations to converge.maxstill
A numeric for Kozak's algorithm only: the maximal number of iterations without a change in the boundaries (see the algorithm description above). The default is
maxstep
*10, bounded between 50 and 500. It depends on the value ofmaxstep
because we increased the default value of this parameter by making it a function of the number of unique values in thex
-vector (see above). The upper bound of 500 ensures that the algorithm does not have to make too much iterations (which take time) to converge.rep
A numeric for Kozak's algorithm only: the number of repetitions of the algorithm (see the algorithm description above). The default is 5. Since version 2.2-0 of the package stratification, the option
rep="change"
has been replaced by the optiontrymany=TRUE
.trymany
A logical for Kozak's algorithm only: if
trymany=TRUE
, three sets of initial boundaries are tried instead of one. These boundaries are cumulative root frequency (Dalenius and Hodges, 1959), geometric (Gunning and Horgan, 2004) and robust ones. Robust boundaries were created in order to have boundaries respecting as often as possible the conditions imposed on boundaries in Kozak's algorithm: strata containing at leastminNh
units and positivenh
. Robust boundaries give an empty take-none stratum if such a stratum is requested, take-all strata as small as possible, and take-some strata with approximately the same number of uniqueX
-values. Whentrymany=TRUE
, the same values for the other algorithm parameters (the one given by the user or otherwise the default) are used for every set of initial boundaries. Iftrymany=FALSE
, only the given or default (see above) initial boundaries are used. If a user give aninitbh
argument, but also selecttrymany=TRUE
, the later is ignored and only theinitbh
initial boundaries are used.
CONVERGENCE
It is possible for the algorithm not to converge. In this case, a warning is printed. The only possible cause of non-convergence for Kozak's algorithm is to reach the maximum number of iterations before the stopping rule has been met. For the algorithm to converge, the argument maxiter
has to be increased. On the other hand, non-convergence of Sethi's algorithm has several possible causes:
a division by zero caused by an empty stratum can occur ;
a division by zero caused by a 0 stratum variance can occur ;
the square root of a negative number (negative discriminant) can occur ;
the maximum number of iterations can be reached (often because the algorithm is caught in a loop of non-optimal sets of boundaries).
If a non-convergence happens, the user can try to change the initial boundaries or the model parameters. The user can also choose to work with Kozak's algorithm which should converge given an appropriate maximum number of iterations.
WARNING ABOUT LOCAL MINIMA
Let's note that even if the algorithm converges, it is not guaranteed that it has reached a global minimum. Local minimum can occur both with Sethi's and Kozak's algorithms (Rivest and Baillargeon, 2009).
Value
bh |
A vector of the |
Nh |
A vector of length |
nh |
A vector of length |
n |
The total sample size ( |
nhnonint |
A vector of length |
certain.info |
A vector giving statistics for the certainty stratum (see |
opti.criteria |
The final value of the criteria to optimize : either |
meanh |
A vector of length |
varh |
A vector of length |
mean |
A numeric: the anticipated global mean value of |
RMSE |
A numeric: the root mean squared error (or standard error if |
RRMSE |
A numeric: the anticipated relative root mean squared error (or coefficient of variation if |
relativebias |
A numeric: the anticipated relative bias of the estimator, i.e. ( |
propbiasMSE |
A numeric: the proportion of the MSE attributable to the bias of the estimator, i.e. ( |
stratumID |
A factor, having the same length as the input |
takeall |
The number of take-all strata in the final solution. Note: It is possible that |
call |
The function call (object of class "call"). |
date |
A character string that contains the system date and time when the function ended. |
args |
A list of all the argument values input to the function or set by default. |
iter.detail |
Only if an iterative algorithm is run (thus not int the output when a complete enumeration is performed): a data frame giving, for each iteration of the algorithm: |
niter |
Only if an iterative algorithm is run: the total number of iterations before convergence of the algorithm. For Kozak's algorithm, this number includes all the iterations, even the ones with discarded boundaries. |
converge |
Only if an iterative algorithm is run: a logical indicating if the algorithm has converged (see Details). |
run.detail |
Only for Kozak's algorithm when it is run more than once: a data frame giving the final solution for every run of the algorithm. It contains, for each run: |
run.min |
Only for Kozak's algorithm when it is run more than once: the identification number of every algorithm run leading to the optimal plan, i.e. the rows of |
sol.detail |
Only for Kozak's algorithm when the number of possible sets of boundaries is lower than |
sol.min |
Only for Kozak's algorithm when the number of possible sets of boundaries is lower than |
nsol |
Only for Kozak's algorithm : the number of possible sets of boundaries (see Details). |
Author(s)
Sophie Baillargeon Sophie.Baillargeon@mat.ulaval.ca and
Louis-Paul Rivest Louis-Paul.Rivest@mat.ulaval.ca
References
Baillargeon, S., Rivest, L.-P., Ferland, M. (2007). Stratification en enquetes entreprises : Une revue et quelques avancees. Proceedings of the Survey Methods Section, 2007 SSC Annual Meeting.
Baillargeon, S. and Rivest, L.-P. (2009). A general algorithm for univariate stratification. International Stratification Review, 77(3), 331-344.
Baillargeon, S. and Rivest L.-P. (2011). The construction of stratified designs in R with the package stratification. Survey Methodology, 37(1), 53-65.
Dalenius, T. and Hodges, J.L., Jr. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88-101.
Gunning, P. and Horgan, J.M. (2004). A new algorithm for the construction of stratum boundaries in skewed populations. Survey Methodology, 30(2), 159-166.
Gunning, P. and Horgan, J.M. (2007). Improving the Lavallee and Hidiroglou algorithm for stratification of skewed populations. Journal of Statistical Computation and Simulation, 77(4), 277-291.
Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. Statistics in Transition, 6(5), 797-806.
Lavallee, P. and Hidiroglou, M.A. (1988). On the stratification of skewed populations. Survey Methodology, 14, 33-43.
Rivest, L.-P. (1999). Stratum jumpers: Can we avoid them? Proceedings of the Section on Survey Methods Research of the American Statistical Association, 64-72.
Rivest, L.-P. (2002). A generalization of the Lavallee and Hidiroglou algorithm for stratification in business surveys. Survey Methodology, 28(2), 191-198.
Sethi, V. K. (1963). A note on optimum stratification of populations for estimating the population means. The Australian Journal of Statistics, 5, 20-33.
See Also
print.strata
, plot.strata
, strata.cumrootf
, strata.geo
Examples
#########################################################
### Sethi's algorithm versus Kozak's algorithm
# LACK OF CONVERGENCE
# Here is an example of numerical difficulties met with Sethi but not with Kozak
Sethi <- strata.LH(x=UScities, CV=0.01, Ls=3, alloc=c(0.35,0.35,0), takenone=0, takeall=1,
rh=1, model="loglinear", model.control=list(beta=1, sig2=0.5, ph=0.85),
algo="Sethi", algo.control=list(maxiter=20))
Sethi
Sethi$iter.detail[1:5,]
# Kozak's algorithm with arithmetic initial boundaries
# (default initial boundaries for Sethi's algorithm)
Kozak<-strata.LH(x=UScities, initbh=c(18,27), CV=0.01, Ls=3, alloc=c(0.35,0.35,0),
takenone=0, takeall=1, rh=1, model="loglinear",
model.control=list(beta=1, sig2=0.5, ph=0.85), algo="Kozak")
Kozak
Kozak$iter.detail[Kozak$iter.detail[,"run"]==Kozak$run.min[1],]
# Looking at the iteration history for the optimization with Sethi and Kozak,
# we see that the initial boundaries are very close from the optimal ones.
# Kozak reaches very quickly a minimum. However, Sethi increases n instead of
# minimizing it and afterwards it oscillates between two sets of boundaries
# without converging.
# LOCAL MINIMUM
# In this example, Sethi's algorithm obviously reaches a local minimum since Kozak
# proposes a much smaller n.
Sethi<-strata.LH(x=UScities, CV=0.01, Ls=4, alloc=c(0.5,0,0), takenone=0, takeall=1,
rh=0.85, model="loglinear", model.control=list(beta=1.1, sig2=0, ph=1),
algo="Sethi")
Sethi
Kozak<-strata.LH(x=UScities, CV=0.01, Ls=4, alloc=c(0.5,0,0), takenone=0, takeall=1,
rh=0.85, model="loglinear", model.control=list(beta=1.1, sig2=0, ph=1),
algo="Kozak")
Kozak
#########################################################
### Take-none stratum
# As illustrated in the following example (presented in Baillargeon and Rivest 2011),
# it is sometimes beneficial to include a take-none stratum in the stratified design
# (possibly with a bias penalty lower than 1).
notn <- strata.LH(x=MRTS, CV=0.1, Ls=3, alloc=c(0.5,0,0.5))
notn
tn1 <- strata.LH(x=MRTS, CV=0.1, Ls=3, alloc=c(0.5,0,0.5), takenone=1)
tn1
tn0.5 <- strata.LH(x=MRTS, CV=0.1, Ls=3, alloc=c(0.5,0,0.5), takenone=1, bias.penalty=0.5)
tn0.5
# Note: Sethi does not converge here. This occurs often with a take-none stratum.
tn1.Sethi <- strata.LH(x=MRTS, CV=0.1, Ls=3, alloc=c(0.5,0,0.5), takenone=1, algo="Sethi")
tn1.Sethi
#########################################################
### Automatic detection of a take-all stratum
# # As in the following example, a beneficial take-all stratum is not always detected
# # by the algorithm. Therefore, it is often a good idea to obtain a stratified design
# # with and without a take-all stratum and to compare the results.
# sans<-strata.LH(x=UScities, n=300, Ls=3, alloc=c(0.35,0.35,0), takeall=0,
# model="loglinear", model.control=list(beta=0.9, sig2=0, ph=1),
# algo.control=list(trymany=FALSE))
# sans
# avec<-strata.LH(x=UScities, n=300, Ls=3, alloc=c(0.35,0.35,0), takeall=1,
# model="loglinear", model.control=list(beta=0.9,sig2=0,ph=1),
# algo.control=list(trymany=FALSE))
# avec
# # We see that for the target sample size, the anticipated CV is 17% lower with a
# # take-all stratum (0.01081053 vs 0.009002313).
#########################################################
### Models for the discrepancy between Y and X
# LOGLINEAR MODEL WITH MORTALITY: see help(Sweden)
# HETEROSCEDASTIC LINEAR MODEL: We fix gamma=2.
beta.lin <- mean(Sweden$RMT85/Sweden$REV84)
sig2.lin <- var(Sweden$RMT85/Sweden$REV84)
strata.LH(x=Sweden$REV84, CV=0.05, Ls=5, alloc=c(0.5,0,0.5), takeall=1,
model="linear", model.control=list(beta=beta.lin, sig2=sig2.lin, gamma=2),
algo="Kozak")
# Verification of equation 3.6 of Rivest (2002)
beta.log <- 1
sig2.log <- log(1+sig2.lin/beta.lin^2)
strata.LH(x=Sweden$REV84, CV=0.05, Ls=5, alloc=c(0.5,0,0.5), takeall=1,
model="loglinear", model.control=list(beta=beta.log, sig2=sig2.log, ph=1),
algo="Kozak")
# The two models give the same stratified design.
# RANDOM REPLACEMENT MODEL: example in Rivest (1999)
strata.LH(x=Sweden$REV84, CV=0.05, Ls=5, alloc=c(0.5,0,0.5), takeall=1,
model="none", algo="Sethi") # Table 1 with a different rounding of the nh's
e0 <- strata.LH(x=Sweden$REV84, CV=0.05, Ls=5, alloc=c(0.5,0,0.5), takeall=1,
model="none", algo="Kozak")
e0 # Better than Sethi
var.strata(e0, y=Sweden$RMT85)
e0.001 <- strata.LH(x=Sweden$REV84, CV=0.05, Ls=5, alloc=c(0.5,0,0.5), takeall=1,
model="random", model.control=list(epsilon=0.011), algo="Kozak")
e0.001 # Table 2 part 3 with a different rounding of the nh's
var.strata(e0.001 ,y=Sweden$RMT85)