selMtc.by.unc {StatMatch} | R Documentation |
Identifies the best combination of matching variables in reducing uncertainty in estimating the contingency table Y vs. Z.
Description
This function identifies the “best” subset of matching variables in terms of reduction of uncertainty when estimating relative frequencies in the contingency table Y vs. Z. The sequential procedure presented in D'Orazio et al. (2017 and 2019) is implemented. This procedure avoids exploring all the possible combinations of the available X variables as in Fbwidths.by.x.
Usage
selMtc.by.unc(tab.x, tab.xy, tab.xz, corr.d=2,
              nA=NULL, nB=NULL, align.margins=FALSE)
Arguments
tab.x |
An R table crossing the X variables. This table must be obtained by using the function |
tab.xy |
An R table of X vs. Y variable. This table must be obtained by using the function A single categorical Y variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in |
tab.xz |
An R table of X vs. Z variable. This table must be obtained by using the function A single categorical Z variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in |
corr.d |
Integer, indicates the penalty that should be introduced in estimating the uncertainty by means of the average width of cell bounds. When |
nA |
Integer, sample size of file A used to estimate |
nB |
Integer, sample size of file B used to estimate |
align.margins |
Logical (default |
Details
This function follows the sequential procedure described in D'Orazio et al. (2017, 2019) to identify the combination of common variables most effective in reducing uncertainty when estimating the contingency table Y vs. Z. Initially, the available Xs are ordered according to the reduction of the average width of the uncertainty bounds obtained when conditioning on each of them. Then, in each step, one of the remaining X variables is added until the table becomes too sparse; in practice the procedure stops when:
\min\left[ \frac{n_A}{H_{D_m} \times J}, \frac{n_B}{H_{D_m} \times K} \right] \leq 1
For further details see also Fbwidths.by.x.
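The stopping rule above can be illustrated with a few lines of base R. This is only a sketch: `sparse.ratio` is a hypothetical helper, not a function of StatMatch, and it assumes that H_{D_m} is the number of cells in the table obtained by crossing the X variables selected so far, and that J and K are the numbers of categories of Y and Z. The numbers used are illustrative and are not taken from any data set.

```r
# Hypothetical helper (not part of StatMatch): evaluates
# min[ nA / (H * J), nB / (H * K) ] for one candidate set of X variables.
# Assumption: H is the number of cells obtained by crossing the chosen Xs.
sparse.ratio <- function(n.cat.x, J, K, nA, nB) {
  H <- prod(n.cat.x)                  # cells in the crossing of the Xs
  min(nA / (H * J), nB / (H * K))     # smaller of the two units-per-cell ratios
}

# Illustrative values only: three Xs with 2, 2 and 4 categories,
# Y with 2 categories, Z with 7 categories, nA = 70 and nB = 100.
r <- sparse.ratio(n.cat.x = c(2, 2, 4), J = 2, K = 7, nA = 70, nB = 100)
r
r <= 1   # TRUE would mean the table is judged too sparse
```

Under these assumptions, a ratio at or below 1 means that at least one of the two tables has, on average, no more than one unit per cell, which is the point at which the procedure stops adding variables.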
Value
A list with the main outcomes of the procedure.
ini.ord |
Average width of uncertainty bounds when conditioning on each of the available X variables. The variable most effective in reducing uncertainty comes first. This ordering determines the order in which the variables enter the sequential procedure. |
list.xs |
List with the various combinations of the matching variables being considered in each step. |
av.df |
Data.frame with all the relevant information for each combination of X variables. The last row corresponds to the combination of the X variables identified as the best in reducing average width of uncertainty bounds (penalized or not depending on the input argument |
Author(s)
Marcello D'Orazio mdo.statmatch@gmail.com
References
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. International Journal of Approximate Reasoning, 90, pp. 433-440.
D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) Analysis of Integrated Data, Chapman & Hall/CRC (forthcoming).
See Also
Fbwidths.by.x, Frechet.bounds.cat
Examples
library(StatMatch)  # provides Fbwidths.by.x() and selMtc.by.unc()
data(quine, package="MASS") # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
table(quine$c.Days)
# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(1111)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]         # Eth, Sex, Age (Xs) and Lrn (Y)
quine.B <- quine[-lab.A, c(1:3,6)]   # Eth, Sex, Age (Xs) and c.Days (Z)
# compute the tables required by Fbwidths.by.x()
freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)
freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)
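# apply Fbwidths.by.x(), which explores all the combinations of the X variables
bb <- Fbwidths.by.x(tab.x=freq.xA+freq.xB,
                    tab.xy=freq.xy, tab.xz=freq.xz,
                    warn=FALSE)
bb$sum.unc
# apply selMtc.by.unc(), the sequential procedure
cc <- selMtc.by.unc(tab.x=freq.xA+freq.xB,
                    tab.xy=freq.xy, tab.xz=freq.xz, corr.d=0)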
cc$ini.ord
cc$av.df