selMtc.by.unc {StatMatch}	R Documentation

Identifies the best combination of matching variables for reducing uncertainty when estimating the contingency table Y vs. Z.

Description

This function identifies the “best” subset of matching variables in terms of reduction of uncertainty when estimating relative frequencies in the contingency table Y vs. Z. The sequential procedure presented in D'Orazio et al. (2017 and 2019) is implemented. This procedure avoids exploring all possible combinations of the available X variables, as is done in Fbwidths.by.x.

Usage

selMtc.by.unc(tab.x, tab.xy, tab.xz, corr.d=2, 
              nA=NULL, nB=NULL, align.margins=FALSE) 

Arguments

tab.x

An R table crossing the X variables. This table must be obtained by using the function xtabs or table, e.g.
tab.x <- xtabs(~x1+x2+x3, data=data.all). A minimum number of 3 variables is needed.

tab.xy

An R table crossing the X variables and the Y variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xy <- xtabs(~x1+x2+x3+y, data=data.A).

A single categorical Y variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xy. Usually, it is assumed that the joint distribution of the X variables computed from tab.xy is equal to tab.x (a warning appears if any absolute difference exceeds a small tolerance). Note that when the marginal distribution of X in tab.xy is not equal to that of tab.x, it is possible to align them before computations (see argument align.margins).

tab.xz

An R table crossing the X variables and the Z variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xz <- xtabs(~x1+x2+x3+z, data=data.B).

A single categorical Z variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xz. Usually, it is assumed that the joint distribution of the X variables computed from tab.xz is equal to tab.x (a warning appears if any absolute difference exceeds a small tolerance). Note that when the marginal distribution of X in tab.xz is not equal to that of tab.x, it is possible to align them before computations (see argument align.margins).
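The consistency between the X margins of the input tables can be checked with base R before calling the function. The sketch below uses a small hypothetical data frame (data.A, tab.xy, tab.x are illustrative names, not package objects):

```r
# Hypothetical recipient file with three X variables and one Y variable
data.A <- data.frame(x1 = c("a","a","b","b","a","b"),
                     x2 = c("u","v","u","v","v","u"),
                     x3 = c("p","p","q","q","p","q"),
                     y  = c("y1","y2","y1","y2","y1","y2"))

tab.xy <- xtabs(~ x1 + x2 + x3 + y, data = data.A)
tab.x  <- xtabs(~ x1 + x2 + x3, data = data.A)

# Marginal distribution of the Xs implied by tab.xy (sum over Y)
marg.x <- margin.table(tab.xy, margin = 1:3)

# Largest absolute discrepancy with tab.x (0 here, both come from data.A)
max(abs(marg.x - tab.x))
```

When this discrepancy is not negligible, setting align.margins=TRUE lets the function reconcile the margins before estimation.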

corr.d

Integer, indicates the penalty to be introduced when estimating the uncertainty by means of the average width of cell bounds. When corr.d=1 the penalty is the one introduced in D'Orazio et al. (2017) (i.e. “penalty1” in Fbwidths.by.x). When corr.d=2 (default) the penalty suggested in D'Orazio et al. (2019) is applied (indicated as “penalty2” in Fbwidths.by.x). Finally, no penalty is applied when corr.d=0.

nA

Integer, sample size of file A used to estimate tab.xy. If NULL, it is obtained as the sum of the frequencies in tab.xy.

nB

Integer, sample size of file B used to estimate tab.xz. If NULL, it is obtained as the sum of the frequencies in tab.xz.

align.margins

Logical (default FALSE). When TRUE the distribution of the X variables in tab.xy is aligned with the distribution resulting from tab.x, without affecting the marginal distribution of Y. Similarly, the distribution of the X variables in tab.xz is aligned with the distribution resulting from tab.x, without affecting the marginal distribution of Z. The alignment is performed by running the IPF algorithm as implemented in the function Estimate in the package mipfp. To avoid lack of convergence due to combinations of Xs encountered in one table but not in the other (statistical 0s), a small constant (1e-06) is added to the empty cells of tab.xy and tab.xz before running IPF.
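The raking principle behind this alignment can be illustrated with a minimal hand-rolled IPF loop on a two-way table. This is only a sketch of the idea with made-up counts and margins, not the package's internal code (which relies on mipfp and works with higher-dimensional tables):

```r
# Hypothetical X-by-Y counts (the "seed" to be adjusted)
seed <- matrix(c(10, 5, 3, 2), nrow = 2,
               dimnames = list(x = c("x1", "x2"), y = c("y1", "y2")))
seed[seed == 0] <- 1e-06       # guard against empty cells, as in the function

target.x <- c(12, 8)           # X margin to enforce (as if taken from tab.x)
target.y <- colSums(seed)      # keep the original Y margin untouched

for (i in 1:100) {
  # rescale rows so the X margin matches the target
  seed <- seed * (target.x / rowSums(seed))
  # rescale columns so the Y margin stays at its original value
  seed <- sweep(seed, 2, target.y / colSums(seed), `*`)
}

rowSums(seed)   # ~ target.x
colSums(seed)   # ~ original Y margin
```

Each pass rescales rows to the new X margin and columns back to the original Y margin; with consistent totals the two constraints are quickly satisfied simultaneously.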

Details

This function follows the sequential procedure described in D'Orazio et al. (2017, 2019) to identify the combination of common variables most effective in reducing uncertainty when estimating the contingency table Y vs. Z. Initially, the available Xs are ordered according to the reduction of the average width of uncertainty bounds obtained when conditioning on each of them. Then, at each step, one of the remaining X variables is added until the table becomes too sparse; in practice, the procedure stops when:

min\left[ \frac{n_A}{H_{D_m} \times J}, \frac{n_B}{H_{D_m} \times K} \right] \leq 1

where H_{D_m} is the number of cells in the table crossing the X variables selected up to step m, and J and K are the number of categories of Y and Z, respectively. For further details see also Fbwidths.by.x.
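The stopping rule can be sketched as a small helper (stop.rule is a hypothetical name used only for illustration):

```r
# Sketch of the sparseness-based stopping rule.
# nA, nB: sample sizes of files A and B
# H: number of cells in the table crossing the selected X variables
# J, K: number of categories of Y and of Z
stop.rule <- function(nA, nB, H, J, K) {
  min.av <- min(nA / (H * J), nB / (H * K))
  min.av <= 1    # TRUE -> table too sparse, stop adding X variables
}

# Made-up sizes: 76/(16*7) < 1, so the procedure would stop here
stop.rule(nA = 70, nB = 76, H = 16, J = 2, K = 7)   # TRUE

# With larger samples the same table is not considered too sparse
stop.rule(nA = 1000, nB = 1000, H = 16, J = 2, K = 7)   # FALSE
```

The criterion compares the available sample sizes with the number of cells of the tables X vs. Y and X vs. Z that would have to be estimated at the current step.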

Value

A list with the main outcomes of the procedure.

ini.ord

Average width of uncertainty bounds when conditioning on each of the available X variables. The variable most effective in reducing uncertainty comes first. This ordering determines the order in which the variables are entered in the sequential procedure.

list.xs

List with the various combinations of the matching variables being considered in each step.

av.df

Data.frame with all the relevant information for each combination of X variables. The last row corresponds to the combination of X variables identified as the best in reducing the average width of uncertainty bounds (penalized or not, depending on the input argument corr.d). For each combination of X variables the following additional information is reported: the number of cells (name starting with “nc”); the number of empty cells (name starting with “nc0”); the average relative frequency (name starting with “av.crf”); the sparseness, measured as Cohen's effect size with respect to equiprobability (uniform distribution across cells). Finally, the value of the stopping criterion (“min.av”), the unconditioned average width of uncertainty bounds (“avw”), the penalty term (“penalty”) and the penalized width (“avw.pen”; avw.pen = avw + penalty) are reported.

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. International Journal of Approximate Reasoning, 90, pp. 433-440.

D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) Analysis of Integrated Data, Chapman & Hall/CRC (forthcoming).

See Also

Fbwidths.by.x, Frechet.bounds.cat

Examples


data(quine, package="MASS") #loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
table(quine$c.Days)


# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(1111)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]
quine.B <- quine[-lab.A, c(1:3,6)]

# compute the tables required by Fbwidths.by.x()
freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)

freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

# apply Fbwidths.by.x()
bb <- Fbwidths.by.x(tab.x=freq.xA+freq.xB, 
                           tab.xy=freq.xy,  tab.xz=freq.xz,
                           warn=FALSE)
bb$sum.unc
cc <- selMtc.by.unc(tab.x=freq.xA+freq.xB, 
                           tab.xy=freq.xy,  tab.xz=freq.xz, corr.d=0)
cc$ini.ord
cc$av.df



[Package StatMatch version 1.4.2 Index]