rec {revengc}    R Documentation

Reverse engineering censored and decoupled data

Description

rec is a statistical approach that estimates what the "true" uncensored bivariate table could have been, given summarized information. The function relies on several assumptions. First, rec uses a Poisson distribution when a user provides only an average, which assumes that the variance and the average of that variable are equal. A more descriptive input, such as a decoupled univariate table or a censored frequency table, can account for dispersion found in the data. However, independence between decoupled variables still has to be assumed when there is no external information about the joint distribution. Because of these assumptions, rec also provides two options for sensitivity analysis: the seed matrix and the method used to update the seed matrix are both arbitrary inputs. For more information, see the Details section below and the vignettes.
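
The two assumptions above can be illustrated with a minimal base R sketch. It is not rec's internal code: the helper name truncated_pois and the simple renormalization over the bounds are assumptions made for illustration. Each average is turned into Poisson probabilities restricted to its bounds, and the independence assumption then gives the joint table as the outer product of the two margins.

```r
# Illustrative sketch only; truncated_pois is a hypothetical helper,
# not a function exported by revengc.
x_avg <- 4.4
x_range <- 1:20
y_avg <- 571.3
y_range <- 520:620

# Poisson probabilities restricted to the bounds and rescaled to sum to 1.
# Using a Poisson distribution implies variance = mean for the variable.
truncated_pois <- function(avg, range) {
  p <- dpois(range, lambda = avg)
  p / sum(p)
}

px <- truncated_pois(x_avg, x_range)
py <- truncated_pois(y_avg, y_range)

# Independence assumption: each joint cell is the product of its margins.
joint <- outer(px, py)
dimnames(joint) <- list(x_range, y_range)
```

Note that restricting the distribution to the bounds shifts its mean slightly away from the supplied average, which is one reason the choice of bounds matters.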

Usage

rec(X, Y, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound, 
    seed.matrix, seed.estimation.method)

Arguments

X

X can be an average, a univariate frequency table, or a censored contingency table. An average must be of class numeric; a table can be a data.frame or a matrix. Y defaults to NULL if X is a censored contingency table. See the Details section below for formatting.

Y

Same description as X, but for the Y variable. X defaults to NULL if Y is a censored contingency table.

Xlowerbound

A numeric value representing the left bound for X (rows in the contingency table). The value must be a non-negative integer and cannot be greater than the lowest category/average value provided for X (e.g. the lower bound cannot be 6 if a table has '< 5' as an X or row category).

Xupperbound

A numeric value representing the right bound for X (rows in the contingency table). The value cannot be less than the highest category/average value provided for X (e.g. the upper bound cannot be 90 if a table has '> 100' as an X or row category).

Ylowerbound

Same description as Xlowerbound, but for Y (columns in the contingency table).

Yupperbound

Same description as Xupperbound, but for Y (columns in the contingency table).

seed.matrix

An initial probability matrix to be updated. If decoupled variables are provided, the default is an Xlowerbound:Xupperbound by Ylowerbound:Yupperbound matrix with interior cells of 1, which are then converted to probabilities. If a censored contingency table is provided, the default is the seedmatrix()$Probabilities output.
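
As a rough illustration of the default for decoupled variables, the sketch below builds a matrix of ones over the bounds and converts it to probabilities. The variable names x_vals, y_vals, and seed are chosen for this example and are not taken from the package source.

```r
# Illustrative sketch of a uniform seed; not copied from revengc.
Xlowerbound <- 1
Xupperbound <- 20
Ylowerbound <- 520
Yupperbound <- 620

x_vals <- Xlowerbound:Xupperbound
y_vals <- Ylowerbound:Yupperbound

# Interior cells of 1, one row per X value and one column per Y value
seed <- matrix(1, nrow = length(x_vals), ncol = length(y_vals),
               dimnames = list(x_vals, y_vals))

# Convert the cells to probabilities so the whole matrix sums to 1
seed <- seed / sum(seed)
```

A matrix built this way could be edited cell by cell and passed as seed.matrix for a sensitivity analysis, provided its dimensions still match the bounds.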

seed.estimation.method

A character string indicating which method is used for updating the seed.matrix. The choices are: "ipfp", "ml", "chi2", or "lsq". Default is "ipfp".

Details

Overview:
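The rec function handles four cases, distinguished by the type of input: Case I, two averages; Case II, one average and one univariate frequency table; Case III, two univariate frequency tables; Case IV, a censored contingency table.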

Table Format:
The tables for Cases II and III have restrictions. The univariate frequency table, which can be a data.frame or a matrix, must have two columns and n rows. The categories must be in the first column, with frequencies or probabilities in the second column. Row names should never be placed in this table (the default row names should always be 1:n). Column names can be any character string. Only the following symbols are accepted for censored data: <= or LE (less than or equal to), < or L (less than), >=, +, or GE (greater than or equal to), > or G (greater than), and - or I (an interval between two values). Note that less than or equal to (<= and LE) is not equivalent to less than (< and L), and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.

Below are three correctly formatted tables.

Category  Frequency
<=6           11800
7-12          57100
13-19         14800
20+            3900

Category  Frequency
LE6           11800
7I12          57100
13I19         14800
GE20           3900

Category  Frequency
<7            11800
7I12          57100
13-19         14800
>=20           3900
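
To make the distinction between the symbols concrete, the sketch below builds one of the tables above as a data.frame and converts each category label into a closed interval. The helper category_to_interval is hypothetical and written only for this illustration; it assumes integer-valued categories and is not how revengc parses its input.

```r
# Hypothetical helper for illustration; not part of revengc.
# Assumes integer-valued categories, so "<7" ends at 6 and ">19" starts at 20.
category_to_interval <- function(category, lower, upper) {
  num <- function(s) as.numeric(gsub("[^0-9.]", "", s))
  if (grepl("^(<=|LE)", category)) {
    c(lower, num(category))            # less than or equal to
  } else if (grepl("^(<|L)", category)) {
    c(lower, num(category) - 1)        # strictly less than
  } else if (grepl("^(>=|GE)", category) || grepl("\\+$", category)) {
    c(num(category), upper)            # greater than or equal to
  } else if (grepl("^(>|G)", category)) {
    c(num(category) + 1, upper)        # strictly greater than
  } else {
    as.numeric(strsplit(category, "-|I")[[1]])  # closed interval a-b or aIb
  }
}

# A correctly formatted univariate table: two columns, default row names
tab <- data.frame(Category = c("<7", "7-12", "13-19", ">=20"),
                  Frequency = c(11800, 57100, 14800, 3900),
                  stringsAsFactors = FALSE)

# One row per category: closed lower and upper end, given bounds of 1 and 30
intervals <- t(sapply(tab$Category, category_to_interval,
                      lower = 1, upper = 30))
```

Under these assumptions "<7" and "<=6" describe the same closed interval, while "<6" and "<=6" do not, which is why the two kinds of symbols cannot be used interchangeably.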

The table for Case IV also has restrictions. The censoring symbols must follow the requirements listed above. The table can be a data.frame or a matrix. The column names should be the Y category values. The first column should contain the X category values, and the row names can be arbitrary. The interior of the table holds the X * Y cross tabulation values: nonnegative frequencies or probabilities if seed.estimation.method is "ipfp", or strictly positive values if the method is "ml", "lsq", or "chi2". The row and column marginal totals corresponding to the X and Y category values must also be placed in this table. The top left, top right, and bottom left corners of the table should be NA or blank. The bottom right corner can be the total cross tabulation sum, NA, or blank. The table below is a correctly formatted example.

NA    <20  20-30  >30  NA
<5     18     19    8  45
5-9    13      8   12  33
>=10    7      5   10  22
NA     38     32   30  NA
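
One way to assemble a table in this layout is sketched below. The object names cells, with_margins, and case4 are illustrative, and the marginal totals are computed from the interior cells rather than typed by hand so that they always agree with the cross tabulations.

```r
# Illustrative construction of a Case IV input; object names are arbitrary.
cells <- matrix(c(18, 19,  8,
                  13,  8, 12,
                   7,  5, 10),
                nrow = 3, byrow = TRUE)

# Append row totals as the last column and column totals as the last row;
# the bottom right corner is left as NA
with_margins <- rbind(cbind(cells, rowSums(cells)),
                      c(colSums(cells), NA))

# First column holds the X categories; the bottom left corner is NA
case4 <- data.frame(c("<5", "5-9", ">=10", NA), with_margins,
                    stringsAsFactors = FALSE)

# Column names hold the Y categories; the top corners are NA
colnames(case4) <- c(NA, "<20", "20-30", ">30", NA)
```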

Bounds:
Ideally, the four bounds should be chosen based on prior knowledge and expert elicitation, but they can also be selected intuitively with a brute force method. If rec outputs a final contingency table with high probabilities near the edge(s) of the table, then it makes sense to widen the range of the corresponding bound(s). For both the X and Y variables, this means lowering the lower bound, raising the upper bound, or both. The opposite holds true as well: if the final contingency table has very low probabilities near the edge(s) of the table, then a user should narrow the range of the corresponding bound(s).
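
The brute force check described above could be sketched as follows. The helper edge_mass is hypothetical, and the toy probability tables (Poisson margins with averages of 4.4 and 3) stand in for rec output purely for illustration; in practice the probability table returned by rec would be inspected instead.

```r
# Hypothetical helper: probability mass in the first and last row and
# column of a probability table. Not part of revengc.
edge_mass <- function(prob) {
  c(Xlower = sum(prob[1, ]),
    Xupper = sum(prob[nrow(prob), ]),
    Ylower = sum(prob[, 1]),
    Yupper = sum(prob[, ncol(prob)]))
}

# Toy stand-in for a rec result: independent truncated Poisson margins
toy_table <- function(x_range) {
  px <- dpois(x_range, lambda = 4.4)
  px <- px / sum(px)
  py <- dpois(0:10, lambda = 3)
  py <- py / sum(py)
  outer(px, py)
}

narrow <- edge_mass(toy_table(1:8))   # upper X bound close to the average
wide   <- edge_mass(toy_table(1:20))  # upper X bound far from the average

# Compare the mass sitting on the upper X edge under each choice of bound
print(rbind(narrow, wide))
```

A noticeably larger Xupper value for the narrow bounds than for the wide ones suggests the narrow upper bound is cutting off plausible values and should be raised.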

Seed Estimation Methods:
This function uses the mipfp R package, which offers four methods to estimate cross tabulations when provided fixed marginals.

Method                                      Abbreviation
Iterative proportional fitting procedure    ipfp
Maximum likelihood method                   ml
Minimum chi-squared                         chi2
Weighted least squares                      lsq

For a summary of all four methods, please refer to the vignettes and/or the papers by Little and Wu (1991) and Suesse et al. (2017).
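
For intuition about the default method, the following is a minimal two-dimensional iterative proportional fitting sketch in base R. It is not the mipfp implementation that rec calls, the function name ipfp_2d and its arguments are made up for this example, and it assumes the seed has no all-zero rows or columns. The other three methods are not covered here.

```r
# Minimal 2D iterative proportional fitting sketch; ipfp_2d is a
# hypothetical name and this is not the mipfp implementation.
ipfp_2d <- function(seed, row_target, col_target,
                    tol = 1e-10, max_iter = 1000) {
  est <- seed
  for (i in seq_len(max_iter)) {
    # Rescale every row so the row sums match the row targets
    est <- sweep(est, 1, row_target / rowSums(est), `*`)
    # Rescale every column so the column sums match the column targets
    est <- sweep(est, 2, col_target / colSums(est), `*`)
    # Stop once the row sums still match after the column step
    if (max(abs(rowSums(est) - row_target)) < tol) break
  }
  est
}

# Uniform seed with fixed marginal probabilities for X (rows) and Y (columns)
seed <- matrix(1 / 9, nrow = 3, ncol = 3)
row_target <- c(0.45, 0.33, 0.22)
col_target <- c(0.38, 0.32, 0.30)

fitted <- ipfp_2d(seed, row_target, col_target)
```

With a uniform seed the fitted table is simply the product of the margins, which matches the independence assumption; a seed with structure (for example one derived from a censored contingency table) carries that association pattern into the result.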

Value

The output is a list containing an uncensored contingency table of probabilities (rows range over Xlowerbound:Xupperbound and columns range over Ylowerbound:Yupperbound), as well as the row (X) and column (Y) parameters used in making the margins for the mipfp R package.

References

Frederick Novomestky and Saralees Nadarajah (2016). truncdist: Truncated Random Variables. R package version 1.0-2. https://CRAN.R-project.org/package=truncdist

Johan Barthelemy and Thomas Suesse (2018). mipfp: Multidimensional Iterative Proportional Fitting and Alternative Models. R package version 3.2. https://CRAN.R-project.org/package=mipfp

Little, R. J., Wu, M. M. (1991) Models for contingency tables with known margins when target and sampled populations differ. Journal of the American Statistical Association, 86(413): 87-95. doi: https://doi.org/10.2307/2289718

Suesse, T., Namazi-Rad, M., Mokhtarian, P., & Barthelemy, J. (2017). Estimating Cross-Classified Population Counts of Multidimensional Tables: An Application to Regional Australia to Obtain Pseudo-Census Counts, Journal of Official Statistics, 33(4), 1021-1050. doi: https://doi.org/10.1515/jos-2017-0048

Examples

  # provide two averages
  # seed.matrix defaults to a matrix of ones
  # seed.estimation.method defaults to ipfp
  twoaverages.results<-rec(
     X= 4.4,
     Y = 571.3,
     Xlowerbound = 1,
     Xupperbound = 20,
     Ylowerbound = 520,
     Yupperbound = 620)
  
  
  # provide one average and one table
  # create a censored univariate table
  # seed.matrix defaults to a matrix of ones
  # seed.estimation.method defaults to ipfp
  Y.table<-cbind(c("<7", "7-12", "13-19", ">19"),
    c(11800,57100,14800,3900))
  combo.results<-rec(X= 2.3,
     Y = Y.table,
     Xlowerbound = 1,
     Xupperbound = 15,
     Ylowerbound = 1,
     Yupperbound = 30)
   
   
  # provide a censored contingency table 
  contingencytable<-matrix(c(6185,9797,16809,11126,6156,3637,908,147,69,4,
                         5408,12748,26506,21486,14018,9165,2658,567,196,78,
                         7403,20444,44370,36285,23576,15750,4715,994,364,136,
                         4793,17376,44065,40751,28900,20404,6557,1296,555,228,
                         2354,11143,32837,33910,26203,19301,6835,1438,618,245,
                         1060,6038,19256,21298,17774,13864,4656,1039,430,178,
                         273,2521,9110,11188,9626,7433,2608,578,196,112,
                         119,1130,4183,5566,5053,3938,1367,318,119,66,
                         33,388,1707,2367,2328,1972,719,171,68,37,
                         38,178,1047,1672,1740,1666,757,193,158,164),
                           nrow=10,ncol=10, byrow=TRUE)
  rowmarginal<-apply(contingencytable,1,sum)
  contingencytable<-cbind(contingencytable, rowmarginal)
  colmarginal<-apply(contingencytable,2,sum)
  contingencytable<-rbind(contingencytable, colmarginal)
  row.names(contingencytable)[row.names(contingencytable)=="colmarginal"]<-""
  contingencytable<-data.frame(c("1","2","3","4","5","6", "7", "8","9","10+", NA),
    contingencytable)
  colnames(contingencytable)<-c(NA,"<20","20-29","30-39","40-49","50-69","70-99",
                                "100-149","150-199","200-299","300+", NA)

  # the contingencytable input could be put in X or Y (opposing argument = NULL)
  # X = rows and Y = columns 
  # seed.matrix default = repeating the cross tabulations in the censored contingency
  ## table for the newly created and compatible uncensored cross tabulations
  # seed.estimation.method defaults to ipfp
  contingencytable.results<-rec(
     X= contingencytable,
     Xlowerbound = 1,
     Xupperbound = 15,
     Ylowerbound = 10,
     Yupperbound = 310)

[Package revengc version 1.0.4 Index]