R: Compare data frames and Plot Differences

uni_compare {sampcompR}

R Documentation

Compare data frames and Plot Differences

Description

Returns data or a plot showing the difference of two or more data frames The differences are calculated on the base of differing metrics, chosen in the funct argument. All used data frames must contain at least one column named equal in all data frames, that has equal values.

Usage

uni_compare(
  dfs,
  benchmarks,
  variables = NULL,
  nboots = 2000,
  funct = "rel_mean",
  data = TRUE,
  type = "comparison",
  legendlabels = NULL,
  legendtitle = NULL,
  colors = NULL,
  shapes = NULL,
  summetric = "rmse2",
  label_x = NULL,
  label_y = NULL,
  plot_title = NULL,
  varlabels = NULL,
  name_dfs = NULL,
  name_benchmarks = NULL,
  summet_size = 4,
  silence = TRUE,
  conf_level = 0.95,
  conf_adjustment = NULL,
  weight = NULL,
  id = NULL,
  strata = NULL,
  weight_bench = NULL,
  id_bench = NULL,
  strata_bench = NULL,
  adjustment_weighting = "raking",
  adjustment_vars = NULL,
  raking_targets = NULL,
  post_targets = NULL,
  ndigits = 3
)

Arguments

`dfs`	A character vector containing the names of data frames to compare against the benchmarks.
`benchmarks`	A character vector containing the names of benchmarks to compare the data frames against. The vector must either be the same length as `dfs`, or length 1. If it has length 1 every df will be compared against the same benchmark. Benchmarks can either be the name of data frames or the name of a list of tables. The tables in the list need to be named as the respective variables in the data frame of comparison.
`variables`	A character vector containing the names of the variables for the comparison. If NULL, all variables named similarly in both the `dfs` and the benchmarks will be compared. Variables missing in one of the data frames or the benchmarks will be neglected for this comparison.
`nboots`	The number of bootstraps used to calculate standard errors. Must either be >2 or 0. If >2 bootstrapping is used to calculate standard errors with `nboots` iterations. If 0, SE is calculated analytically. We do not recommend using `nboots` =0 because this method is not yet suitable for every `funct` used and every method. Depending on the size of the data and the number of bootstraps, `uni_compare` can take a while.
`funct`	A character string, indicating the function to calculate the difference between the data frames. Predefined functions are: `"d_mean"`, `"ad_mean"` A function to calculate the (absolute) difference in mean of the variables in `dfs` and benchmarks with the same name. Only applicable for metric variables. `"d_prop"`, `"ad_prop"` A function to calculate the (absolute) difference in proportions of the variables in `dfs` and benchmarks with the same name. Only applicable for dummy variables. `"rel_mean"`, `"abs_rel_mean"` A function to calculate the (absolute) relative difference in mean of the variables in `dfs` and benchmarks with the same name. #' For more information on the formula for difference and analytic variance, see Felderer et al. (2019). Only applicable for metric variables. `"rel_prop"`, `"abs_rel_prop"` A function to calculate the (absolute) relative difference in proportions of the variables in `dfs` and benchmarks with the same name. It is calculated similar to the relative difference in mean (see Felderer et al., 2019), however the default label for the plot is different. Only applicable for dummy variables.
`data`	If TRUE, a uni_compare_object is returned, containing results of the comparison.
`type`	Define the type of comparison. Can either be `"comparison"` or `"nonresponse"`.
`legendlabels`	A character string or vector of strings containing a label for the legend.
`legendtitle`	A character string containing the title of the legend.
`colors`	A vector of colors, that is used in the plot for the different comparisons.
`shapes`	A vector of shapes applicable in `ggplot2::ggplot2()`, that is used in the plot for the different comparisons.
`summetric`	If `"avg1"`, `"mse1"`, `"rmse1"`, or `"R"` the respective measure is calculated for the biases of each survey. The values `"mse1"` and `"rmse1"` lead to similar results as in `"mse2"` and `"rmse2"`, with slightly different visualization in the plot. If `summetric = NULL`, no summetric will be displayed in the Plot. When `"R"` is chosen, also `response_identificator` is needed.
`label_x`, `label_y`	A character string or vector of character strings containing a label for the x-axis and y-axis.
`plot_title`	A character string containing the title of the plot.
`varlabels`	A character string or vector of character strings containing the new names of variables, also used in plot.
`name_dfs`, `name_benchmarks`	A character string or vector of character strings containing the new names of the `dfs` and `benchmarks`, that is also used in plot.
`summet_size`	A number to determine the size of the displayed `summetric` in the plot.
`silence`	If `silence = FALSE` a warning will be displayed, if variables a re excluded from either the data frame or benchmark, for not existing in both.
`conf_level`	A numeric value between zero and one to determine the confidence level of the confidence interval.
`conf_adjustment`	If `conf_adjustment = TRUE` the confidence level of the confidence interval will be adjusted with a Bonferroni adjustment, to account for the problem of multiple comparisons.
`weight`, `weight_bench`	A character vector determining variables to weight the `dfs` or `benchmarks`. They have to be part of the respective data frame. If only one character is provided, the same variable is used to weigh every `df` or `benchmark`. If a weight variable is provided also an `id` variable is needed.For weighting, the `survey` package is used.
`id`, `id_bench`	A character vector determining `id` variables used to weigh the `dfs` or `benchmarks` with the help of the `survey` package. They have to be part of the respective data frame. If only one character is provided, the same variable is used to weigh every `df` or `benchmark`.
`strata`, `strata_bench`	A character vector determining strata variables used to weigh the `dfs` or `benchmarks` with the help of the `survey` package. They have to be part of the respective data frame. If only one character is provided, the same variable is used to weight every `df` or `benchmark`.
`adjustment_weighting`	A character vector indicating if adjustment weighting should be used. It can either be `"raking"` or `"post_start"`.
`adjustment_vars`	Variables used to adjust the survey when using raking or post stratification.
`raking_targets`	A list of raking targets that can be given to the rake function of `rake`, to rake the `dfs`.
`post_targets`	A list of post-stratification targets that can be given to the `postStratify` function, to post-stratify the `dfs`.
`ndigits`	The number of digits to round the numbers in the plot.

Value

A plot based on ggplot2::ggplot2() (or data frame if data==TRUE) which shows the difference between two or more data frames on predetermined variables, named identical in both data frames.

References

Felderer, B., Kirchner, A., & Kreuter, FALSE. (2019). The Effect of Survey Mode on Data Quality: Disentangling Nonresponse and Measurement Error Bias. Journal of Official Statistics, 35(1), 93–115. https://doi.org/10.2478/jos-2019-0005

Examples


## Get Data for comparison
require(wooldridge)
card<-wooldridge::card

black<-wooldridge::card[wooldridge::card$black==1,]
north<-wooldridge::card[wooldridge::card$south==0,]
white<-wooldridge::card[wooldridge::card$black==0,]
south<-wooldridge::card[wooldridge::card$south==1,]

## use the function to plot the data 
univar_comp<-sampcompR::uni_compare(dfs = c("north","white"),
                                    benchmarks = c("south","black"),
                                    variables= c("age","educ","fatheduc","motheduc","wage","IQ"),
                                    funct = "abs_rel_mean",
                                    nboots=200,
                                    summetric="rmse2",
                                    data=FALSE)

 univar_comp

[Package sampcompR version 0.2.1 Index]