pertRates {SynthTools} | R Documentation |
Calculate perturbation rates for the overall data set and for specific variables.
Description
This function calculates the overall perturbation rate of an imputed (synthetic) data set, along with the perturbation rate of each requested variable.
Usage
pertRates(obs_data, new_data, imp_vars, desc = FALSE, sig = 4)
Arguments
obs_data |
The original (observed) data set against which new_data will be compared, of the type "data.frame". |
new_data |
The fully or partially synthetic data set to be compared to the observed data, of the type "data.frame". |
imp_vars |
The name of the imputed variable, or a vector of names of the imputed variables, to be used in the overall perturbation rate calculation. |
desc |
Whether or not the variable perturbation rates should be output in descending rate order. Defaults to FALSE. |
sig |
The number of significant digits desired for the overall perturbation rate. Defaults to 4. |
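The sig argument counts significant digits, which is not the same as a number of decimal places. The base R snippet below illustrates the difference on a made-up rate. Whether pertRates itself calls signif() internally is an assumption here; the snippet only shows what "4 significant digits" means in R.

```r
# Illustrative value only; not taken from any real data set.
rate <- 12.34567

# Four significant digits: digits are counted from the first non-zero digit.
sig4 <- signif(rate, 4)   # 12.35

# One decimal place, for comparison.
dec1 <- round(rate, 1)    # 12.3

print(c(sig4 = sig4, dec1 = dec1))
```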
Details
A record in a data set is considered "perturbed" when at least one of its values for the imputed variables differs from the corresponding value in the observed data. The overall perturbation rate is therefore the number of perturbed records divided by the total number of records in the data set.
The variable perturbation rate is the proportion of records in which the value of a given variable differs from the value in the observed data set.
This function was developed to make research on synthetic data utility easier by calculating perturbation rates quickly.
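The two definitions above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper name pert_rates_sketch and the toy data frames are hypothetical, and the sketch assumes that both data sets hold the same records in the same row order and contain no missing values.

```r
# Hypothetical helper, NOT part of SynthTools: a base-R sketch of the
# definitions of overall and variable perturbation rates.
pert_rates_sketch <- function(obs_data, new_data, imp_vars) {
  # Assumption: rows of the two data sets correspond one to one.
  stopifnot(nrow(obs_data) == nrow(new_data))

  # TRUE wherever a synthetic value differs from the observed value.
  # Values are compared as character strings so that factors with
  # different level sets are compared by label.
  differs <- sapply(imp_vars, function(v) {
    as.character(obs_data[[v]]) != as.character(new_data[[v]])
  })
  # Force a records-by-variables matrix, even for a single row or variable.
  differs <- matrix(differs, nrow = nrow(obs_data),
                    dimnames = list(NULL, imp_vars))

  list(
    # Share of records with at least one changed value, in percent.
    overall = mean(rowSums(differs) > 0) * 100,
    # Share of changed values per variable, in percent.
    variable = colMeans(differs) * 100
  )
}

# Toy data, invented for illustration only.
obs <- data.frame(age = c("a", "b", "a", "b"), vet = c("y", "n", "n", "y"))
syn <- data.frame(age = c("a", "a", "a", "b"), vet = c("y", "y", "n", "n"))

res <- pert_rates_sketch(obs, syn, c("age", "vet"))
res$overall    # 50: records 2 and 4 are perturbed
res$variable   # age = 25, vet = 50
```

Here record 2 differs in both variables and record 4 differs only in vet, so two of the four records are perturbed.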
Value
Reports the overall perturbation rate of the synthetic data set and the perturbation rates of the specified variables as percentages, rounded to 0.1. The results are also returned as a list with the following components:
overall |
The overall perturbation rate. |
variable |
A vector of variable perturbation rates. |
Examples
# PPA is the observed data set; PPAps2 is a partially synthetic data set derived from it.
# age17plus, marriage, and vet are three categorical variables in both data sets.
pertRates(PPA, PPAps2, c("age17plus", "marriage", "vet"))