synthetic_new_attribute {synthACS} | R Documentation |
Add a new attribute to a synthetic_micro dataset
Description
Add a new attribute to a synthetic_micro dataset using conditional relationships between the new attribute and existing attributes (e.g., wage rate conditioned on age and education level).
Usage
synthetic_new_attribute(
df,
prob_name = "p",
attr_name = "variable",
conditional_vars = NULL,
sym_tbl = NULL
)
Arguments
df |
An R object of class "synthetic_micro". |
prob_name |
A string specifying the column name of the |
attr_name |
A string specifying the desired name of the new attribute to be added to the data. |
conditional_vars |
A character vector specifying the existing variables, if any, on which
the new attribute (variable) is to be conditioned. Variables must be specified in order.
Defaults to |
sym_tbl |
sym_tbl A |
Value
A new synthetic_micro dataset with class "synthetic_micro".
Details
New synthetic variables are introduced to the existing data via conditional probability. Similar
to derive_synth_datasets, the goal of this function is to generate a joint
probability distribution for an attribute vector and to create synthetic individuals from
this distribution. Although no limit is placed on the number of variables on which to condition,
in practice, data that support more than two or three conditioning variables rarely exist. Other
variables are assumed to be independent of the new attribute.
** There are four different types of conditional/marginal probability models which may be considered for a given new attribute: (1) Independence: each of the variables is assumed to be independent of the others. (2) Pairwise conditional independence: each attribute is assumed to be related to only one other attribute and independent of all others. (3) Conditional independence: attributes can be dependent on some subset of other attributes and independent of the rest. (4) The most general case: all attributes are jointly interrelated.
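The conditional-probability idea described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the data frames `micro` and `cond` and the column `cond_p` are hypothetical names introduced only for this illustration. Each existing row is expanded into one row per level of the new attribute, and its probability is multiplied by the conditional probability of that level.

```r
# Hypothetical micro data: joint probability p over one existing attribute
micro <- data.frame(
  gender = c("male", "female"),
  p      = c(0.4, 0.6),
  stringsAsFactors = FALSE
)

# Hypothetical conditional distribution P(SES | gender)
cond <- data.frame(
  gender = rep(c("male", "female"), each = 3),
  SES    = rep(c("low", "middle", "high"), times = 2),
  cond_p = c(0.1, 0.8, 0.1, 0.05, 0.7, 0.25),
  stringsAsFactors = FALSE
)

# Join on the conditioning variable, then apply
# P(gender, SES) = P(gender) * P(SES | gender)
expanded <- merge(micro, cond, by = "gender")
expanded$p <- expanded$p * expanded$cond_p
expanded$cond_p <- NULL

# The expanded table is still a probability distribution (up to floating point)
all.equal(sum(expanded$p), 1)
```

Because each conditional distribution sums to one within its group, the expanded probabilities continue to sum to one overall.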
Conditioning is implemented via symbol-tables (sym_tbl) to ensure accurate matching between
conditioning variables, new attribute levels, and new attribute probabilities. The symbol table
is constructed such that the key in the symbol-table's key-value pair is the specific set of values for
the conditioning variables. This key is the first N columns of sym_tbl. A
recursive approach is employed to conditionally partition sym_tbl. In this sense, the
*order* in which the conditional variables are supplied matters.
The value is the final two columns of sym_tbl, which are a pair of (A) either counts or percentages
used to specify the probability for the new attribute and (B) the level that the new attribute takes on.
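The key-value layout described above can be sketched in base R as follows. This is an illustration only, not the package's internal code: the helper names `n_cond`, `key_cols`, `val_cols`, and `parts` are hypothetical, and rescaling counts or percentages to probabilities within each partition is an assumption about how the value column is interpreted.

```r
# Hypothetical symbol table with N = 1 conditioning variable
sym_tbl <- data.frame(
  gender   = rep(c("male", "female"), each = 3),
  attr_pct = c(10, 80, 10, 5, 70, 25),
  levels   = rep(c("low", "middle", "high"), times = 2),
  stringsAsFactors = FALSE
)

# The key is the first N columns; the value is the final two columns
n_cond   <- ncol(sym_tbl) - 2
key_cols <- names(sym_tbl)[seq_len(n_cond)]
val_cols <- names(sym_tbl)[n_cond + 1:2]

# Partition on the first key column. With more key columns, the same split
# would be repeated on each piece using the next key column, which is why
# the order of the conditioning variables matters.
parts <- split(sym_tbl[val_cols], sym_tbl[[key_cols[1]]])

# Assumption: rescale counts or percentages to probabilities per partition
parts <- lapply(parts, function(x) {
  x$attr_pct <- x$attr_pct / sum(x$attr_pct)
  x
})
```

Each element of `parts` then holds the levels of the new attribute and their probabilities for one value of the conditioning variable.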
Examples
{
  set.seed(567L)
  # Toy micro dataset: two factor attributes plus a probability column
  df <- data.frame(gender= factor(sample(c("male", "female"), size= 100, replace= TRUE)),
                   edu= factor(sample(c("LT_college", "BA_degree"), size= 100, replace= TRUE)),
                   p= runif(100))
  df$p <- df$p / sum(df$p)  # rescale so the probabilities sum to 1
  class(df) <- c("data.frame", "micro_synthetic")

  # Symbol table with one conditioning variable: key column (gender),
  # then the probability column and the levels of the new attribute
  ST <- data.frame(gender= c(rep("male", 3), rep("female", 3)),
                   attr_pct= c(0.1, 0.8, 0.1, 0.05, 0.7, 0.25),
                   levels= rep(c("low", "middle", "high"), 2))
  df2 <- synthetic_new_attribute(df, prob_name= "p", attr_name= "SES", conditional_vars= "gender",
                                 sym_tbl= ST)

  # Symbol table with two conditioning variables, supplied in order (gender, then edu);
  # edu is left NA for the male rows
  ST2 <- data.frame(gender= c(rep("male", 3), rep("female", 6)),
                    edu= c(rep(NA, 3), rep(c("LT_college", "BA_degree"), each= 3)),
                    attr_pct= c(0.1, 0.8, 0.1, 10, 80, 10, 5, 70, 25),
                    levels= rep(c("low", "middle", "high"), 3))
  df2 <- synthetic_new_attribute(df, prob_name= "p", attr_name= "SES",
                                 conditional_vars= c("gender", "edu"),
                                 sym_tbl= ST2)
}