AME {FLAME}

R Documentation

Almost Matching Exactly (AME) Algorithms for Discrete, Observational Data

Description

Almost Matching Exactly (AME) Algorithms for Discrete, Observational Data

Usage

FLAME(
  data,
  holdout = 0.1,
  C = 0.1,
  treated_column_name = "treated",
  outcome_column_name = "outcome",
  weights = NULL,
  PE_method = "ridge",
  user_PE_fit = NULL,
  user_PE_fit_params = NULL,
  user_PE_predict = NULL,
  user_PE_predict_params = NULL,
  replace = FALSE,
  estimate_CATEs = FALSE,
  verbose = 2,
  return_pe = FALSE,
  return_bf = FALSE,
  early_stop_iterations = Inf,
  early_stop_epsilon = 0.25,
  early_stop_control = 0,
  early_stop_treated = 0,
  early_stop_pe = Inf,
  early_stop_bf = 0,
  missing_data = c("none", "drop", "keep", "impute"),
  missing_holdout = c("none", "drop", "impute"),
  missing_data_imputations = 1,
  missing_holdout_imputations = 5,
  impute_with_treatment = TRUE,
  impute_with_outcome = FALSE
)

DAME(
  data,
  holdout = 0.1,
  treated_column_name = "treated",
  outcome_column_name = "outcome",
  weights = NULL,
  PE_method = "ridge",
  n_flame_iters = 0,
  user_PE_fit = NULL,
  user_PE_fit_params = NULL,
  user_PE_predict = NULL,
  user_PE_predict_params = NULL,
  replace = FALSE,
  estimate_CATEs = FALSE,
  verbose = 2,
  return_pe = FALSE,
  return_bf = FALSE,
  early_stop_iterations = Inf,
  early_stop_epsilon = 0.25,
  early_stop_control = 0,
  early_stop_treated = 0,
  early_stop_pe = Inf,
  early_stop_bf = 0,
  missing_data = c("none", "drop", "keep", "impute"),
  missing_holdout = c("none", "drop", "impute"),
  missing_data_imputations = 1,
  missing_holdout_imputations = 5,
  impute_with_treatment = TRUE,
  impute_with_outcome = FALSE
)

## S3 method for class 'ame'
print(x, digits = getOption("digits"), linewidth = 80, ...)

Arguments

`data`	Data to be matched. Either a data frame or a path to a .csv file to be read into a data frame. Treatment must be described by a logical or binary numeric column with name `treated_column_name`. If supplied, outcome must be described by a column with name `outcome_column_name`. The outcome will be treated as continuous if numeric with more than two values, as binary if a two-level factor or numeric with values 0 and 1 exclusively, and as multi-class if a factor with more than two levels. If the outcome column is omitted, matching will be performed but treatment effect estimation will not be possible. All columns not containing outcome or treatment will be treated as covariates for matching. Covariates are assumed to be categorical and will be coerced to factors, though they may be passed as either factors or numeric; if the former, unused levels will automatically be dropped. If you wish to use continuous covariates for matching, they should be binned prior to matching.
`holdout`	Holdout data to be used to compute predictive error, if `weights` is not supplied. If a numeric scalar between 0 and 1, that proportion of `data` will be made into a holdout set and only the remaining proportion of `data` will be matched. Otherwise, a data frame or a path to a .csv file. The holdout data must contain an outcome column with name `outcome_column_name`; other restrictions on column types are as for `data`. Covariate columns must have the same column names and order as `data`. This data will not be matched. Defaults to 0.1.
`C`	A finite, positive scalar denoting the tradeoff between the balancing factor (BF) and the predictive error (PE) in the FLAME algorithm. Higher C prioritizes more matches and lower C prioritizes not dropping important covariates. Defaults to 0.1.
`treated_column_name`	Name of the treatment column in `data` and `holdout`. Defaults to 'treated'.
`outcome_column_name`	Name of the outcome column in `holdout` and also in `data`, if supplied in the latter. Defaults to 'outcome'.
`weights`	A positive numeric vector representing covariate importances. Supplying this argument prevents PE from being computed as it determines dropping order by forcing covariate subsets with lower weights to be dropped first. The weight of a covariate subset is defined to be the sum of the weights of the constituent covariates. Ties are broken at random.
`PE_method`	Denotes how predictive error (PE) is to be computed. Either a string – one of "ridge" (default) or "xgb" – or a function. If "ridge", ridge regression is used to fit an outcome regression model via `glmnet::cv.glmnet` with default parameters. If "xgb", gradient boosting with a wide range of parameter values to cross-validate is used via `xgboost::xgb.cv` and the best parameters with respect to RMSE (for continuous outcomes) or misclassification rate (for binary/multi-class outcomes) are chosen. In both cases, the default `predict` method is used to generate in-sample predictions. If a function, denotes a user-supplied function that should be used for computing PE. This function must be passed a data frame of covariates as its first argument and a vector of outcome values as its second argument. It must return a vector of in-sample predictions, which, if the outcome is binary or multi-class, must be maximum probability class labels. See below for examples.
`user_PE_fit`	Deprecated; use argument 'PE_method' instead. An optional function supplied by the user that can be used instead of those allowed for by `PE_method` to fit a model for the outcome from the covariates. This function will be passed a data frame of covariates as its first argument and a vector of outcome values as its second argument. See below for examples. Defaults to `NULL`.
`user_PE_fit_params`	Deprecated; use argument 'PE_method' instead. A named list of optional parameters to be used by `user_PE_fit`. Defaults to `NULL`.
`user_PE_predict`	Deprecated; use argument 'PE_method' instead. An optional function supplied by the user that can be used to generate predictions from the output of `user_PE_fit`. As its first argument, must take an object of the type returned by `user_PE_fit` and as its second, a matrix of values for which to generate predictions. When the outcome is binary or multi-class, must return the maximum probability class label. If not supplied, defaults to `predict`.
`user_PE_predict_params`	Deprecated; use argument 'PE_method' instead. A named list of optional parameters to be used by `user_PE_predict`. Defaults to `NULL`.
`replace`	A logical scalar. If `TRUE`, allows the same unit to be matched multiple times, on different sets of covariates. In this case, the balancing factor for `FLAME` is computed by dividing by the total number of treatment (control) units, instead of the number of unmatched treatment (control) units. Defaults to `FALSE`.
`estimate_CATEs`	A logical scalar. If `TRUE`, CATEs for each unit are estimated throughout the matching procedure, which will be much faster than computing them after a call to `FLAME` or `DAME` for very large inputs. Defaults to `FALSE`.
`verbose`	Controls how FLAME displays progress while running. If 0, no output. If 1, only outputs the stopping condition. If 2, outputs the iteration and number of unmatched units every 5 iterations, and the stopping condition. If 3, outputs the iteration and number of unmatched units every iteration, and the stopping condition. Defaults to 2.
`return_pe`	A logical scalar. If `TRUE`, the predictive error (PE) at each iteration will be returned. Defaults to `FALSE`.
`return_bf`	A logical scalar. If `TRUE`, the balancing factor (BF) at each iteration will be returned. Defaults to `FALSE`.
`early_stop_iterations`	A positive integer, denoting an upper bound on the number of matching rounds to be performed. If 1, one round of exact matching is performed before stopping. Defaults to `Inf`.
`early_stop_epsilon`	A nonnegative numeric. If fixed covariate weights are passed via `weights`, then the algorithm will stop before matching on a covariate set whose error is above `early_stop_epsilon`, where in this case the error is defined as: `1 - weight(covariate set matched on) / weight(all covariates)`. Otherwise, if `weights` is `NULL`, the algorithm will stop when FLAME or DAME attempts to drop a covariate set that would raise the PE above (1 + `early_stop_epsilon`) times the baseline PE (the PE before any covariates have been dropped). Defaults to 0.25.
`early_stop_control`, `early_stop_treated`	If the proportion of control, treated units, respectively, that are unmatched falls below this value, the matching algorithm will stop. Default to 0.
`early_stop_pe`	Deprecated. A positive numeric. If FLAME attempts to drop a covariate that would lead to a PE above this value, FLAME stops. Defaults to `Inf`.
`early_stop_bf`	Deprecated. A numeric value between 0 and 2. If FLAME attempts to drop a covariate that would lead to a BF below this value, FLAME stops. Defaults to 0.
`missing_data`	Specifies how to handle missingness in `data`. If 'none' (default), assumes no missing data. If 'drop', effectively drops units with missingness from the data and does not match them (they will still appear in the matched dataset that is returned, however). If 'keep', keeps the missing values in the data; in this case, a unit can only match on sets containing covariates it is not missing. If 'impute', imputes the missing data via `mice::mice`.
`missing_holdout`	Specifies how to handle missingness in `holdout`. If 'none' (default), assumes no missing data; if 'drop', drops units with missingness and does not use them to compute PE; and if 'impute', imputes the missing data via `mice::mice`. In this last case, the PE at an iteration will be given by the average PE across all imputations.
`missing_data_imputations`	Defunct. If `missing_data` = 'impute', one round of imputation will be performed on `data` via `mice::mice`. To view results for multiple imputations, please wrap calls to `FLAME` or `DAME` in a loop. This argument will be removed in a future release.
`missing_holdout_imputations`	If `missing_holdout` = 'impute', performs this many imputations of the missing data in `holdout` via `mice::mice`. Defaults to 5.
`impute_with_treatment`, `impute_with_outcome`	If `TRUE`, use treatment, outcome, respectively, to impute covariates when either `missing_data` or `missing_holdout` is equal to `'impute'`. Default to `TRUE`, `FALSE`, respectively.
`n_flame_iters`	Specifies that this many iterations of FLAME should be run before switching to DAME. This can be used to speed up the matching procedure as FLAME rapidly eliminates irrelevant covariates, after which DAME will make higher quality matches on the remaining variables.
`x`	An object of class `ame`, returned by a call to `FLAME` or `DAME`.
`digits`	Number of significant digits for printing the average treatment effect.
`linewidth`	Maximum number of characters on line; output will be wrapped accordingly.
`...`	Additional arguments to be passed to other methods.
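
The `data` argument asks for continuous covariates to be binned before matching. The following is a minimal base R sketch of one way to do that with `cut`; the data frame `raw`, its column names, and the choice of break points are all hypothetical and are not prescribed by the package.

```r
# Hypothetical data: `age` and `income` are continuous, `smoker` is discrete.
set.seed(1)
raw <- data.frame(
  age     = runif(200, 18, 80),
  income  = rlnorm(200, meanlog = 10, sdlog = 0.5),
  smoker  = rbinom(200, 1, 0.3),
  treated = rbinom(200, 1, 0.5),
  outcome = rnorm(200)
)

# Fixed-width style bins for age; these break points are an arbitrary choice.
raw$age <- cut(raw$age, breaks = c(18, 30, 45, 60, 80), include.lowest = TRUE)

# Quartile bins for income, so each bin holds roughly the same number of units.
raw$income <- cut(raw$income,
                  breaks = quantile(raw$income, probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)

# Both columns are now factors and can be treated as categorical covariates.
str(raw)
```

Coarser bins tend to produce more matches at the cost of matching less similar units, so the number of bins is a modelling decision rather than a technical one.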

Value

An object of class ame, which by default is a list of 4 entries:

data

The original data frame with several modifications:

An extra logical column, data$matched, that indicates whether or not a unit was matched.
An extra numeric column, data$weight, that denotes on how many different sets of covariates a unit was matched. This will only be greater than 1 when replace = TRUE.
The columns denoting treatment and outcome will be moved after all covariate columns.
If replace is FALSE, a column containing a matched group identifier for each unit.
If estimate_CATEs = TRUE, a column containing the CATE estimate for each unit.

MGs

A list whose i'th entry contains the indices of units in the main matched group of the i'th unit.

cov_sets

A list whose i'th entry contains the set of covariates not matched on in the i'th iteration.

info

A list containing miscellaneous information about the data and matching specifications. Primarily for use by *.ame methods.

Introduction

FLAME and DAME are matching algorithms for observational causal inference on data with discrete (categorical) covariates. They match units that share identical values of certain covariates, as follows. The algorithms first make any possible exact matches; that is, they match units that share identical values of all covariates (this is possible because covariates are discrete). They then iteratively drop a set of covariates and make any possible matches on the remaining covariates, until a stopping rule is met. For each unit, DAME solves an optimization problem that finds the highest quality set of covariates the unit can be matched to others on, where quality is determined by how well that set of covariates predicts the outcome. FLAME approximates the solution to the problem solved by DAME; at each step, it drops the covariate leading to the smallest drop in match quality MQ, defined as MQ = C · BF - PE. Here, PE denotes the predictive error, which measures how important the dropped covariate is for predicting the outcome. The balancing factor BF measures the number of matches formed by dropping that covariate. In this way, FLAME encourages matching on covariates more important to the outcome and also making many matches. The hyperparameter C controls the balance between these two objectives. In both cases, a machine learning algorithm trained on a holdout dataset is responsible for learning the quality / importance of covariates. For more details on the algorithms, please see the vignette, the FLAME paper, and/or the DAME paper.
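
To make the match quality formula concrete, here is a small sketch of how one FLAME step might rank candidate covariates by MQ = C · BF - PE. All numbers are made up. The BF formula used here (proportion of available control units matched plus proportion of available treated units matched) is an assumption, chosen to be consistent with the 0 to 2 range given for `early_stop_bf`; it is not taken from the package source.

```r
# Hypothetical summaries for three candidate covariates at one iteration.
candidates <- data.frame(
  covariate       = c("X1", "X2", "X3"),
  PE              = c(1.20, 0.95, 1.60),  # predictive error if this covariate is dropped
  matched_control = c(12, 30, 45),        # control units matched if it is dropped
  matched_treated = c(10, 25, 40)         # treated units matched if it is dropped
)
n_control_available <- 100  # hypothetical count of control units still available
n_treated_available <- 80   # hypothetical count of treated units still available
C <- 0.1

# Assumed balancing factor: sum of the two matched proportions (between 0 and 2).
candidates$BF <- candidates$matched_control / n_control_available +
  candidates$matched_treated / n_treated_available

# Match quality as defined in the text.
candidates$MQ <- C * candidates$BF - candidates$PE

# The covariate whose removal leaves the highest match quality is dropped.
to_drop <- candidates$covariate[which.max(candidates$MQ)]
print(candidates)
print(to_drop)  # "X2"
```

With a small C such as 0.1, the PE term dominates, so X2 is chosen because dropping it hurts prediction least, even though dropping X3 would form more matches.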

Stopping Rules

By default, both FLAME and DAME stop when (1) all covariates have been dropped or (2) all treatment or control units have been matched. This behavior can be modified by the arguments whose prefix is "early_stop". With the exception of early_stop_iterations, all the rules come into play before the offending covariate set is dropped. For example, if early_stop_control = 0.2 and, at the current iteration, dropping the covariate leading to the highest match quality is associated with an unmatched control proportion of 0.1, FLAME will stop without dropping this covariate.
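
The early stopping checks described in the arguments can be written out as simple predicates. The sketch below only restates the documented rules in base R; the function names `stops_on_pe`, `stops_on_weights`, and `stops_on_unmatched` are hypothetical helpers, not part of the package.

```r
# PE rule (used when `weights` is NULL): stop if the candidate PE exceeds
# (1 + early_stop_epsilon) times the baseline PE.
stops_on_pe <- function(candidate_pe, baseline_pe, early_stop_epsilon = 0.25) {
  candidate_pe > (1 + early_stop_epsilon) * baseline_pe
}

# Weight rule (used when fixed `weights` are supplied): the error is
# 1 - weight(covariate set matched on) / weight(all covariates).
stops_on_weights <- function(weights, matched_on, early_stop_epsilon = 0.25) {
  error <- 1 - sum(weights[matched_on]) / sum(weights)
  error > early_stop_epsilon
}

# Unmatched-proportion rule for early_stop_control / early_stop_treated.
stops_on_unmatched <- function(prop_unmatched, threshold = 0) {
  prop_unmatched < threshold
}

stops_on_pe(candidate_pe = 1.3, baseline_pe = 1.0)      # TRUE, since 1.3 > 1.25
stops_on_pe(candidate_pe = 1.2, baseline_pe = 1.0)      # FALSE

w <- c(X1 = 4, X2 = 3, X3 = 2, X4 = 1)                  # hypothetical weights
stops_on_weights(w, matched_on = c("X1", "X2", "X3"))   # FALSE, error is 0.1
stops_on_weights(w, matched_on = c("X1", "X2"))         # TRUE, error is 0.3

stops_on_unmatched(0.1, threshold = 0.2)                # TRUE, as in the example above
```

Note that with the default threshold of 0 the unmatched-proportion rule can never trigger, which matches the default behavior of running until units or covariates are exhausted.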

Missing Data

FLAME and DAME offer functionality for handling missing data in the covariates, for both the data and holdout sets. This functionality can be specified via the arguments whose prefix is "missing" or "impute". It allows for ignoring missing data, imputing it, or (for data) not matching on missing values. If data is imputed, imputation will be done once and the matching algorithm will be run on the imputed dataset. If holdout is imputed, the predictive error at an iteration will be the average of predictive errors across all imputed holdout datasets. Units with missingness in the treatment or outcome will be dropped.
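
As an illustration of the 'keep' option, under which a unit can only match on covariate sets for which it has no missing values, the following base R sketch flags which units would be eligible for a given set. The data frame `covs` and the helper `eligible` are hypothetical and only mirror the documented rule; they do not show how the package implements it.

```r
# Hypothetical covariates for four units, with some missing values.
covs <- data.frame(
  X1 = c(1, 1, NA, 0),
  X2 = c(0, NA, 0, 0),
  X3 = c(1, 1, 1, NA)
)

# A unit is eligible to match on `cov_set` only if none of those
# covariates is missing for that unit.
eligible <- function(covs, cov_set) {
  stats::complete.cases(covs[, cov_set, drop = FALSE])
}

eligible(covs, c("X1", "X2", "X3"))  # only unit 1
eligible(covs, c("X1", "X3"))        # units 1 and 2
eligible(covs, "X2")                 # units 1, 3 and 4
```

Units with missing values therefore become eligible for matching only in later iterations, once the covariates they are missing have been dropped.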

Examples

## Not run: 
data <- gen_data()
holdout <- gen_data()
# FLAME with replacement, stopping after dropping a single covariate
FLAME_out <- FLAME(data = data, holdout = holdout,
                   replace = TRUE, early_stop_iterations = 2)

# Use a linear model to compute predictive error. Call DAME without
# replacement, returning predictive error at each iteration.
my_PE <- function(X, Y) {
  return(lm(Y ~ ., as.data.frame(cbind(X, Y = Y)))$fitted.values)
}
DAME_out <- DAME(data = data, holdout = holdout,
                 PE_method = my_PE, return_pe = TRUE)

## End(Not run)
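
For a binary outcome, a user-supplied `PE_method` function must return maximum probability class labels rather than fitted probabilities. The sketch below shows one possible such function using logistic regression. It assumes the outcome is coded as numeric 0/1; the name `my_binary_PE` and the simulated inputs are hypothetical, and the function is exercised here on its own rather than inside a call to `FLAME` or `DAME`.

```r
# Hypothetical PE function for a 0/1 outcome: fit a logistic regression on
# the covariates and return the most probable class for each unit.
my_binary_PE <- function(X, Y) {
  df <- cbind(as.data.frame(X), Y = Y)
  fit <- glm(Y ~ ., data = df, family = binomial())
  # Threshold in-sample fitted probabilities at 0.5 to obtain class labels.
  as.numeric(fitted(fit) > 0.5)
}

# Simulated categorical covariates and a binary outcome, for illustration only.
set.seed(42)
X <- data.frame(
  a = factor(sample(0:1, 100, replace = TRUE)),
  b = factor(sample(0:2, 100, replace = TRUE))
)
Y <- rbinom(100, 1, 0.5)

preds <- my_binary_PE(X, Y)
table(preds)
```

If the outcome were instead a two-level factor, the returned labels would presumably need to use the same levels, so the thresholding step would have to map back to them.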

[Package FLAME version 2.1.1 Index]