R: Run a numeric cross frame experiment.

mkCrossFrameNExperiment {vtreat}

R Documentation

Run a numeric cross frame experiment.

Description

Builds a designTreatmentsN treatment plan and a data frame prepared from dframe that is "cross" in the sense that each row is treated using a treatment plan built from a subset of dframe disjoint from the given row. The goal is to supply a method of breaking nested model bias other than splitting the data into calibration, training, and test sets.
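
The "cross" idea can be illustrated with a small base-R sketch. This is not vtreat's implementation: the helper name cross_impact_code, the fold assignment, and the simple mean-difference encoding are all assumptions made for illustration. Each row's encoded value is computed only from rows in other folds, so the encoding never sees the outcome of the row it is applied to.

```r
# Sketch only: a hand-rolled out-of-fold ("cross") impact code in base R.
# cross_impact_code is a hypothetical helper, not part of vtreat.
cross_impact_code <- function(x, y, ncross = 3) {
  n <- length(y)
  # randomly assign each row to one of ncross folds
  fold <- sample(rep_len(seq_len(ncross), n))
  coded <- numeric(n)
  for (k in seq_len(ncross)) {
    held_out <- fold == k
    # learn the encoding only from rows outside fold k
    grand_mean <- mean(y[!held_out])
    level_means <- tapply(y[!held_out], x[!held_out], mean)
    # look up each held-out row's level; unseen levels give NA
    idx <- match(x[held_out], names(level_means))
    eff <- unname(level_means[idx]) - grand_mean
    eff[is.na(eff)] <- 0  # unseen or missing levels get a neutral effect
    coded[held_out] <- eff
  }
  coded
}

set.seed(2024)
x <- sample(c("a", "b", "c"), 30, replace = TRUE)
y <- rnorm(30) + (x == "a")
x_coded <- cross_impact_code(x, y)
head(data.frame(x = x, y = y, x_coded = x_coded))
```

A model fit on x_coded is then less exposed to the nested model bias that arises when the same rows are used both to build the encoding and to fit the model.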

Usage

mkCrossFrameNExperiment(
  dframe,
  varlist,
  outcomename,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = 1,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  scale = FALSE,
  doCollar = FALSE,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)

Arguments

`dframe`	Data frame to learn treatments from (training data), must have at least 1 row.
`varlist`	Names of columns to treat (effective variables).
`outcomename`	Name of column holding outcome variable. dframe[[outcomename]] must contain only finite, non-missing values, and there must be a cut such that dframe[[outcomename]] is above the cut at least twice and below the cut at least twice.
`...`	no additional arguments, declared to force named binding of later arguments
`weights`	optional training weights for each row
`minFraction`	optional minimum frequency a categorical level must have to be converted to an indicator column.
`smFactor`	optional smoothing factor for impact coding models.
`rareCount`	optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off.
`rareSig`	optional numeric, suppress levels from pooling at this significance value or greater. The Usage signature sets the default to 1.
`collarProb`	what fraction of the data (pseudo-probability) to collar data at if doCollar is set during `prepare.treatmentplan`.
`codeRestriction`	what types of variables to produce (character array of level codes, NULL means no restriction).
`customCoders`	map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md).
`scale`	optional, if TRUE replace numeric variables with regression ("move to outcome-scale").
`doCollar`	optional, if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design.
`splitFunction`	(optional) see vtreat::buildEvalSets.
`ncross`	optional scalar >= 2, number of cross-validation rounds to design.
`forceSplit`	logical, if TRUE force cross-validated significance calculations on all variables.
`verbose`	if TRUE print progress.
`parallelCluster`	(optional) a cluster object created by package parallel or package snow.
`use_parallel`	logical, if TRUE use parallel methods.
`missingness_imputation`	function of signature f(values: numeric, weights: numeric), simple missing value imputer.
`imputation_map`	map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers.
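
As a sketch of the imputer signature f(values, weights) described for missingness_imputation and imputation_map, the following base-R function returns a weighted mean. The function name weighted_mean_imputer is hypothetical, and because this page does not state whether missing entries or empty weights are passed to the imputer, the sketch handles both defensively.

```r
# Hypothetical imputer matching the signature f(values, weights).
weighted_mean_imputer <- function(values, weights) {
  # assumption: fall back to unit weights if none are supplied
  if (is.null(weights) || length(weights) == 0) {
    weights <- rep(1, length(values))
  }
  keep <- !is.na(values)  # defensively ignore missing entries
  if (!any(keep)) {
    return(0)             # arbitrary fallback when nothing is observed
  }
  sum(values[keep] * weights[keep]) / sum(weights[keep])
}

weighted_mean_imputer(c(1, 2, NA, 4), c(1, 1, 1, 2))  # 2.75
```

Such a function could be supplied as missingness_imputation = weighted_mean_imputer to apply it to every numeric column, or as imputation_map = list(z = weighted_mean_imputer) to apply it to the column z only.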

Value

named list containing: treatments, crossFrame, crossWeights, method, and evalSets

Examples


# numeric example
library(vtreat)
library(wrapr)  # provides unpack[] and the %.>% pipe used below
set.seed(23525)

# we set up our raw training and application data
dTrainN <- data.frame(
  x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA), 
  y = c(0, 0, 0, 1, 0, 1, 1, 1))

dTestN <- data.frame(
  x = c('a', 'b', 'c', NA), 
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
  treatmentsN = treatments,
  dTrainNTreated = crossFrame
  ] <- mkCrossFrameNExperiment(
    dframe = dTrainN,
    varlist = setdiff(colnames(dTrainN), 'y'),
    outcomename = 'y',
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)

# the treated frame is a "cross frame" which
# is a transform of the training data built 
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainNTreated %.>%
  head(.) %.>%
  print(.)

# Any future application data is prepared with
# the prepare method.
dTestNTreated <- prepare(treatmentsN, dTestN, pruneSig=NULL)

dTestNTreated %.>%
  head(.) %.>%
  print(.)
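
A typical next step is to fit a model on the cross frame and score prepared application data. The sketch below is one possible way to do that, not a prescribed workflow: it assumes the cross frame carries the outcome column y alongside the derived variables, uses every variable listed in the score frame without any significance filtering, and fits a plain lm. With only 8 training rows the fit is not meaningful (it may be rank-deficient and produce warnings), so the example only shows how the pieces connect.

```r
library(vtreat)

set.seed(23525)

# same toy data as in the example above
dTrainN <- data.frame(
  x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 1, 1))

dTestN <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# run the cross frame experiment and keep the returned list whole
cfe <- mkCrossFrameNExperiment(
  dframe = dTrainN,
  varlist = setdiff(colnames(dTrainN), 'y'),
  outcomename = 'y',
  verbose = FALSE)

# derived variable names come from the score frame
vars <- cfe$treatments$scoreFrame$varName

# fit on the cross frame (assumption: it includes the outcome column 'y')
model <- lm(y ~ ., data = cfe$crossFrame[, c(vars, 'y'), drop = FALSE])

# application data goes through prepare(), never through the cross frame
dTestNTreated <- prepare(cfe$treatments, dTestN, pruneSig = NULL)
dTestNTreated$pred <- predict(model, newdata = dTestNTreated)

print(dTestNTreated)
```

The design point is that the model is trained on crossFrame rather than on prepare(treatments, dTrainN), while new data is always treated with prepare().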