| prepare.treatmentplan {vtreat} | R Documentation | 
Apply treatments and restrict to useful variables.
Description
Use a treatment plan to prepare a data frame for analysis.  The
resulting frame will have new effective variables that are numeric
and free of NaN/NA.  If the outcome column is present it will be copied over.
The intent is that these frames are compatible with more machine learning
techniques, and avoid a lot of corner cases (NA, NaN, novel levels, too many levels).
Note: each column is processed independently of all others.
Note: treatment plans are not meant for long-term storage; a warning is issued if the version of
vtreat that produced the plan differs from the version running prepare().
Usage
## S3 method for class 'treatmentplan'
prepare(
  treatmentplan,
  dframe,
  ...,
  pruneSig = NULL,
  scale = FALSE,
  doCollar = FALSE,
  varRestriction = NULL,
  codeRestriction = NULL,
  trackedValues = NULL,
  extracols = NULL,
  parallelCluster = NULL,
  use_parallel = TRUE,
  check_for_duplicate_frames = TRUE
)
Arguments
| treatmentplan | Plan built by designTreatmentsC() or designTreatmentsN() | 
| dframe | Data frame to be treated | 
| ... | no additional arguments, declared to force named binding of later arguments | 
| pruneSig | suppress variables with significance above this level | 
| scale | optional, if TRUE replace numeric variables with single-variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed (lm for regression problems, glm for classification problems) against the outcome. | 
| doCollar | optional, if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. | 
| varRestriction | optional list of treated variable names to restrict to | 
| codeRestriction | optional list of treated variable codes to restrict to | 
| trackedValues | optional named list mapping variables to known values, allows warnings upon novel level appearances (see  | 
| extracols | extra columns to copy. | 
| parallelCluster | (optional) a cluster object created by package parallel or package snow. | 
| use_parallel | logical, if TRUE use parallel methods. | 
| check_for_duplicate_frames | logical, if TRUE check if we called prepare on same data.frame as design step. | 
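The arguments after `...` must be passed by name. The following is a minimal sketch of how `codeRestriction` and `doCollar` might be combined, not a definitive recipe. The data frames `dTrainN` and `dNewN` and the plan name `planN` are made up for illustration, and the codes `'clean'` and `'isBAD'` are assumed to be the ones reported in the `code` column of the plan's score frame for numeric inputs.

```r
library(vtreat)

set.seed(2024)
# hypothetical training data with a numeric outcome y
dTrainN <- data.frame(
  x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 1, 1))
# hypothetical application data (no outcome column)
dNewN <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# design a treatment plan for a numeric outcome
planN <- designTreatmentsN(
  dTrainN,
  varlist = c('x', 'z'),
  outcomename = 'y',
  verbose = FALSE)

# keep only the derived variables whose code is 'clean' or 'isBAD',
# and collar numeric values using the limits stored in the plan
treatedN <- prepare(
  planN, dNewN,
  codeRestriction = c('clean', 'isBAD'),
  doCollar = TRUE)
print(treatedN)
```

Inspecting `planN$scoreFrame$code` before choosing a restriction shows which codes the plan actually produced.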
Value
treated data frame (all columns numeric, without NA or NaN)
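The stated return contract can be checked directly. The sketch below assumes an outcome-free plan built with designTreatmentsZ() (listed under See Also); the names `dTrainZ`, `dNewZ`, and `planZ` are illustrative only. The application data deliberately contains a level ('c') that is absent from training, plus missing values.

```r
library(vtreat)

# hypothetical training data, no outcome column needed
dTrainZ <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA))
# application data with a novel level 'c' and NA values
dNewZ <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

planZ <- designTreatmentsZ(
  dTrainZ,
  varlist = c('x', 'z'),
  verbose = FALSE)
treatedZ <- prepare(planZ, dNewZ)

# every column should be numeric ...
all_numeric <- all(vapply(treatedZ, is.numeric, logical(1)))
# ... and no column should contain NA or NaN (anyNA() flags both)
no_missing <- !any(vapply(treatedZ, anyNA, logical(1)))
print(c(all_numeric = all_numeric, no_missing = no_missing))
```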
See Also
mkCrossFrameCExperiment, mkCrossFrameNExperiment, designTreatmentsC, designTreatmentsN, designTreatmentsZ, prepare
Examples
# categorical example
library(vtreat)
library(wrapr)  # supplies unpack[] and the %.>% pipe used below
set.seed(23525)
# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))
dTestC <- data.frame(
  x = c('a', 'b', 'c', NA), 
  z = c(10, 20, 30, NA))
# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
  ] <- mkCrossFrameCExperiment(
    dframe = dTrainC,
    varlist = setdiff(colnames(dTrainC), 'y'),
    outcomename = 'y',
    outcometarget = TRUE,
    verbose = FALSE)
# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
# the treated frame is a "cross frame" which
# is a transform of the training data built 
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
  head(.) %.>%
  print(.)
# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)
dTestCTreated %.>%
  head(.) %.>%
  print(.)
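The Description states that the outcome column is copied over when it is present in the frame being prepared. A short sketch of that behavior follows, assuming a plan built directly with designTreatmentsC() rather than a cross-frame experiment. The frame `dAppC` and its outcome values are invented for illustration.

```r
library(vtreat)

set.seed(23525)
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))
# hypothetical application data that still carries the outcome,
# for example a labeled hold-out set
dAppC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA),
  y = c(FALSE, TRUE, TRUE, FALSE))

planC <- designTreatmentsC(
  dTrainC,
  varlist = c('x', 'z'),
  outcomename = 'y',
  outcometarget = TRUE,
  verbose = FALSE)

treatedC <- prepare(planC, dAppC)
# the outcome column travels along with the derived variables
print('y' %in% colnames(treatedC))
print(colnames(treatedC))
```

Keeping the outcome in the prepared frame makes it convenient to score a downstream model on the same data frame.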