prepare.treatmentplan {vtreat} | R Documentation |
Apply treatments and restrict to useful variables.
Description
Use a treatment plan to prepare a data frame for analysis. The
resulting frame will have new effective variables that are numeric
and free of NaN/NA. If the outcome column is present it will be copied over.
The intent is that these frames are compatible with more machine learning
techniques, and avoid a lot of corner cases (NA, NaN, novel levels, too many levels).
Note: each column is processed independently of all others.
Note: treatment plans are not meant for long-term storage; a warning is issued if the version of
vtreat that produced the plan differs from the version running prepare().
Usage
## S3 method for class 'treatmentplan'
prepare(
treatmentplan,
dframe,
...,
pruneSig = NULL,
scale = FALSE,
doCollar = FALSE,
varRestriction = NULL,
codeRestriction = NULL,
trackedValues = NULL,
extracols = NULL,
parallelCluster = NULL,
use_parallel = TRUE,
check_for_duplicate_frames = TRUE
)
Arguments
treatmentplan |
Plan built by designTreatmentsC() or designTreatmentsN() |
dframe |
Data frame to be treated |
... |
no additional arguments, declared to force named binding of later arguments |
pruneSig |
suppress variables with significance above this level |
scale |
optional; if TRUE, replace numeric variables with single-variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed (lm for regression problems, glm for classification problems) against the outcome. |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
varRestriction |
optional list of treated variable names to restrict to |
codeRestriction |
optional list of treated variable codes to restrict to |
trackedValues |
optional named list mapping variables to known values, allows warnings upon novel level appearances (see |
extracols |
extra columns to copy. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
check_for_duplicate_frames |
logical, if TRUE check if we called prepare on same data.frame as design step. |
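The restriction arguments can be illustrated with a minimal sketch. It assumes a plan built by designTreatmentsC() on the same toy data as the Examples section, and it assumes the plan's scoreFrame contains the codes 'clean' and 'isBAD' for the numeric column z (which has missing values); other data may yield different codes. The names plan, by_code, by_name, and keep are illustrative only.

```r
library(vtreat)

set.seed(23525)
# toy training and application data (same shape as in the Examples section)
dTrain <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))
dTest <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

plan <- designTreatmentsC(
  dframe = dTrain,
  varlist = c('x', 'z'),
  outcomename = 'y',
  outcometarget = TRUE,
  verbose = FALSE)

# restrict by treatment code (assumed here: cleaned numerics and missingness indicators)
by_code <- prepare(plan, dTest, codeRestriction = c('clean', 'isBAD'))

# restrict by derived variable name, looked up from the score frame
keep <- plan$scoreFrame$varName[plan$scoreFrame$code %in% c('clean', 'isBAD')]
by_name <- prepare(plan, dTest, varRestriction = keep)

print(colnames(by_code))
print(colnames(by_name))
```

Both calls select from the variables listed in the plan's scoreFrame, so inspecting scoreFrame first is a reasonable way to choose names or codes.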
Value
treated data frame (all columns numeric, without NA or NaN)
See Also
mkCrossFrameCExperiment
, mkCrossFrameNExperiment
, designTreatmentsC
designTreatmentsN
designTreatmentsZ
, prepare
Examples
# categorical example
library(vtreat)
library(wrapr)  # supplies unpack[] and the %.>% pipe used below
set.seed(23525)
# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))
dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))
# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
] <- mkCrossFrameCExperiment(
  dframe = dTrainC,
  varlist = setdiff(colnames(dTrainC), 'y'),
  outcomename = 'y',
  outcometarget = TRUE,
  verbose = FALSE)
# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
print(.)
# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
head(.) %.>%
print(.)
# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)
dTestCTreated %.>%
head(.) %.>%
print(.)
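
The Value section states that every column of the treated frame is numeric and free of NA/NaN. A minimal, self-contained sketch of checking that on application data containing a novel level ('c') and missing values follows. It builds the plan with designTreatmentsC() rather than the cross-frame experiment above, purely to keep the sketch short; the helper names all_numeric and no_missing are illustrative.

```r
library(vtreat)

set.seed(23525)
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))
dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),   # 'c' never appeared in training
  z = c(10, 20, 30, NA))

treatmentsC <- designTreatmentsC(
  dframe = dTrainC,
  varlist = c('x', 'z'),
  outcomename = 'y',
  outcometarget = TRUE,
  verbose = FALSE)

dTestCTreated <- prepare(treatmentsC, dTestC)

# every column should be numeric
all_numeric <- all(vapply(dTestCTreated, is.numeric, logical(1)))
# no column should contain NA or NaN (is.na() is TRUE for both)
no_missing <- !any(vapply(dTestCTreated,
                          function(col) any(is.na(col)),
                          logical(1)))
print(c(all_numeric = all_numeric, no_missing = no_missing))
```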