mkCrossFrameNExperiment {vtreat} | R Documentation |
Run a numeric cross frame experiment.
Description
Builds a designTreatmentsN treatment plan and a data frame prepared from dframe
that is "cross" in the sense that each row is treated using a treatment
plan built from a subset of dframe disjoint from the given row.
The goal is to supply a method of breaking nested model bias other than splitting
the data into calibration, training, and test sets.
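The "cross" idea can be illustrated without vtreat. The following is a minimal base R sketch, not vtreat's implementation: the helper `cross_impact_code` and its fold assignment are hypothetical, and it encodes a single categorical column as an out-of-fold level effect so that no row is scored by a code that was learned from its own outcome.

```r
# Hypothetical helper (not part of vtreat): out-of-fold impact coding of one
# categorical column against a numeric outcome.
cross_impact_code <- function(x, y, ncross = 3) {
  n <- length(y)
  # assign each row to one of ncross folds, as evenly as possible
  fold <- sample(rep_len(seq_len(ncross), n))
  # treat NA as its own level so it can be coded like any other value
  key <- ifelse(is.na(x), "_NA_", as.character(x))
  coded <- numeric(n)
  for (k in seq_len(ncross)) {
    held_out <- fold == k
    # level means are learned only from rows outside fold k
    train_mean <- mean(y[!held_out])
    lev_means <- tapply(y[!held_out], key[!held_out], mean)
    # look up each held-out row's level; levels unseen in training get effect 0
    idx <- match(key[held_out], names(lev_means))
    effect <- as.numeric(lev_means)[idx] - train_mean
    effect[is.na(effect)] <- 0
    coded[held_out] <- effect
  }
  list(coded = coded, fold = fold)
}

set.seed(23525)
x <- c('a', 'a', 'a', 'a', 'b', 'b', NA, NA)
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
res <- cross_impact_code(x, y)
print(data.frame(x = x, y = y, fold = res$fold, coded = res$coded))
```

Each value in `coded` depends only on outcomes from other folds, which is the property the cross frame is meant to provide for every derived column.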
Usage
mkCrossFrameNExperiment(
dframe,
varlist,
outcomename,
...,
weights = c(),
minFraction = 0.02,
smFactor = 0,
rareCount = 0,
rareSig = 1,
collarProb = 0,
codeRestriction = NULL,
customCoders = NULL,
scale = FALSE,
doCollar = FALSE,
splitFunction = NULL,
ncross = 3,
forceSplit = FALSE,
verbose = TRUE,
parallelCluster = NULL,
use_parallel = TRUE,
missingness_imputation = NULL,
imputation_map = NULL
)
Arguments
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values and there must be a cut such that dframe[[outcomename]] is both above the cut at least twice and below the cut at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1 (off), as shown in Usage. |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set. |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
scale |
optional, if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar |
optional, if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction |
(optional) see vtreat::buildEvalSets. |
ncross |
optional scalar >= 2, number of cross-validation rounds to design. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
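Two of the argument descriptions above can be made concrete in base R. This is a sketch under stated assumptions: `outcome_is_usable` and `weighted_mean_imputer` are hypothetical names, the first encodes one reading of the "cut" requirement on dframe[[outcomename]], and the second assumes an imputer should return a single replacement value (the text above gives only the signature f(values, weights)).

```r
# Hypothetical check (not part of vtreat) of the documented constraint on
# dframe[[outcomename]]: finite, non-missing, and separable by some cut that
# has at least two values below it and at least two values above it.
outcome_is_usable <- function(y) {
  if (!is.numeric(y) || length(y) < 4 || any(!is.finite(y))) {
    return(FALSE)
  }
  s <- sort(y)
  # a cut strictly between the 2nd smallest and 2nd largest values exists
  # exactly when those two values differ
  s[2] < s[length(s) - 1]
}

# Hypothetical imputer with the documented signature f(values, weights).
# Assumption: it returns one number, here the weighted mean of the
# non-missing values.
weighted_mean_imputer <- function(values, weights) {
  ok <- !is.na(values)
  sum(values[ok] * weights[ok]) / sum(weights[ok])
}

y <- c(0, 0, 0, 1, 0, 1, 1, 1)
print(outcome_is_usable(y))              # outcome from the Examples section
print(outcome_is_usable(c(0, 0, 0, 1)))  # only one value above any cut

z <- c(1, 2, 3, 4, 5, NA, 7, NA)
print(weighted_mean_imputer(z, rep(1, length(z))))
```

A function like `weighted_mean_imputer` would be passed as missingness_imputation, or as an entry of imputation_map keyed by column name.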
Value
named list containing: treatments, crossFrame, crossWeights, method, and evalSets
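Because the result is an ordinary named list, its components can be read with `$` rather than with wrapr's `unpack[]`. The sketch below assumes vtreat is installed (it is skipped otherwise) and reuses the toy data from the Examples section; the variable name `cfe` is arbitrary.

```r
# Runs only if vtreat is available; otherwise the block is skipped.
if (requireNamespace("vtreat", quietly = TRUE)) {
  set.seed(23525)
  dTrainN <- data.frame(
    x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
    z = c(1, 2, 3, 4, 5, NA, 7, NA),
    y = c(0, 0, 0, 1, 0, 1, 1, 1))

  cfe <- vtreat::mkCrossFrameNExperiment(
    dframe = dTrainN,
    varlist = c('x', 'z'),
    outcomename = 'y',
    verbose = FALSE)

  # component names of the returned list
  print(names(cfe))

  # the treatment plan, for preparing future data
  treatmentsN <- cfe$treatments
  # the cross-validated treated training frame
  dTrainNTreated <- cfe$crossFrame
  print(head(dTrainNTreated))
}
```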
See Also
designTreatmentsC, designTreatmentsN, prepare.treatmentplan
Examples
# numeric example
library(vtreat)
library(wrapr)  # supplies unpack[] and the %.>% pipe used below
set.seed(23525)
# we set up our raw training and application data
dTrainN <- data.frame(
  x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 1, 1))
dTestN <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))
# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
  treatmentsN = treatments,
  dTrainNTreated = crossFrame
] <- mkCrossFrameNExperiment(
  dframe = dTrainN,
  varlist = setdiff(colnames(dTrainN), 'y'),
  outcomename = 'y',
  verbose = FALSE)
# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainNTreated %.>%
  head(.) %.>%
  print(.)
# Any future application data is prepared with
# the prepare method.
dTestNTreated <- prepare(treatmentsN, dTestN, pruneSig = NULL)
dTestNTreated %.>%
  head(.) %.>%
  print(.)