designTreatmentsN {vtreat} | R Documentation |
build all treatments for a data frame to predict a numeric outcome
Description
Function to design variable treatments for prediction of a
numeric outcome. Data frame is assumed to have only atomic columns
except for dates (which are converted to numeric).
Note: each column is processed independently of all others.
Note: re-encoding high-cardinality categorical variables on training data
can introduce undesirable nested model bias; for such data consider
using mkCrossFrameNExperiment.
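The note above can be illustrated with a minimal sketch. The data here are synthetic and the names d, cfe, and dTrainTreated are illustrative assumptions, not part of the package. The sketch uses mkCrossFrameNExperiment to obtain both a treatment plan and a cross-validated training frame in one step, so the high-cardinality column x is not re-encoded on the same rows used to fit its encoding.

```r
library(vtreat)

# Synthetic training data (assumption for illustration only):
# x is a high-cardinality categorical column, z is numeric, y is the numeric outcome.
set.seed(2024)
n <- 200
d <- data.frame(
  x = sample(paste0("lev", 1:40), n, replace = TRUE),
  z = rnorm(n),
  stringsAsFactors = FALSE
)
d$y <- d$z + rnorm(n)

# Build the treatment plan and a cross-validated treated training frame together.
cfe <- mkCrossFrameNExperiment(d, c("x", "z"), "y", verbose = FALSE)

treatments <- cfe$treatments      # treatment plan, for use with prepare() on new data
dTrainTreated <- cfe$crossFrame   # treated training data, for fitting the downstream model
```

The downstream model would then be fit on dTrainTreated, while new data would be treated with prepare(treatments, newData).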
Usage
designTreatmentsN(
dframe,
varlist,
outcomename,
...,
weights = c(),
minFraction = 0.02,
smFactor = 0,
rareCount = 0,
rareSig = NULL,
collarProb = 0,
codeRestriction = NULL,
customCoders = NULL,
splitFunction = NULL,
ncross = 3,
forceSplit = FALSE,
verbose = TRUE,
parallelCluster = NULL,
use_parallel = TRUE,
missingness_imputation = NULL,
imputation_map = NULL
)
Arguments
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must contain only finite, non-missing values, and there must be a cut such that dframe[[outcomename]] is above the cut at least twice and below the cut at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off. |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
splitFunction |
(optional) see vtreat::buildEvalSets. |
ncross |
optional scalar >=2, number of cross-validation splits used in rescoring complex variables. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods (when parallel cluster is set). |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
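As a hedged illustration of the missingness_imputation argument, the sketch below supplies a median-based imputer in place of the default. The data frames d and dNew and the function name median_imputer are hypothetical, introduced only for this example. The imputer follows the documented signature f(values, weights) and ignores the weights.

```r
library(vtreat)

# Small training frame with missing values and one outlier in z (illustrative data).
d <- data.frame(
  z = c(1, 2, NA, 4, 5, NA, 7, 8, 9, 100),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Hypothetical custom imputer: returns the median of the observed values.
# na.rm = TRUE is a defensive choice in case missing values are passed in.
median_imputer <- function(values, weights) {
  median(values, na.rm = TRUE)
}

plan <- designTreatmentsN(d, "z", "y",
                          missingness_imputation = median_imputer,
                          verbose = FALSE)

# Apply the plan to new data containing a missing value.
dNew <- data.frame(z = c(3, NA))
treated <- prepare(plan, dNew)
print(treated)
```

A median is less sensitive than a mean to outliers such as the value 100 above, which is one reason to replace the default imputer. To vary the imputer by column, imputation_map takes a map from column names to functions of the same signature.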
Details
The main fields are mostly vectors with names (all with the same names in the same order):
- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical TRUE if the variable varied during hold out scoring, only variables that move will be in the treated frame
- sig : an estimated significance of effect
See the vtreat vignette for a bit more detail and a worked example.
Columns that do not vary are not passed through.
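The diagnostics described above can be inspected after designing a plan. The sketch below assumes a recent vtreat release, in which these per-variable diagnostics appear to be collected in a data frame field named scoreFrame with columns including varName, varMoves, and sig. That field name, the synthetic data, and the 0.05 threshold are assumptions for illustration; check str(plan) on the installed version.

```r
library(vtreat)

# Synthetic training data (illustrative only).
set.seed(1)
n <- 100
d <- data.frame(
  x = sample(c("a", "b", "c"), n, replace = TRUE),
  z = rnorm(n),
  stringsAsFactors = FALSE
)
d$y <- 2 * d$z + rnorm(n)

plan <- designTreatmentsN(d, c("x", "z"), "y", verbose = FALSE)

# Per-variable diagnostics (assumed field name: scoreFrame).
sf <- plan$scoreFrame
print(sf[, c("varName", "varMoves", "sig")])

# Select derived variables that moved and have a small estimated significance value.
# The 0.05 threshold is an arbitrary choice for this example.
keep <- sf$varName[which(sf$varMoves & sf$sig < 0.05)]
print(keep)
```

The selected names could be passed to prepare through its varRestriction argument, or a threshold could be applied directly with pruneSig.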
Value
treatment plan (for use with prepare)
See Also
prepare.treatmentplan, designTreatmentsC, designTreatmentsZ, mkCrossFrameNExperiment
Examples
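library(vtreat)
# Training data: categorical variable x, numeric variable z, numeric outcome y.
dTrainN <- data.frame(x=c('a','a','a','a','b','b','b'),
   z=c(1,2,3,4,5,6,7),y=c(0,0,0,1,0,1,1))
# Test data includes a level not seen in training ('c') and missing values.
dTestN <- data.frame(x=c('a','b','c',NA),
   z=c(10,20,30,NA))
# Design the treatment plan on the training data, then apply it to the test data.
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
dTestNTreated <- prepare(treatmentsN,dTestN,pruneSig=0.99)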