designTreatmentsN {vtreat}

R Documentation

build all treatments for a data frame to predict a numeric outcome

Description

Function to design variable treatments for prediction (regression) of a numeric outcome. The data frame is assumed to have only atomic columns, except for dates (which are converted to numeric). Note: each column is processed independently of all others. Note: re-encoding high-cardinality categorical variables on the training data can introduce undesirable nested model bias; for such data consider using mkCrossFrameNExperiment.
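
A minimal sketch of the mkCrossFrameNExperiment alternative mentioned above, for data with a high-cardinality categorical variable. The data frame `d` and its columns are hypothetical, made up for illustration; the `treatments` and `crossFrame` fields are as described in the mkCrossFrameNExperiment help page.

```r
library(vtreat)

# Hypothetical toy data: a high-cardinality categorical x, numeric z, numeric outcome y.
set.seed(2024)
d <- data.frame(
  x = sample(paste0("lev", 1:40), 200, replace = TRUE),
  z = rnorm(200),
  stringsAsFactors = FALSE
)
d$y <- d$z + rnorm(200)

# Design the treatments and build a cross-validated treated training frame
# in one step, so complex encodings of x are not scored on the same rows
# that were used to fit them.
cfe <- mkCrossFrameNExperiment(d, c("x", "z"), "y", verbose = FALSE)
plan <- cfe$treatments           # treatment plan, for use with prepare() on new data
trainTreated <- cfe$crossFrame   # treated training data, for fitting the model
```

The cross frame is intended for model fitting; new data would still go through prepare with the returned plan.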

Usage

designTreatmentsN(
  dframe,
  varlist,
  outcomename,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = NULL,
  collarProb = 0,
  codeRestriction = NULL,
  customCoders = NULL,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  verbose = TRUE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  missingness_imputation = NULL,
  imputation_map = NULL
)

Arguments

`dframe`	Data frame to learn treatments from (training data), must have at least 1 row.
`varlist`	Names of columns to treat (effective variables).
`outcomename`	Name of column holding outcome variable. dframe[[outcomename]] must contain only finite, non-missing values, and there must be a cut such that dframe[[outcomename]] is above the cut at least twice and below the cut at least twice.
`...`	no additional arguments, declared to force named binding of later arguments
`weights`	optional training weights for each row
`minFraction`	optional minimum frequency a categorical level must have to be converted to an indicator column.
`smFactor`	optional smoothing factor for impact coding models.
`rareCount`	optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off.
`rareSig`	optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off.
`collarProb`	what fraction of the data (pseudo-probability) to collar data at if doCollar is set during `prepare.treatmentplan`.
`codeRestriction`	what types of variables to produce (character array of level codes, NULL means no restriction).
`customCoders`	map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md).
`splitFunction`	(optional) see vtreat::buildEvalSets.
`ncross`	optional scalar >=2, number of cross-validation splits to use in rescoring complex variables.
`forceSplit`	logical, if TRUE force cross-validated significance calculations on all variables.
`verbose`	if TRUE print progress.
`parallelCluster`	(optional) a cluster object created by package parallel or package snow.
`use_parallel`	logical, if TRUE use parallel methods (when parallel cluster is set).
`missingness_imputation`	function of signature f(values: numeric, weights: numeric), simple missing value imputer.
`imputation_map`	map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers.
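
A minimal sketch of passing a custom imputer through `missingness_imputation`, using the f(values, weights) signature documented above. The data and the name `median_imputer` are hypothetical; ignoring the weights and returning the median is an illustrative choice, not a recommendation.

```r
library(vtreat)

# Hypothetical training data with missing values in the numeric column z.
d <- data.frame(
  z = c(1, 2, NA, 4, 5, NA, 7, 8),
  y = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Imputer with the documented signature f(values, weights).
# na.rm = TRUE is defensive, in case missing values are passed in.
median_imputer <- function(values, weights) {
  median(values, na.rm = TRUE)
}

# Arguments after outcomename must be passed by name.
plan <- designTreatmentsN(
  d, "z", "y",
  missingness_imputation = median_imputer,
  verbose = FALSE
)

# Apply the plan to new data that also contains a missing value.
newd <- data.frame(z = c(3, NA))
treated <- prepare(plan, newd)
```

To use different imputers for different columns, `imputation_map` takes a map from column names to functions of the same signature.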

Details

The main fields are mostly vectors with names (all with the same names in the same order):

- vars : (character array without names) names of variables (in the same order as the names on the other diagnostic vectors)
- varMoves : logical, TRUE if the variable varied during hold-out scoring; only variables that move will be in the treated frame
- sig : an estimate of the significance of the effect

See the vtreat vignette for a bit more detail and a worked example.

Columns that do not vary are not passed through.
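
A minimal sketch of inspecting these diagnostics and of the handling of a constant column. It assumes the treatment plan exposes a `scoreFrame` data frame with `varName`, `varMoves` and `sig` columns, as shown in the vtreat vignettes but not described on this page. The constant column `k` is hypothetical and added only for illustration.

```r
library(vtreat)

# Toy training data with a hypothetical constant column k.
dTrainN <- data.frame(
  x = c('a','a','a','a','b','b','b'),
  z = c(1,2,3,4,5,6,7),
  k = 1,
  y = c(0,0,0,1,0,1,1)
)
dTestN <- data.frame(
  x = c('a','b','c',NA),
  z = c(10,20,30,NA),
  k = 1
)

plan <- designTreatmentsN(dTrainN, c('x','z','k'), 'y', verbose = FALSE)

# One row per derived variable, with its estimated significance.
# (scoreFrame and its column names are assumptions taken from the vignettes.)
print(plan$scoreFrame[, c('varName', 'varMoves', 'sig')])

# The constant column k is expected to contribute no treated columns.
treated <- prepare(plan, dTestN, pruneSig = NULL)
print(colnames(treated))
```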

Value

treatment plan (for use with prepare)

Examples


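library(vtreat)

# Training data: categorical x, numeric z, numeric outcome y
dTrainN <- data.frame(x=c('a','a','a','a','b','b','b'),
    z=c(1,2,3,4,5,6,7),y=c(0,0,0,1,0,1,1))
# Test data with a novel level ('c') and missing values
dTestN <- data.frame(x=c('a','b','c',NA),
    z=c(10,20,30,NA))
# Design treatments on the training data, then apply them to the test data
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
dTestNTreated <- prepare(treatmentsN,dTestN,pruneSig=0.99)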

[Package vtreat version 1.6.5 Index]