create.formula {formulaic}    R Documentation

Create Formula

Description

create.formula automatically creates a formula object from a provided outcome name and input variable names. It reduces the time required to manually type out variables for modeling and implementation, which is most useful when modeling data with many features. The output can be used in linear regression, random forests, neural networks, etc.

Usage

create.formula(
  outcome.name,
  input.names = NULL,
  input.patterns = NULL,
  dat = NULL,
  interactions = NULL,
  force.main.effects = TRUE,
  reduce = FALSE,
  max.input.categories = 20,
  max.outcome.categories.to.search = 4,
  order.as = "as.specified",
  include.backtick = "as.needed",
  format.as = "formula",
  variables.to.exclude = NULL,
  include.intercept = TRUE
)

Arguments

outcome.name

A character value specifying the name of the formula's outcome variable. In this version, only a single outcome may be included. The first entry of outcome.name will be used to build the formula.

input.names

A character vector giving the full names of the input variables. The user can specify '.' or 'all' to include all the column variables.

input.patterns

Includes additional input variables. The user may enter patterns, e.g. to include every variable with a name that includes the pattern. Multiple patterns may be included as a character vector. However, each pattern may not contain spaces and is otherwise subject to the same limits on patterns as used in the grep function.
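The pattern matching described above can be sketched in base R. This is only an illustration of grep-style matching against column names, not the package's implementation; column.names and pattern.inputs are hypothetical names that mirror the Examples section.

```r
# Hypothetical column names, mirroring the data built in the Examples section.
column.names <- c("w", "x", "pixel_1", "pixel_2", "y")
input.patterns <- c("pi", "xel")

# For each pattern, collect the column names that contain it.
matches <- lapply(input.patterns, function(p) {
  grep(pattern = p, x = column.names, value = TRUE)
})

# A name matched by several patterns is kept only once.
pattern.inputs <- unique(unlist(matches))
print(pattern.inputs)
```

Both patterns match the same two columns here, so each name appears once in the combined result.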

dat

The user can specify a data.frame object that will be used to remove any variables that are not listed in names(dat). The default is NULL. In this case, the formula is created simply from the outcome.name and input.names.

interactions

A list of character vectors. Each character vector includes the names of the variables that form a single interaction. Specifying interactions = list(c("x", "y"), c("x", "z"), c("y", "z"), c("x", "y", "z")) would lead to the interactions x*y + x*z + y*z + x*y*z.
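As a rough base-R sketch of how such a list maps onto the interaction terms quoted above (an illustration only, not the package's code; the outcome name "outcome" is hypothetical):

```r
# The interactions list from the argument description.
interactions <- list(c("x", "y"), c("x", "z"), c("y", "z"), c("x", "y", "z"))

# Join the variables of each interaction with "*", then join the terms with " + ".
interaction.terms <- vapply(interactions, paste, character(1), collapse = "*")
rhs <- paste(interaction.terms, collapse = " + ")
print(rhs)

# Attach a hypothetical outcome name and convert the text to a formula object.
f <- as.formula(paste("outcome ~", rhs))
print(all.vars(f))
```

Note that in R's formula syntax, x*y already expands to x + y + x:y when the formula is interpreted by a modeling function.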

force.main.effects

A logical value. When TRUE, the intent is that any term included as an interaction (of multiple variables) must also be listed individually as a main effect.

reduce

A logical value. When dat is not NULL and reduce is TRUE, additional quality checks are performed to examine the input variables. Any input variables that exhibit a lack of contrast will be excluded from the model. This search is global by default but may be conducted separately in subsets of the outcome variables by specifying max.outcome.categories.to.search. Additionally, any input variables that exhibit too many contrasts, as defined by max.input.categories, will also be excluded.
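The global version of these checks might look roughly like the base-R sketch below. It rests on two assumptions that this page does not confirm: that "lack of contrast" means fewer than two unique values, and that only non-numeric inputs are counted as having categories. The data and the limit of 3 are hypothetical.

```r
# Hypothetical data: "constant" has one unique value; "id" has four categories.
dat <- data.frame(
  x = c(1.2, 3.4, 2.2, 5.1),
  constant = c("a", "a", "a", "a"),
  id = c("p1", "p2", "p3", "p4"),
  stringsAsFactors = FALSE
)
max.input.categories <- 3  # hypothetical limit chosen for this small example

# Assumption: an input lacks contrast when it has fewer than two unique values.
num.unique <- vapply(dat, function(col) length(unique(col)), integer(1))
lacks.contrast <- num.unique < 2

# Assumption: only non-numeric inputs are treated as categorical.
is.categorical <- !vapply(dat, is.numeric, logical(1))
too.many.categories <- is.categorical & num.unique > max.input.categories

# Keep the inputs that pass both checks.
kept.inputs <- names(dat)[!(lacks.contrast | too.many.categories)]
print(kept.inputs)
```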

max.input.categories

Limits the maximum number of categories that an input variable may have and still be employed in the formula; as described under reduce, inputs that exceed this limit are excluded. The default is 20, but users can change it at their convenience.

max.outcome.categories.to.search

A numeric value. The create.formula function includes a feature that identifies input variables exhibiting a lack of contrast. When reduce = TRUE, these variables are automatically excluded from the resulting formula. This search may be expanded to subsets of the outcome when the number of unique measured values of the outcome is no greater than max.outcome.categories.to.search. In this case, each subset of the outcome will be separately examined, and any inputs that exhibit a lack of contrast within at least one subset will be excluded.

order.as

The user can specify the order of the input variables in the formula in a variety of ways for patterns: increasing for increasing alphabetical order, decreasing for decreasing alphabetical order, column.order for the order in which they appear in the data, and as.specified for maintaining the user's specified order.

include.backtick

Adds backticks where needed. The default is 'as.needed', which adds backticks only when they are needed. The other option is 'all'. The use of include.backtick = "all" is limited to cases in which the output is generated as a character variable. When the output is generated as a formula object, R automatically removes all unnecessary backticks. That is, include.backtick = "all" is only compatible with format.as != "formula".
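The behaviour of formula objects mentioned above can be seen in base R alone. The sketch below uses a hypothetical variable name containing a space and does not call create.formula.

```r
# A formula written as text with backticks around every name.
f.text <- "`y` ~ `x` + `pixel 1`"

# Converting to a formula object and back to text drops the backticks
# that are not required; only the name containing a space keeps them.
f <- as.formula(f.text)
f.roundtrip <- deparse(f)
print(f.roundtrip)
```

The character version keeps every backtick exactly as written, which is why an "all" style of quoting can only survive in character output.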

format.as

The data type of the output. If not set as "formula", then a character vector will be returned.

variables.to.exclude

A character vector. Any variable specified in variables.to.exclude will be dropped from the formula, both in the individual inputs and in any associated interactions. This step supersedes the inclusion of any variables specified for inclusion in the other parameters.

include.intercept

A logical value. When FALSE, the intercept will be removed from the formula.
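How create.formula removes the intercept internally is not stated on this page. In R's formula syntax generally, an intercept is dropped by adding - 1 (or + 0) to the right-hand side, as this small base-R sketch shows:

```r
# Two formulas for the same model, with and without an intercept.
with.intercept <- y ~ x
without.intercept <- y ~ x - 1

# terms() records whether an intercept is present (1) or absent (0).
has.intercept.a <- attr(terms(with.intercept), "intercept")
has.intercept.b <- attr(terms(without.intercept), "intercept")
print(c(has.intercept.a, has.intercept.b))
```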

Details

The data type of the output is determined by format.as. If it is not set to "formula", then a character vector will be returned. The input.names and the names of variables matching the input.patterns will be concatenated to form the full list of input variables.

Examples

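 # Simulate a small data set in which two inputs share the "pixel" stem.
 n <- 10
 dd <- data.table::data.table(w = rnorm(n = n), x = rnorm(n = n), pixel_1 = rnorm(n = n))
 dd[, pixel_2 := 0.3 * pixel_1 + rnorm(n)]
 dd[, y := 5 * x + 3 * pixel_1 + 2 * pixel_2 + rnorm(n)]

 # Build a formula for y from the named input x plus every column matching a pattern.
 create.formula(outcome.name = "y", input.names = "x", input.patterns = c("pi", "xel"), dat = dd)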

[Package formulaic version 0.0.8 Index]