R: CivisML Random Forest Classifier

civis_ml_random_forest_classifier {civis}

R Documentation

CivisML Random Forest Classifier

Description

Train a random forest classifier on Civis Platform using CivisML.

Usage

civis_ml_random_forest_classifier(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  n_estimators = 500,
  criterion = c("gini", "entropy"),
  max_depth = NULL,
  min_samples_split = 2,
  min_samples_leaf = 1,
  min_weight_fraction_leaf = 0,
  max_features = "sqrt",
  max_leaf_nodes = NULL,
  min_impurity_split = 1e-07,
  bootstrap = TRUE,
  random_state = 42,
  class_weight = NULL,
  fit_params = NULL,
  cross_validation_parameters = NULL,
  calibration = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)

Arguments

`x`	See the Data Sources section below.
`dependent_variable`	The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
`primary_key`	Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In `predict.civis_ml`, the default `primary_key = NA` reuses the primary key of the training task. Use `primary_key = NULL` to explicitly indicate that the data have no primary key.
`excluded_columns`	Optional, a vector of columns which will be considered ineligible to be independent variables.
`n_estimators`	The number of trees in the forest.
`criterion`	The function to measure the quality of a split. Supported criteria are `gini` for the Gini impurity and `entropy` for the information gain.
`max_depth`	The maximum depth of each tree. The maximum depth limits the number of nodes in the tree. If `NULL`, nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples.
`min_samples_split`	The minimum number of samples required to split an internal node. If an integer, then `min_samples_split` is the minimum number. If a float, then `min_samples_split` is a fraction and `ceiling(min_samples_split * n_samples)` is the minimum number of samples for each split.
`min_samples_leaf`	The minimum number of samples required to be in a leaf node. If an integer, then `min_samples_leaf` is the minimum number. If a float, then `min_samples_leaf` is a fraction and `ceiling(min_samples_leaf * n_samples)` is the minimum number of samples for each leaf node.
`min_weight_fraction_leaf`	The minimum weighted fraction of the sum total of weights required to be at a leaf node.
`max_features`	The number of features to consider when looking for the best split. If an integer, consider `max_features` features at each split. If a float, `max_features` is a fraction and `max_features * n_features` features are considered at each split. If `"auto"` or `"sqrt"`, then `max_features = sqrt(n_features)`. If `"log2"`, then `max_features = log2(n_features)`. If `NULL`, then `max_features = n_features`.
`max_leaf_nodes`	Grow trees with `max_leaf_nodes` in best-first fashion. Best nodes are defined by their relative reduction in impurity. If `max_leaf_nodes = NULL`, the number of leaf nodes is unlimited.
`min_impurity_split`	Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold; otherwise it is a leaf.
`bootstrap`	Whether bootstrap samples are used when building trees.
`random_state`	The seed of the random number generator.
`class_weight`	A `list` with `class_label = value` pairs, or `"balanced"`. When `class_weight = "balanced"`, the class weights will be inversely proportional to the class frequencies in the input data: `n_samples / (n_classes * table(y))`. Note that the class weights are multiplied with `sample_weight` (passed via `fit_params`) if `sample_weight` is specified.
`fit_params`	Optional, a mapping from parameter names in the model's `fit` method to the column names which hold the data, e.g. `list(sample_weight = 'survey_weight_column')`.
`cross_validation_parameters`	Optional, parameter grid for learner parameters, e.g. `list(n_estimators = c(100, 200, 500), max_depth = c(2, 3))`, or `"hyperband"` for supported models.
`calibration`	Optional, if not `NULL`, calibrate output probabilities with the selected method, either `sigmoid` or `isotonic`. Valid only with classification models.
`oos_scores_table`	Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".
`oos_scores_db`	Optional, the name of the database where the `oos_scores_table` will be created. If not provided, this will default to `database_name`.
`oos_scores_if_exists`	Optional, action to take if `oos_scores_table` already exists. One of `"fail"`, `"append"`, `"drop"`, or `"truncate"`. The default is `"fail"`.
`model_name`	Optional, the prefix of the Platform modeling jobs. It will have `" Train"` or `" Predict"` added to become the Script title.
`cpu_requested`	Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.
`memory_requested`	Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.
`disk_requested`	Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
`notifications`	Optional, model status notifications. See `scripts_post_custom` for further documentation about email and URL notification.
`polling_interval`	The number of seconds to wait between checks for job completion.
`verbose`	Optional, if `TRUE`, supply debug outputs in Platform logs and make prediction child jobs visible.
`civisml_version`	Optional, a length-one character vector giving the CivisML version. The default is `"prod"`, the latest version in production.
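
Several of the arguments above are described by formulas. The following is a minimal sketch of that arithmetic in plain R; it makes no CivisML calls, and the imbalanced subset of `iris` and the value `0.04` are assumptions chosen only for illustration.

```r
# Illustration only: plain R arithmetic, no CivisML calls are made.
# Assumption: an imbalanced subset of iris stands in for the training data.
df <- iris[1:120, ]
y <- df$Species
n_samples <- nrow(df)        # 120 rows
n_features <- ncol(df) - 1   # 4 predictors; Species is the dependent variable

# A fractional min_samples_split resolves to a sample count.
min_samples_split <- 0.04
split_count <- ceiling(min_samples_split * n_samples)   # ceiling(4.8) = 5

# max_features = "sqrt" and "log2" resolve to feature counts.
features_sqrt <- sqrt(n_features)   # 2
features_log2 <- log2(n_features)   # 2

# class_weight = "balanced": n_samples / (n_classes * table(y)).
n_classes <- nlevels(y)
balanced <- n_samples / (n_classes * table(y))
print(balanced)   # setosa 0.8, versicolor 0.8, virginica 2.0
```

The rarest class (`virginica`, 20 of 120 rows) receives the largest weight, which is the intended effect of `"balanced"`.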

Value

A civis_ml object, a list containing the following elements:

`job`	job metadata from `scripts_get_custom`.
`run`	run metadata from `scripts_get_custom_runs`.
`outputs`	CivisML metadata from `scripts_list_custom_runs_outputs` containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.
`metrics`	Parsed CivisML output from `metrics.json` containing metadata from validation. A list containing the following elements: `run`, a list of metadata about the run; `data`, a list of metadata about the training data; `model`, a list describing the fitted scikit-learn model with CV results; `metrics`, a list of validation metrics (accuracy, confusion, ROC, AUC, etc.); `warnings`, a list; `data_platform`, a list giving the training data location.
`model_info`	Parsed CivisML output from `model_info.json` containing metadata from training. A list containing the following elements: `run`, a list of metadata about the run; `data`, a list of metadata about the training data; `model`, a list describing the fitted scikit-learn model; `metrics`, an empty list; `warnings`, a list; `data_platform`, a list giving the training data location.
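
The elements listed above can be read with ordinary list indexing once training finishes. The sketch below is guarded so that it only trains when the civis package is installed and a `CIVIS_API_KEY` environment variable is set; the `n_estimators = 100` value is an arbitrary choice for a quick run, and the exact contents of each element may differ from what is shown in the comments.

```r
# Element names taken from the Value section above.
top_level <- c("job", "run", "outputs", "metrics", "model_info")
metric_parts <- c("run", "data", "model", "metrics", "warnings", "data_platform")

# Only contact Civis Platform when the package and credentials are available.
can_run <- requireNamespace("civis", quietly = TRUE) &&
  nzchar(Sys.getenv("CIVIS_API_KEY"))

if (can_run) {
  df <- iris
  names(df) <- gsub("\\.", "_", names(df))
  m <- civis::civis_ml_random_forest_classifier(df,
    dependent_variable = "Species",
    n_estimators = 100)

  # Which documented elements are present on the returned object?
  print(intersect(top_level, names(m)))
  print(intersect(metric_parts, names(m$metrics)))

  # Validation metrics live in the nested `metrics` list.
  str(m$metrics$metrics, max.level = 1)
}
```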

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame: civis_ml(x = df, ...)
local CSV file: civis_ml(x = "path/to/data.csv", ...)
file in Civis Platform: civis_ml(x = civis_file(1234))
table in Civis Platform: civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
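
As a hedged illustration of the local CSV data source, the sketch below writes a data.frame to a temporary file and passes the path as `x`. The temporary file name is an assumption made for the example, and the training call is skipped unless the civis package is installed and `CIVIS_API_KEY` is set.

```r
# Prepare a local CSV; the file name under tempdir() is illustrative only.
df <- iris
names(df) <- gsub("\\.", "_", names(df))
csv_path <- file.path(tempdir(), "iris_training.csv")
write.csv(df, csv_path, row.names = FALSE)

# Only contact Civis Platform when the package and credentials are available.
can_run <- requireNamespace("civis", quietly = TRUE) &&
  nzchar(Sys.getenv("CIVIS_API_KEY"))

if (can_run) {
  # A character path is treated as a local file; a data.frame could be
  # passed in the same position instead.
  m <- civis::civis_ml_random_forest_classifier(csv_path,
    dependent_variable = "Species",
    n_estimators = 100)
}
```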

Examples

## Not run: 
library(civis)

# iris column names contain dots; replace them with underscores.
df <- iris
names(df) <- gsub("\\.", "_", names(df))

# Train with fixed hyperparameters and fetch the out-of-sample scores.
m <- civis_ml_random_forest_classifier(df,
  dependent_variable = "Species",
  n_estimators = 100,
  max_depth = 5,
  max_features = NULL)
yhat <- fetch_oos_scores(m)

# Grid search over n_estimators and max_depth.
cv_params <- list(
  n_estimators = c(100, 200, 500),
  max_depth = c(2, 3))

m <- civis_ml_random_forest_classifier(df,
  dependent_variable = "Species",
  max_features = NULL,
  cross_validation_parameters = cv_params)

# Score a table in Civis Platform and write the predictions to a table.
pred_info <- predict(m, civis_table("schema.table", "my_database"),
  output_table = "schema.scores_table")

## End(Not run)
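
A further sketch, not part of the package's own examples, shows how `class_weight = "balanced"` might be supplied alongside the other arguments. The argument values are assumptions chosen for illustration, and the training call is skipped unless the civis package is installed and `CIVIS_API_KEY` is set.

```r
df <- iris
names(df) <- gsub("\\.", "_", names(df))

# Collect the arguments in a list so they can be inspected before training.
rf_args <- list(
  x = df,
  dependent_variable = "Species",
  n_estimators = 100,
  class_weight = "balanced")

# Only contact Civis Platform when the package and credentials are available.
can_run <- requireNamespace("civis", quietly = TRUE) &&
  nzchar(Sys.getenv("CIVIS_API_KEY"))

if (can_run) {
  m <- do.call(civis::civis_ml_random_forest_classifier, rf_args)
  # Out-of-sample scores from training, as in the examples above.
  yhat <- civis::fetch_oos_scores(m)
}
```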

[Package civis version 3.1.2 Index]