civis_ml_random_forest_classifier {civis} R Documentation

## CivisML Random Forest Classifier

### Description

CivisML Random Forest Classifier

### Usage

civis_ml_random_forest_classifier(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  n_estimators = 500,
  criterion = c("gini", "entropy"),
  max_depth = NULL,
  min_samples_split = 2,
  min_samples_leaf = 1,
  min_weight_fraction_leaf = 0,
  max_features = "sqrt",
  max_leaf_nodes = NULL,
  min_impurity_split = 1e-07,
  bootstrap = TRUE,
  random_state = 42,
  class_weight = NULL,
  fit_params = NULL,
  cross_validation_parameters = NULL,
  calibration = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)


### Arguments

- x: See the Data Sources section below.
- dependent_variable: The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
- primary_key: Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the default primary_key = NA reuses the primary_key of the training task. Use primary_key = NULL to explicitly indicate the data have no primary_key.
- excluded_columns: Optional, a vector of columns which will be considered ineligible to be independent variables.
- n_estimators: The number of trees in the forest.
- criterion: The function to measure the quality of a split. Supported criteria are gini for the Gini impurity and entropy for the information gain.
- max_depth: Maximum depth of the individual trees. The maximum depth limits the number of nodes in each tree. Tune this parameter for best performance. The best value depends on the interaction of the input variables.
- min_samples_split: The minimum number of samples required to split an internal node. If an integer, min_samples_split is the minimum number. If a float, min_samples_split is a fraction of the training samples, and ceiling(min_samples_split * n_samples) is the minimum number of samples for each split.
- min_samples_leaf: The minimum number of samples required to be in a leaf node. If an integer, min_samples_leaf is the minimum number. If a float, min_samples_leaf is a fraction of the training samples, and ceiling(min_samples_leaf * n_samples) is the minimum number of samples for each leaf node.
- min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- max_features: The number of features to consider when looking for the best split.
  - integer: consider max_features at each split.
  - float: max_features is a fraction, and max_features * n_features features are considered at each split.
  - auto: max_features = sqrt(n_features)
  - sqrt: max_features = sqrt(n_features)
  - log2: max_features = log2(n_features)
  - NULL: max_features = n_features
- max_leaf_nodes: Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If max_leaf_nodes = NULL, the number of leaf nodes is unlimited.
- min_impurity_split: Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold; otherwise it is a leaf.
- bootstrap: Whether bootstrap samples are used when building trees.
- random_state: The seed of the random number generator.
- class_weight: A list with class_label = value pairs, or balanced. When class_weight = "balanced", the class weights will be inversely proportional to the class frequencies in the input data as:

  $$\frac{n_{\text{samples}}}{n_{\text{classes}} \times \text{table}(y)}$$

  Note that the class weights are multiplied with sample_weight (passed via fit_params) if sample_weight is specified.
- fit_params: Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').
- cross_validation_parameters: Optional, parameter grid for learner parameters, e.g. list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3)) or "hyperband" for supported models.
- calibration: Optional, if not NULL, calibrate output probabilities with the selected method, either sigmoid or isotonic. Valid only with classification models.
- oos_scores_table: Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".
- oos_scores_db: Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.
- oos_scores_if_exists: Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".
- model_name: Optional, the prefix of the Platform modeling jobs. It will have " Train" or " Predict" added to become the Script title.
- cpu_requested: Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.
- memory_requested: Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.
- disk_requested: Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
- notifications: Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notification.
- polling_interval: The number of seconds to wait between checks for job completion.
- verbose: Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.
- civisml_version: Optional, a one-length character vector of the CivisML version. The default is "prod", the latest version in production.
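
Two of the arguments above are defined by small formulas: the "balanced" setting of class_weight and the fractional form of min_samples_split (and min_samples_leaf). The following base R sketch works through both by hand. The imbalanced 110-row subset of iris, the 0.05 fraction, and all variable names are illustrative assumptions; none of this calls CivisML itself.

```r
# Illustrative training labels: an imbalanced subset of iris
# (50 setosa, 50 versicolor, 10 virginica). This subset is an assumption
# chosen only to make the weights differ between classes.
y <- iris$Species[1:110]

n_samples <- length(y)   # 110
n_classes <- nlevels(y)  # 3

# class_weight = "balanced": n_samples / (n_classes * table(y))
# Rare classes receive larger weights than common ones.
balanced_weights <- n_samples / (n_classes * table(y))
print(round(balanced_weights, 3))

# A fractional min_samples_split is converted to a count with ceiling().
# The value 0.05 is a hypothetical setting.
min_samples_split <- 0.05
min_split_count <- ceiling(min_samples_split * n_samples)
print(min_split_count)  # 6
```

Here the two 50-row classes each get a weight of about 0.733, while the 10-row class gets about 3.667.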

### Value

A civis_ml object, a list containing the following elements:

- job: job metadata from scripts_get_custom.
- run: run metadata from scripts_get_custom_runs.
- outputs: CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.
- metrics: Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:
  - run: list, metadata about the run.
  - data: list, metadata about the training data.
  - model: list, the fitted scikit-learn model with CV results.
  - metrics: list, validation metrics (accuracy, confusion, ROC, AUC, etc).
  - warnings: list.
  - data_platform: list, training data location.
- model_info: Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:
  - run: list, metadata about the run.
  - data: list, metadata about the training data.
  - model: list, the fitted scikit-learn model.
  - metrics: empty list.
  - warnings: list.
  - data_platform: list, training data location.

### Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame

civis_ml(x = df, ...)

local csv file

civis_ml(x = "path/to/data.csv", ...)

file in Civis Platform

civis_ml(x = civis_file(1234))

table in Civis Platform

civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
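
As a minimal sketch of the local CSV route, the base R code below writes a data.frame to a temporary CSV file whose path could then be supplied as x. The file name, the use of tempdir(), and the choice of row.names = FALSE are assumptions made for illustration; the modeling call is left commented out because it requires Civis Platform credentials.

```r
# Prepare a small data.frame with underscore-separated column names.
df <- iris
names(df) <- gsub("\\.", "_", names(df))

# Write it to a temporary CSV file (hypothetical file name).
# row.names = FALSE avoids adding an unnamed index column (an assumption
# about what is convenient here, not a documented requirement).
csv_path <- file.path(tempdir(), "iris_training.csv")
write.csv(df, csv_path, row.names = FALSE)

# The path would then be passed as the data source, for example:
# m <- civis_ml_random_forest_classifier(x = csv_path,
#   dependent_variable = "Species")
```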

### Examples

## Not run:
df <- iris
names(df) <- gsub("\\.", "_", names(df))

m <- civis_ml_random_forest_classifier(df,
  dependent_variable = "Species",
  n_estimators = 100,
  max_depth = 5,
  max_features = NULL)
yhat <- fetch_oos_scores(m)

# Grid Search
cv_params <- list(
  n_estimators = c(100, 200, 500),
  max_depth = c(2, 3))

m <- civis_ml_random_forest_classifier(df,
  dependent_variable = "Species",
  max_features = NULL,
  cross_validation_parameters = cv_params)

pred_info <- predict(m, civis_table("schema.table", "my_database"),
  output_table = "schema.scores_table")

## End(Not run)
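
To get a feel for the size of the grid search above, the candidate settings can be enumerated locally with base R. This sketch assumes the grid search evaluates every combination of the supplied values, which is the usual meaning of a parameter grid but is not spelled out on this page; the variable name grid is illustrative.

```r
# Same parameter grid as in the grid search example.
cv_params <- list(
  n_estimators = c(100, 200, 500),
  max_depth = c(2, 3))

# expand.grid() builds one row per combination of the listed values.
grid <- expand.grid(cv_params)
print(grid)
print(nrow(grid))  # 6 candidate settings (3 values x 2 values)
```

Adding another parameter or more values multiplies the number of rows, so a quick check like this can help when sizing a training job.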


[Package civis version 3.0.0 Index]