civis_ml_sparse_ridge_regressor {civis}    R Documentation

CivisML Sparse Ridge Regression

Description

CivisML Sparse Ridge Regression

Usage

civis_ml_sparse_ridge_regressor(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  alpha = 1,
  fit_intercept = TRUE,
  normalize = FALSE,
  max_iter = NULL,
  tol = 0.001,
  solver = c("auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag"),
  random_state = 42,
  fit_params = NULL,
  cross_validation_parameters = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)

Arguments

x

See the Data Sources section below.

dependent_variable

The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
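
For example, a minimal sketch (assuming a local data.frame df with numeric target columns y1 and y2; all names here are placeholders):

  # single target
  m <- civis_ml_sparse_ridge_regressor(df, dependent_variable = "y1")
  # multi-target problem: pass a vector of column names
  m <- civis_ml_sparse_ridge_regressor(df, dependent_variable = c("y1", "y2"))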

primary_key

Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the primary_key of the training task is used by default (primary_key = NA). Use primary_key = NULL to explicitly indicate the data have no primary key.

excluded_columns

Optional, a vector of columns which will be considered ineligible to be independent variables.

alpha

The regularization strength; must be a single float or a vector of floats of length n_targets. Larger values specify stronger regularization.

fit_intercept

Whether an intercept term should be included in the model. If FALSE, no intercept is used; in that case the data are expected to already be centered.

normalize

If TRUE, the regressors will be normalized before fitting the model. normalize is ignored when fit_intercept = FALSE.

max_iter

Maximum number of iterations for the conjugate gradient solver. For the sparse_cg and lsqr solvers, the default value is determined by the underlying solver; for the sag solver, the default is 1000.

tol

Precision of the solution.

solver

Solver to use for the optimization problem.

auto

chooses the solver automatically based on the type of data.

svd

uses Singular Value Decomposition of X to compute the Ridge coefficients. More stable for singular matrices than cholesky.

cholesky

uses a standard Cholesky decomposition to obtain a closed-form solution.

sparse_cg

uses the conjugate gradient solver. As an iterative algorithm, this solver is more appropriate than cholesky for large-scale data.

lsqr

uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr, an iterative procedure.

sag

uses Stochastic Average Gradient descent. It also uses an iterative procedure, and is often faster than other solvers when both n_samples and n_features are large. Note that the fast convergence of sag is only guaranteed on features with approximately the same scale.
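
For illustration, a sketch of selecting a solver explicitly (assuming a local data.frame df with a numeric target y; names are placeholders):

  # iterative sag solver with a fixed seed and a larger iteration budget
  m <- civis_ml_sparse_ridge_regressor(df, dependent_variable = "y",
    solver = "sag", random_state = 7, max_iter = 2000)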

random_state

The seed of the pseudo random number generator to use when shuffling the data. Used only when solver = "sag".

fit_params

Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').
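
A sketch of passing sample weights via fit_params (assuming df contains a column named survey_weight_column, as in the mapping above):

  m <- civis_ml_sparse_ridge_regressor(df, dependent_variable = "y",
    fit_params = list(sample_weight = "survey_weight_column"))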

cross_validation_parameters

Optional, parameter grid for learner parameters, e.g. list(alpha = c(0.001, 0.01, 0.1, 1)), or "hyperband" for supported models.

oos_scores_table

Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".

oos_scores_db

Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.

oos_scores_if_exists

Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".
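
A sketch of writing out-of-sample scores back to Redshift (table, database, and key names are placeholders):

  m <- civis_ml_sparse_ridge_regressor(
    civis_table("schema.table", "my_database"),
    dependent_variable = "y",
    primary_key = "row_id",
    oos_scores_table = "schema.oos_scores",
    oos_scores_db = "my_database",
    oos_scores_if_exists = "append")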

model_name

Optional, the prefix of the Platform modeling jobs. It will have " Train" or " Predict" added to become the Script title.

cpu_requested

Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.

memory_requested

Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.

disk_requested

Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.

notifications

Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notification.

polling_interval

The number of seconds to wait between checks for job completion.

verbose

Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.

civisml_version

Optional, a one-length character vector of the CivisML version. The default is "prod", the latest version in production.

Value

A civis_ml object, a list containing the following elements:

job

job metadata from scripts_get_custom.

run

run metadata from scripts_get_custom_runs.

outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

  • run list, metadata about the run.

  • data list, metadata about the training data.

  • model list, the fitted scikit-learn model with CV results.

  • metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc.).

  • warnings list.

  • data_platform list, training data location.

model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

  • run list, metadata about the run.

  • data list, metadata about the training data.

  • model list, the fitted scikit-learn model.

  • metrics empty list.

  • warnings list.

  • data_platform list, training data location.
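
The elements above can be inspected directly from a fitted model object; a sketch, assuming m is the result of a successful training run:

  str(m$metrics$metrics)    # parsed validation metrics from metrics.json
  str(m$model_info$model)   # the fitted model metadata from model_info.json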

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame

civis_ml(x = df, ...)

local csv file

civis_ml(x = "path/to/data.csv", ...)

file in Civis Platform

civis_ml(x = civis_file(1234))

table in Civis Platform

civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
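
Any of these helpers also works as x for civis_ml_sparse_ridge_regressor; for example, a sketch using a Platform table (table and column names are placeholders):

  m <- civis_ml_sparse_ridge_regressor(
    civis_table(table_name = "schema.table", database_name = "database"),
    dependent_variable = "y")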

Examples

## Not run: 
 data(ChickWeight)
 m <- civis_ml_sparse_ridge_regressor(ChickWeight, dependent_variable = "weight", alpha = 999)
 yhat <- fetch_oos_scores(m)

 # Grid search
 cv_params <- list(alpha = c(.001, .01, .1, 1))
 m <- civis_ml_sparse_ridge_regressor(ChickWeight,
   dependent_variable = "weight",
   cross_validation_parameters = cv_params)

 # make a prediction job, storing the scores in a Redshift table
 pred_info <- predict(m, newdata = civis_table("schema.table", "my_database"),
   output_table = "schema.scores_table")
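
 # Sketch: request more Platform resources for a larger training table
 # (table, database, and resource values below are placeholders)
 m_big <- civis_ml_sparse_ridge_regressor(
   civis_table("schema.big_table", "my_database"),
   dependent_variable = "weight",
   cpu_requested = 2048,      # 1024 shares = 1 CPU
   memory_requested = 8192,   # MiB
   disk_requested = 10)       # GB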

## End(Not run)

[Package civis version 3.1.2 Index]