The CBDA.pipeline() function takes all the input specifications needed to run a set of M subsamples drawn from the Big Data [Xtemp, Ytemp]. The Big Data is assumed to be already clean and harmonized. Version 1.0.0 is fully tested ONLY on continuous features Xtemp and a binary outcome Ytemp.
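Since the function is tested only on continuous features and a binary outcome, it may help to check the inputs before launching a long run. The helper below is a minimal sketch in base R; `check_cbda_inputs` is a hypothetical name and is not part of the CBDA package, and the specific checks are assumptions about what "clean" means here.

```r
# Hypothetical pre-flight check (NOT part of CBDA): returns a character
# vector describing any problems found, or character(0) if none.
check_cbda_inputs <- function(Xtemp, Ytemp) {
  problems <- character(0)
  if (!is.matrix(Xtemp) || !is.numeric(Xtemp)) {
    problems <- c(problems, "Xtemp must be a numeric matrix")
  }
  if (length(Ytemp) != NROW(Xtemp)) {
    problems <- c(problems, "Ytemp must have one entry per row of Xtemp")
  }
  if (anyNA(Xtemp) || anyNA(Ytemp)) {
    problems <- c(problems, "missing values found; data are assumed clean")
  }
  if (length(unique(stats::na.omit(Ytemp))) != 2) {
    problems <- c(problems, "Ytemp must take exactly two distinct values")
  }
  problems
}

# Small synthetic example: 10 cases, 5 continuous features, binary outcome.
X <- matrix(rnorm(50), nrow = 10)
Y <- rep(c(0, 1), times = 5)
ok_report  <- check_cbda_inputs(X, Y)           # no problems expected
bad_report <- check_cbda_inputs(X, rep(1, 10))  # outcome is constant
```

An empty result means the inputs pass these checks; it does not guarantee that the pipeline will run successfully.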

CBDA.pipeline(job_id, Ytemp, Xtemp, label = "CBDA_package_test", alpha = 0.2, Kcol_min = 5, Kcol_max = 15, Nrow_min = 30, Nrow_max = 50, misValperc = 0, M = 3000, N_cores = 1, top = 1000, workspace_directory = setwd(tempdir()), max_covs = 100, min_covs = 5, algorithm_list = c("SL.glm", "SL.xgboost", "SL.glmnet", "SL.svm", "SL.randomForest", "SL.bartMachine"))
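As an illustration of the usage above, the following sketch builds a small synthetic data set and an argument list for a short run. The data, the `job_id` value, the reduced `M`, `top`, and `max_covs` values, and the shortened `algorithm_list` are all assumptions chosen for illustration and have not been tested against the package. The checks mirror the constraints stated in the argument descriptions (`top` below `M`, `min_covs` above 2); the ordering checks on the ranges and on `min_covs` versus `max_covs` are assumed to be sensible rather than documented. The call is guarded by a hypothetical `run_cbda` flag so that nothing heavy runs by default.

```r
set.seed(1)

# Synthetic Big Data: continuous features and a binary outcome.
n <- 300
p <- 100
Xtemp <- matrix(rnorm(n * p), nrow = n, ncol = p)
Ytemp <- rbinom(n, size = 1, prob = plogis(Xtemp[, 1] - Xtemp[, 2]))

# Argument list for a short illustrative run (values are assumptions).
cbda_args <- list(
  job_id = 1, Ytemp = Ytemp, Xtemp = Xtemp, label = "CBDA_sketch",
  alpha = 0.2, Kcol_min = 5, Kcol_max = 15, Nrow_min = 30, Nrow_max = 50,
  M = 20, top = 10, N_cores = 1, max_covs = 10, min_covs = 5,
  algorithm_list = c("SL.glm", "SL.glmnet")
)

# Constraints taken from, or assumed from, the argument descriptions.
stopifnot(
  cbda_args$top < cbda_args$M,
  cbda_args$min_covs > 2,
  cbda_args$min_covs <= cbda_args$max_covs,
  cbda_args$Kcol_min <= cbda_args$Kcol_max,
  cbda_args$Nrow_min <= cbda_args$Nrow_max
)

# Hypothetical switch: set to TRUE to attempt the call when CBDA is installed.
run_cbda <- FALSE
if (run_cbda && requireNamespace("CBDA", quietly = TRUE)) {
  cbda_result <- do.call(CBDA::CBDA.pipeline, cbda_args)
}
```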

`job_id` |
This is the ID for the job generator in the LONI pipeline interface |

`Ytemp` |
This is the output variable (vector) in the original Big Data |

`Xtemp` |
This is the input variable (matrix) in the original Big Data |

`label` |
This is the label appended to RData workspaces generated within the CBDA calls |

`alpha` |
Fraction of the Big Data to hold out for validation (the default 0.2 corresponds to 20%) |

`Kcol_min` |
Lower bound for the percentage of features (columns) to sample (used for the Feature Sampling Range - FSR) |

`Kcol_max` |
Upper bound for the percentage of features (columns) to sample (used for the Feature Sampling Range - FSR) |

`Nrow_min` |
Lower bound for the percentage of cases (rows) to sample (used for the Case Sampling Range - CSR) |

`Nrow_max` |
Upper bound for the percentage of cases (rows) to sample (used for the Case Sampling Range - CSR) |

`misValperc` |
Percentage of missing values to introduce in the Big Data (used only for testing, to mimic real cases) |

`M` |
Number of Big Data subsets on which to perform Knockoff Filtering and SuperLearner feature mining |

`N_cores` |
Number of cores to use in the parallel implementation (default is 1 core) |

`top` |
Number of top predictions to select out of the M (must be < M; optimal is about 0.1*M) |

`workspace_directory` |
Directory where the results and workspaces are saved (set by default to tempdir()) |

`max_covs` |
Number of top features to display and include in the Validation Step, where nested models are tested |

`min_covs` |
Minimum number of top features to include in the initial model for the Validation Step (must be greater than 2) |

`algorithm_list` |
List of algorithms/wrappers used by the SuperLearner. By default it is set to c("SL.glm", "SL.xgboost", "SL.glmnet", "SL.svm", "SL.randomForest", "SL.bartMachine") |
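To make the roles of `alpha`, the Feature Sampling Range, and the Case Sampling Range more concrete, the sketch below draws a single subsample in base R. It is an illustration under stated assumptions, not the package's internal code: it assumes `alpha` is a fraction, that the `Kcol_*` and `Nrow_*` bounds are percentages, that each subsample draws its percentages uniformly within those bounds, and that cases are sampled only from the rows not held out for validation.

```r
set.seed(42)

# Synthetic Big Data used only for this illustration.
n <- 1000
p <- 200
Xtemp <- matrix(rnorm(n * p), nrow = n, ncol = p)
Ytemp <- rbinom(n, size = 1, prob = 0.5)

alpha <- 0.2                    # assumed to be a fraction held out
Kcol_min <- 5;  Kcol_max <- 15  # FSR bounds, in percent of columns
Nrow_min <- 30; Nrow_max <- 50  # CSR bounds, in percent of rows

# Hold out a fraction alpha of the cases for validation.
val_rows   <- sample(n, size = round(alpha * n))
train_rows <- setdiff(seq_len(n), val_rows)

# Draw one sampling percentage for features and one for cases (assumed uniform).
k_pct <- runif(1, min = Kcol_min, max = Kcol_max)
n_pct <- runif(1, min = Nrow_min, max = Nrow_max)

# Convert the percentages to counts and sample without replacement.
k <- max(1, round(p * k_pct / 100))
m <- max(1, round(length(train_rows) * n_pct / 100))
cols <- sample(p, size = k)
rows <- sample(train_rows, size = m)

# One subsample of the kind that feature mining would be run on.
X_sub <- Xtemp[rows, cols, drop = FALSE]
Y_sub <- Ytemp[rows]
```

Under these assumptions, repeating the draw M times would yield the M subsets referred to in the description of `M`.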

CBDA object with validation results and 3 RData workspaces

