splits_selection {APML} | R Documentation |

Split dataset into training data and testing data and select variables based on relative importance.

splits_selection(data,split_ratio,split_seed, feature_model,imbalance,nfolds, RAN_type,RAN.seed,smote.seed, xcol_enter,distribution)

`data` |
A data.frame used to build models |

`split_ratio` |
A numeric value indicating the ratio of total rows contained in each split. Must less than 1 |

`split_seed` |
Random seed for splitting |

`feature_model` |
Name of model for feature selection. Currently, only allow "gbm" for gradient boosted tree, and "rf" for random forest |

`imbalance` |
Logical or "SMOTE"(for categorical response). True for balancing training data class counts via over/under-sampling when building the model. "SMOTE" for applying SMOTE and returning SMOTE training data. |

`nfolds` |
Number of folds for K-fold cross-validation. Default:5. |

`RAN_type` |
"both", "binominal" or "normal". "both" for generating both binominal and normal random terms for feature selection. "binominal" or "normal" only generate one specific type of random term. Categorical or continuous variables with relative importance greater than corresponding random term(s) will be selected. |

`RAN.seed` |
Random seed for random term(s) |

`smote.seed` |
Random seed for SMOTE. Only used if argument "imbalance"="SMOTE" |

`xcol_enter` |
A character vector of variables are required to enter the model, also called "forced entry". If xcol_enter contains all independent variables' names, it will not use random terms to select variables. |

`distribution` |
Distribution type. Must be one of: "AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom". Defaults to AUTO. |

This function applys a technique to use random term to select variables. We consider variables with relative importance greater than random term as truly important variables.

`importance` |
A data.frame containing the relative importance scores of selected variables. |

`train_data` |
Training dataset. If "imbalance"="SMOTE", it returns the SMOTE training set. |

`test_data` |
Testing dataset. |

`raw_traindata` |
Same training dataset. If "imbalance"="SMOTE", it returns the original training set before SMOTE. |

This function is based on h2o package. In order to run this function, we need to run h2o.init() before using this function. The response variable should be the first column.

library(survival) library(h2o) library(performanceEstimation) data("lung") attach(lung) data <- datatrans(lung,factor_dummy = 'dummy',rescale = TRUE) data <- data[,c(3,1,2,4:14)] h2o.init() selection <- splits_selection(data,imbalance = 'SMOTE') h2o.shutdown(prompt=FALSE) Sys.sleep(2)

[Package *APML* version 0.0.3 Index]