data_split {HTRX} | R Documentation |
Data split
Description
kfold_split splits data into k folds of equal size for use in cross-validation.
twofold_split splits data into two folds by sampling the training set.
Both stratified sampling and simple sampling are supported.
Details can be found in the functions do_cv and do_cumulative_htrx.
Usage
kfold_split(outcome, fold, method = "simple")
twofold_split(outcome, train_proportion = 0.5, method = "simple")
Arguments
outcome |
a vector of the variable (usually the outcome)
based on which the data is going to be stratified.
This only works when method = "stratified". |
fold |
a positive integer specifying how many folds the data should be split into. |
method |
the method to be used for data split, either "simple" (default) or "stratified". |
train_proportion |
a positive number between 0 and 1 giving
the proportion of the training dataset when splitting data into 2 folds.
By default, train_proportion = 0.5. |
Details
Stratified sampling works only when the outcome variable is binary (either 0 or 1).
It ensures that the outcome=0 observations and the outcome=1 observations are each spread almost evenly across the folds.
Simple sampling randomly splits the data into k folds.
Two-fold data split is used to select candidate models in Step 1 of HTRX or cumulative HTRX, while k-fold data split is used for the 10-fold cross-validation in Step 2, which selects the best model.
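The stratified scheme described above can be illustrated with a minimal base R sketch. This is not the HTRX implementation: the helper name stratified_kfold_sketch and its round-robin assignment are assumptions made for illustration only. It shuffles the indexes within each outcome class and deals them out across the folds, so the per-fold counts of outcome=0 and outcome=1 each differ by at most one.

```r
# Hypothetical helper (NOT part of HTRX): stratified k-fold split
# on a binary (0/1) outcome, returning a list of index vectors.
stratified_kfold_sketch <- function(outcome, k) {
  stopifnot(all(outcome %in% c(0, 1)), k >= 1)
  fold_id <- integer(length(outcome))
  for (value in c(0, 1)) {
    idx <- which(outcome == value)
    # shuffle within the stratum; sample.int avoids the
    # length-one pitfall of sample()
    idx <- idx[sample.int(length(idx))]
    # deal the shuffled indexes out to folds 1..k in turn
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  # one vector of observation indexes per fold
  unname(split(seq_along(outcome), factor(fold_id, levels = seq_len(k))))
}

set.seed(1)
outcome <- rbinom(200, 1, 0.2)
folds <- stratified_kfold_sketch(outcome, 10)
# count of outcome=1 and outcome=0 observations in each fold
ones_per_fold <- sapply(folds, function(i) sum(outcome[i] == 1))
zeros_per_fold <- sapply(folds, function(i) sum(outcome[i] == 0))
```

Simple sampling corresponds to dropping the loop over outcome classes and shuffling all indexes at once.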
Value
Both functions return a list containing the indexes of different folds.
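As a rough sketch of how such a list of indexes might be consumed, the loop below holds out one fold at a time and fits a model on the rest. It assumes that each list element holds the observation indexes of one fold. The folds list is built by hand as a stand-in for the return value so that the sketch runs without HTRX installed, and the predictor x, the logistic model and the Brier score are illustrative choices, not part of HTRX.

```r
set.seed(1)
n <- 200
dat <- data.frame(outcome = rbinom(n, 1, 0.2), x = rnorm(n))

# Stand-in for the returned value: a list of 10 index vectors
# (assumption: one vector of observation indexes per fold).
folds <- unname(split(sample.int(n), rep_len(1:10, n)))

cv_error <- numeric(length(folds))
for (i in seq_along(folds)) {
  test_idx <- folds[[i]]
  train_idx <- setdiff(seq_len(n), test_idx)
  # fit on the training folds only
  fit <- glm(outcome ~ x, family = binomial, data = dat[train_idx, ])
  # predicted probabilities for the held-out fold
  pred <- predict(fit, newdata = dat[test_idx, ], type = "response")
  # Brier score: mean squared error of the predicted probabilities
  cv_error[i] <- mean((pred - dat$outcome[test_idx])^2)
}
mean(cv_error)
```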
Examples
## create the binary outcome (20% prevalence)
outcome <- rbinom(200, 1, 0.2)
## simple sampling (10 folds)
kfold_split(outcome, 10)
## stratified sampling (10 folds)
kfold_split(outcome, 10, "stratified")
## stratified sampling (2 folds, with 50% training data)
twofold_split(outcome, 0.5, "stratified")