data_split {HTRX}    R Documentation

Data split

Description

kfold_split splits the data into k folds of equal size for use in cross-validation. twofold_split splits the data into two folds in order to sample the training set. Both simple sampling and stratified sampling are supported. See the functions do_cv and do_cumulative_htrx for details of how the splits are used.

Usage

kfold_split(outcome, fold, method = "simple")

twofold_split(outcome, train_proportion = 0.5, method = "simple")

Arguments

outcome

a vector of the variable (usually the outcome) on which the data are stratified. Stratification is applied only when method="stratified".

fold

a positive integer specifying how many folds the data should be split into.

method

the method to be used for data split, either "simple" (default) or "stratified".

train_proportion

a number between 0 and 1 giving the proportion of the data assigned to the training set when splitting the data into 2 folds. By default, train_proportion=0.5.

Details

Stratified sampling works only when the outcome variable is binary (either 0 or 1). It ensures that every fold contains almost the same number of observations with outcome=0 as every other fold, and likewise for outcome=1.
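To make the idea of stratification concrete, here is a minimal base-R sketch. It is not the HTRX implementation: the helper name stratified_folds_sketch and the round-robin assignment are assumptions made purely for illustration.

```r
# Hypothetical helper, NOT part of HTRX: stratified k-fold split for a 0/1 outcome.
stratified_folds_sketch <- function(outcome, k) {
  stopifnot(all(outcome %in% c(0, 1)))
  fold_id <- integer(length(outcome))
  for (value in c(0, 1)) {
    idx <- which(outcome == value)
    # Shuffle the stratum, then deal its members round-robin into the k folds.
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep(seq_len(k), length.out = length(idx))
  }
  # Return a list of k integer vectors holding the row indices of each fold.
  unname(split(seq_along(outcome), factor(fold_id, levels = seq_len(k))))
}

set.seed(1)
outcome <- rbinom(200, 1, 0.2)
folds <- stratified_folds_sketch(outcome, 10)

# Number of observations with outcome = 1 in each fold; under this sketch
# the counts differ by at most one between folds.
sapply(folds, function(i) sum(outcome[i] == 1))
```

Dealing each stratum out separately is what keeps the number of 0s and the number of 1s balanced across folds, which a purely random split does not guarantee when the outcome is rare.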

Simple sampling randomly splits the data into k folds.
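As a rough illustration of simple sampling, the following base-R sketch shuffles the row indices and cuts them into k groups. The helper name simple_folds_sketch is hypothetical, and the package's own code may differ.

```r
# Hypothetical helper, NOT part of HTRX: simple random k-fold split.
simple_folds_sketch <- function(n, k) {
  shuffled <- sample.int(n)                    # random permutation of 1..n
  fold_id <- rep(seq_len(k), length.out = n)   # labels 1, 2, ..., k, 1, 2, ...
  unname(split(shuffled, fold_id))             # list of k index vectors
}

set.seed(1)
folds <- simple_folds_sketch(200, 10)
lengths(folds)   # each of the 10 folds holds 20 indices
```

Because the outcome is ignored here, a rare outcome can end up unevenly spread across the folds.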

The two-fold data split is used to select candidate models in Step 1 of HTRX or cumulative HTRX, while the k-fold data split is used for the 10-fold cross-validation in Step 2, which aims to select the best model.
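The two-fold split mentioned above can be pictured with base R alone. This is only a sketch under stated assumptions: the helper twofold_sketch and the element names train and test are hypothetical and are not taken from HTRX.

```r
# Hypothetical helper, NOT part of HTRX: simple two-fold split.
twofold_sketch <- function(n, train_proportion = 0.5) {
  stopifnot(train_proportion > 0, train_proportion < 1)
  n_train <- round(n * train_proportion)       # size of the training fold
  train <- sort(sample.int(n, n_train))        # sampled without replacement
  list(train = train, test = setdiff(seq_len(n), train))
}

set.seed(1)
parts <- twofold_sketch(200, train_proportion = 0.5)
lengths(parts)   # 100 training indices and 100 remaining indices
```

Calling such a helper again with a different seed draws a different training set from the same data.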

Value

Both functions return a list containing the indices of the observations in each fold.

Examples

## create the binary outcome (20% prevalence)
outcome <- rbinom(200, 1, 0.2)

## simple sampling (10 folds)
kfold_split(outcome, fold = 10)

## stratified sampling (10 folds)
kfold_split(outcome, fold = 10, method = "stratified")

## stratified sampling (2 folds, with 50% training data)
twofold_split(outcome, train_proportion = 0.5, method = "stratified")

[Package HTRX version 1.2.4 Index]