do_cv {HTRX}R Documentation

Two-stage HTRX: Model selection on short haplotypes

Description

Two-step cross-validation used to select the best HTRX model. It can be applied to select haplotypes based on HTR, or select single nucleotide polymorphisms (SNPs).

Usage

do_cv(
  data_nosnp,
  featuredata,
  train_proportion = 0.5,
  sim_times = 5,
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  method = "simple",
  criteria = "BIC",
  gain = TRUE,
  nmodel = 3,
  dataseed = 1:sim_times,
  runparallel = FALSE,
  mc.cores = 6,
  fold = 10,
  kfoldseed = 123,
  returnwork = FALSE,
  verbose = FALSE
)

do_cv_step1(
  data_nosnp,
  featuredata,
  train_proportion = 0.5,
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  method = "simple",
  criteria = "BIC",
  nmodel = 3,
  splitseed = 123,
  runparallel = FALSE,
  mc.cores = 6,
  verbose = FALSE
)

infer_step1(
  data_nosnp,
  featuredata,
  train,
  criteria = "BIC",
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  nmodel = nmodel,
  runparallel = FALSE,
  mc.cores = 6,
  verbose = FALSE
)

infer_fixedfeatures(
  data_nosnp,
  featuredata,
  train = (1:nrow(data_nosnp))[-test],
  test,
  features,
  coefficients = NULL,
  gain = TRUE,
  usebinary = 1,
  R2only = FALSE,
  verbose = FALSE
)

Arguments

data_nosnp

a data frame with outcome (the outcome must be the first column with colnames(data_nosnp)[1]="outcome"), fixed covariates (for example, sex, age and the first 18 PCs) if there are, and without SNPs or haplotypes.

featuredata

a data frame of the feature data, e.g. haplotype data created by HTRX or SNPs. These features exclude all the data in data_nosnp, and will be selected using 2-step cross-validation.

train_proportion

a positive number between 0 and 1 giving the proportion of the training dataset when splitting data into 2 folds. By default, train_proportion=0.5.

sim_times

an integer giving the number of simulations in Step 1 (see details). By default, sim_times=5.

featurecap

a positive integer which manually sets the maximum number of independent features. By default, featurecap=40.

usebinary

a non-negative number representing different models. Use linear model if usebinary=0, use logistic regression model via fastglm if usebinary=1 (by default), and use logistic regression model via glm if usebinary>1.

method

the method used for data splitting, either "simple" (default) or "stratified".

criteria

the criteria for model selection, either "BIC" (default), "AIC" or "lasso".

gain

logical. If gain=TRUE (default), report the variance explained in addition to fixed covariates; otherwise, report the total variance explained by all the variables.

nmodel

a positive integer specifying the number of candidate models that the criterion selects. By default, nmodel=3.

dataseed

a vector of the seed that each simulation in Step 1 (see details) uses. The length of dataseed must be the same as sim_times. By default, dataseed=1:sim_times.

runparallel

logical. Use parallel programming based on mclapply function from R package "parallel" or not. Note that for Windows users, mclapply doesn't work, so please set runparallel=FALSE (default).

mc.cores

an integer giving the number of cores used for parallel programming. By default, mc.cores=6. This only works when runparallel=TRUE.

fold

a positive integer specifying how many folds the data should be split into for cross-validation.

kfoldseed

a positive integer specifying the seed used to split data for k-fold cross validation. By default, kfoldseed=123.

returnwork

logical. If returnwork=TRUE, return a vector of the maximum number of features that are assessed in each simulation, excluding the fixed covariates. This is used to assess how much computational 'work' is done in Step 1(2) of HTRX (see details). By default, returnwork=FALSE.

verbose

logical. If verbose=TRUE, print out the inference steps. By default, verbose=FALSE.

splitseed

a positive integer giving the seed of data split.

train

a vector of the indexes of the training data.

test

a vector of the indexes of the test data.

features

a character of the names of the fixed features, excluding the intercept.

coefficients

a vector giving the coefficients of the fixed features, including the intercept. If the fixed features don't have fixed coefficients, set coefficients=NULL (default).

R2only

logical. If R2only=TRUE, function infer_fixedfeatures only returns the variance explained in the test data. By default, R2only=FALSE.

Details

Function do_cv is the main function used for selecting haplotypes from HTRX or SNPs. It is a two-step algorithm and is used for alleviating overfitting.

Step 1: select candidate models. This is to address the model search problem, and is chosen to obtain a set of models more diverse than traditional bootstrap resampling.

(1) Randomly sample a subset (50 Specifically, when the outcome is binary, stratified sampling is used to ensure the subset has approximately the same proportion of cases and controls as the whole data;

(2) If criteria="AIC" or criteria="BIC", start from a model with fixed covariates (e.g. 18 PCs, sex and age), and perform forward regression on the subset, i.e. iteratively choose a feature (in addition to the fixed covariates) to add whose inclusion enables the model to explain the largest variance, and select s models with the lowest Akaike information criterion (AIC) or Bayesian Information Criteria (BIC) to enter the candidate model pool; If criteria="lasso", using least absolute shrinkage and selection operator (lasso) to directly select the best s models to enter the candidate model pool;

(3) repeat (1)-(2) B times, and select all the different models in the candidate model pool as the candidate models.

Step 2: select the best model using k-fold cross-validation.

(1) Randomly split the whole data into k groups with approximately equal sizes, using stratified sampling when the outcome is binary;

(2) In each of the k folds, use a fold as the validation dataset, a fold as the test dataset, and the remaining folds as the training dataset. Then, fit all the candidate models on the training dataset, and use these fitted models to compute the additional variance explained by features (out-of-sample variance explained) in the validation and test dataset. Finally, select the candidate model with the biggest average out-of-sample variance explained in the validation set as the best model, and report the out-of-sample variance explained in the test set.

Function do_cv_step1 is the Step 1 (1)-(2) described above. Function infer_step1 is the Step 1 (2) described above. Function infer_fixedfeatures is used to fit all the candidate models on the training dataset, and compute the additional variance explained by features (out-of-sample R2) in the test dataset, as described in the Step 2 (2) above.

Value

do_cv returns a list containing the best model selected, and the out-of-sample variance explained in each test set.

do_cv_step1 and infer_step1 return a list of three candidate models selected by a single simulation.

infer_fixedfeatures returns a list of the variance explained in the test set if R2only=TRUE, otherwise, it returns a list of the variance explained in the test set, the model including all the variables, and the null model, i.e. the model with outcome and fixed covariates only.

References

Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3.1 (2023): vbad038.

Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).

Eforn, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.

Schwarz, Gideon. "Estimating the dimension of a model." The annals of statistics (1978): 461-464.

McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).

Akaike, Hirotugu. "A new look at the statistical model identification." IEEE transactions on automatic control 19.6 (1974): 716-723.

Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.

Examples

## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is an example dataset
## which contains the outcome (binary), sex, age and 18 PCs

## visualise the covariates data
## we will use only the first two covariates: sex and age in the example
head(HTRX::example_data_nosnp)

## visualise the genotype data for the first genome
head(HTRX::example_hap1)

## we perform HTRX on the first 4 SNPs
## we first generate all the haplotype data, as defined by HTRX
HTRX_matrix=make_htrx(HTRX::example_hap1[1:300,1:4],
                      HTRX::example_hap2[1:300,1:4])

## If the data is haploid, please set
## HTRX_matrix=make_htrx(HTRX::example_hap1[1:300,1:4],
##                       HTRX::example_hap1[1:300,1:4])

## then perform HTRX using 2-step cross-validation in a single small example
## to compute additional variance explained by haplotypes
## If you want to compute total variance explained, please set gain=FALSE
CV_results <- do_cv(HTRX::example_data_nosnp[1:300,1:2],
                    HTRX_matrix,train_proportion=0.5,
                    sim_times=1,featurecap=4,usebinary=1,
                    method="simple",criteria="BIC",
                    gain=TRUE,runparallel=FALSE,verbose=TRUE)

#This result would be more precise when setting larger sim_times and featurecap

[Package HTRX version 1.2.4 Index]