do_cv_direct {HTRX}    R Documentation

Direct HTRX: k-fold cross-validation on short haplotypes


Description

Direct k-fold cross-validation, used to compute the out-of-sample variance explained by the features selected from HTRX. It can be applied to select haplotypes based on HTR, or to select single nucleotide polymorphisms (SNPs).


Usage

do_cv_direct(
  data_nosnp,
  featuredata,
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  method = "simple",
  criteria = "BIC",
  gain = TRUE,
  runparallel = FALSE,
  mc.cores = 6,
  fold = 10,
  kfoldseed = 123,
  verbose = FALSE
)



Arguments

data_nosnp
a data frame containing the outcome (which must be the first column) and any fixed covariates (for example, sex, age and the first 18 PCs), but no SNPs or haplotypes.

featuredata
a data frame of the feature data, e.g. haplotype data created by HTRX, or SNPs. These features exclude all the data in data_nosnp, and will be selected using direct k-fold cross-validation.

featurecap
a positive integer which manually sets the maximum number of independent features. By default, featurecap=dim(featuredata)[2], i.e. the number of columns of featuredata.

usebinary
a non-negative number selecting the model: a linear model if usebinary=0, logistic regression via fastglm if usebinary=1 (default), and logistic regression via glm if usebinary>1.

method
the method used for data splitting, either "simple" (default) or "stratified".

criteria
the criterion for model selection, either "BIC" (default), "AIC" or "lasso".

gain
logical. If gain=TRUE (default), report the additional variance explained by the selected features beyond the fixed covariates; otherwise, report the total variance explained by all the variables.

runparallel
logical. Whether to run in parallel using the mclapply function from the R package "parallel". mclapply does not work on Windows, so Windows users should keep runparallel=FALSE (default).

mc.cores
an integer giving the number of cores used for parallel computation. By default, mc.cores=6. Only used when runparallel=TRUE.

fold
a positive integer specifying how many folds the data should be split into for cross-validation. By default, fold=10.

kfoldseed
a positive integer specifying the seed used to split the data for k-fold cross-validation. By default, kfoldseed=123.

verbose
logical. If verbose=TRUE, print out the inference steps. By default, verbose=FALSE.
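The difference between the two data-splitting methods can be illustrated with a small base-R sketch. It assumes that "simple" means a purely random assignment of individuals to folds and that "stratified" means the assignment is done separately within each outcome class; how HTRX implements the split internally is not stated here, and the variable names below are illustrative only.

```r
# Sketch only: simple vs stratified fold assignment for a binary outcome.
# Not the HTRX implementation; all names are hypothetical.
set.seed(123)                      # plays the role of kfoldseed
y <- rbinom(200, 1, 0.2)           # simulated binary outcome, about 20% cases
k <- 10                            # plays the role of fold

# "simple": shuffle fold labels across all individuals
simple_folds <- sample(rep(seq_len(k), length.out = length(y)))

# "stratified": shuffle fold labels separately within each outcome class,
# so every fold receives a near-equal share of cases and controls
strat_folds <- integer(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  strat_folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

# Number of cases per fold under each scheme
tapply(y, simple_folds, sum)
tapply(y, strat_folds, sum)
```

With an unbalanced binary outcome, the stratified assignment keeps the number of cases per fold within one of each other, whereas the simple assignment can leave some folds with noticeably fewer cases.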


Details

Function do_cv_direct performs k-fold cross-validation directly: in each fold, features are selected on the training set using the specified criterion, and the out-of-sample variance explained by the selected features is computed on the test set. This function runs faster than do_cv with large sim_times, but may lose some accuracy, and it does not return a single fixed set of features.
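The procedure described above can be sketched in base R. This is a minimal illustration, not the HTRX code: it assumes a continuous outcome fitted with lm (the usebinary=0 case), a greedy forward selection scored by BIC, and an out-of-sample R-squared as the measure of variance explained. The data are simulated, and the helpers select_bic and oos_r2 are hypothetical names introduced only for this example.

```r
# Sketch only: direct k-fold CV with BIC forward selection in base R.
# select_bic and oos_r2 are hypothetical helpers, not HTRX functions.
set.seed(123)
n <- 300
features <- as.data.frame(matrix(rbinom(n * 5, 1, 0.3), nrow = n))
names(features) <- paste0("f", 1:5)
sex <- rbinom(n, 1, 0.5)
# Simulated continuous outcome driven by sex and feature f1 only
covars <- data.frame(outcome = 0.5 * sex + features$f1 + rnorm(n), sex = sex)

# Forward selection: keep adding the feature that lowers BIC the most
select_bic <- function(train_cov, train_feat, cap) {
  chosen <- character(0)
  best <- BIC(lm(outcome ~ ., data = train_cov))
  repeat {
    rest <- setdiff(names(train_feat), chosen)
    if (length(rest) == 0 || length(chosen) >= cap) break
    scores <- sapply(rest, function(f) {
      d <- cbind(train_cov, train_feat[, c(chosen, f), drop = FALSE])
      BIC(lm(outcome ~ ., data = d))
    })
    if (min(scores) >= best) break   # no candidate improves BIC
    best <- min(scores)
    chosen <- c(chosen, names(which.min(scores)))
  }
  chosen
}

# Out-of-sample R-squared of a fitted model on held-out data
oos_r2 <- function(fit, test) {
  pred <- predict(fit, newdata = test)
  1 - sum((test$outcome - pred)^2) /
    sum((test$outcome - mean(test$outcome))^2)
}

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))   # random fold labels
results <- lapply(seq_len(k), function(i) {
  tr <- fold_id != i
  chosen <- select_bic(covars[tr, ], features[tr, , drop = FALSE],
                       cap = ncol(features))
  full <- cbind(covars, features[, chosen, drop = FALSE])
  fit_full <- lm(outcome ~ ., data = full[tr, ])
  fit_base <- lm(outcome ~ ., data = covars[tr, ])
  # Gain: variance explained beyond the fixed covariates (cf. gain=TRUE)
  list(gain = oos_r2(fit_full, full[!tr, ]) - oos_r2(fit_base, covars[!tr, ]),
       selected = chosen)
})

sapply(results, function(r) r$gain)       # out-of-sample gain per fold
lapply(results, function(r) r$selected)   # features selected per fold
```

Because selection is repeated independently in every training set, the selected features can differ from fold to fold, which is why this approach reports per-fold results rather than one fixed feature set.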


Value

do_cv_direct returns a list containing the out-of-sample variance explained in each of the k test sets, and the features selected in each of the k training sets.


References

Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3.1 (2023): vbad038.

Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).

Efron, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.

Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics (1978): 461-464.

McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).

Akaike, Hirotugu. "A new look at the statistical model identification." IEEE transactions on automatic control 19.6 (1974): 716-723.

Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.


Examples

## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is an example dataset
## which contains the outcome (binary), sex, age and 18 PCs

## visualise the covariates data
## we will use only the first two covariates: sex and age in the example

## visualise the genotype data for the first genome

## we perform HTRX on the first 4 SNPs
## we first generate all the haplotype data, as defined by HTRX

## If the data is haploid, please set
## HTRX_matrix=make_htrx(HTRX::example_hap1[,1:4],
##                       HTRX::example_hap1[,1:4])

## next compute the maximum number of independent features
## then perform HTRX using direct cross-validation
## If we want to compute the total variance explained
## we can set gain=FALSE in the call below

htrx_results <- do_cv_direct(HTRX::example_data_nosnp[,1:3],

[Package HTRX version 1.2.4 Index]