do_cv_direct {HTRX}    R Documentation

Direct HTRX: k-fold cross-validation on short haplotypes

Description

Performs direct k-fold cross-validation to compute the out-of-sample variance explained by the features selected by HTRX. It can be used to select haplotypes based on HTR, or to select single nucleotide polymorphisms (SNPs).

Usage

do_cv_direct(
  data_nosnp,
  featuredata,
  featurecap = dim(featuredata)[2],
  usebinary = 1,
  method = "simple",
  criteria = "BIC",
  gain = TRUE,
  runparallel = FALSE,
  mc.cores = 6,
  fold = 10,
  kfoldseed = 123,
  verbose = FALSE
)

Arguments

data_nosnp

a data frame containing the outcome (which must be the first column) and any fixed covariates (for example, sex, age and the first 18 PCs), but no SNPs or haplotypes.

featuredata

a data frame of the feature data, e.g. haplotype data created by HTRX or SNPs. These features exclude all the data in data_nosnp, and will be selected using direct k-fold cross-validation.

featurecap

a positive integer which manually sets the maximum number of independent features. By default, featurecap=dim(featuredata)[2], i.e. the number of columns of featuredata.

usebinary

a non-negative number that selects the model: a linear model if usebinary=0, a logistic regression model via fastglm if usebinary=1 (default), and a logistic regression model via glm if usebinary>1.

method

the method used for data splitting, either "simple" (default) or "stratified".
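
To illustrate the difference between the two splitting methods, the following base-R sketch assigns samples to folds either completely at random or separately within each outcome class. The helper assign_folds is hypothetical and is not part of HTRX; the package's own splitting code may differ.

```r
# Hypothetical helper (not part of HTRX): assign each sample to one of `fold` folds.
assign_folds <- function(outcome, fold = 10, method = "simple", seed = 123) {
  set.seed(seed)
  n <- length(outcome)
  folds <- integer(n)
  if (method == "simple") {
    # shuffle the fold labels 1..fold across all samples
    folds <- sample(rep(seq_len(fold), length.out = n))
  } else if (method == "stratified") {
    # shuffle the fold labels within each outcome class separately,
    # so that every fold has a similar class balance
    for (cls in unique(outcome)) {
      idx <- which(outcome == cls)
      folds[idx] <- sample(rep(seq_len(fold), length.out = length(idx)))
    }
  } else {
    stop("method must be \"simple\" or \"stratified\"")
  }
  folds
}

# toy binary outcome: 80 controls and 20 cases
outcome <- rep(c(0, 1), times = c(80, 20))
folds <- assign_folds(outcome, fold = 10, method = "stratified")
table(folds, outcome)  # each fold holds 8 controls and 2 cases
```

With method = "simple" the number of cases per fold can vary by chance, which matters most when the outcome is unbalanced.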

criteria

the criterion for model selection, either "BIC" (default), "AIC" or "lasso".

gain

logical. If gain=TRUE (default), report the variance explained in addition to fixed covariates; otherwise, report the total variance explained by all the variables.
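
The distinction between the two settings can be sketched with a simple linear model on simulated data. This is only an in-sample analogue under assumed variable names (sex, feature, y); do_cv_direct itself reports out-of-sample values, and its internal measure of variance explained may differ from the ordinary R-squared used here.

```r
set.seed(1)
n <- 200
sex <- rbinom(n, 1, 0.5)
feature <- rbinom(n, 1, 0.3)   # stand-in for a single haplotype feature
y <- 0.5 * sex + 1.0 * feature + rnorm(n)

r2_cov  <- summary(lm(y ~ sex))$r.squared            # fixed covariates only
r2_full <- summary(lm(y ~ sex + feature))$r.squared  # covariates plus the feature

r2_full - r2_cov   # analogue of gain=TRUE: variance explained on top of covariates
r2_full            # analogue of gain=FALSE: total variance explained
```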

runparallel

logical. Whether to use parallel programming based on the mclapply function from the R package "parallel". mclapply cannot use more than one core on Windows, so Windows users should keep runparallel=FALSE (default).

mc.cores

an integer giving the number of cores used for parallel programming. By default, mc.cores=6. This argument only takes effect when runparallel=TRUE.
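
As a rough illustration of the platform restriction, the sketch below chooses the number of cores according to the operating system before calling mclapply. The variable n_cores and the toy task are assumptions made for the example and are not taken from HTRX.

```r
library(parallel)

# mclapply relies on forking, which is unavailable on Windows;
# there it only accepts mc.cores = 1 and then runs sequentially
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

squares <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)
unlist(squares)
```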

fold

a positive integer specifying how many folds the data should be split into for cross-validation. By default, fold=10.

kfoldseed

a positive integer specifying the seed used to split data for k-fold cross validation. By default, kfoldseed=123.

verbose

logical. If verbose=TRUE, print out the inference steps. By default, verbose=FALSE.

Details

Function do_cv_direct directly performs k-fold cross-validation: features are selected from the training set using a specified criterion, and the out-of-sample variance explained by the selected features is computed on the test set. This function runs faster than do_cv with large sim_times, but may lose some accuracy, and it doesn't return a fixed set of features.
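
The procedure can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: it uses simulated data, a linear model, forward selection with step() as a stand-in for BIC-based feature selection, and the hypothetical helper oos_r2 for the out-of-sample R-squared.

```r
set.seed(123)
n <- 300
feats <- as.data.frame(matrix(rbinom(n * 5, 1, 0.4), n, 5))
names(feats) <- paste0("f", 1:5)          # toy binary features
sex <- rbinom(n, 1, 0.5)
outcome <- 0.3 * sex + 1.0 * feats$f1 + 0.8 * feats$f3 + rnorm(n)
dat <- cbind(data.frame(outcome = outcome, sex = sex), feats)

fold <- 5
fold_id <- sample(rep(seq_len(fold), length.out = n))

# hypothetical helper: out-of-sample R^2 of a fitted model on held-out data
oos_r2 <- function(fit, test) {
  pred <- predict(fit, newdata = test)
  1 - sum((test$outcome - pred)^2) / sum((test$outcome - mean(test$outcome))^2)
}

results <- lapply(seq_len(fold), function(k) {
  train <- dat[fold_id != k, ]
  test  <- dat[fold_id == k, ]
  base_fit <- lm(outcome ~ sex, data = train)   # fixed covariates only
  # forward selection on the training set; k = log(n) makes step() use BIC
  sel_fit <- step(base_fit,
                  scope = list(lower = ~ sex,
                               upper = ~ sex + f1 + f2 + f3 + f4 + f5),
                  direction = "forward", k = log(nrow(train)), trace = 0)
  list(gain = oos_r2(sel_fit, test) - oos_r2(base_fit, test),
       selected = setdiff(attr(terms(sel_fit), "term.labels"), "sex"))
})

sapply(results, `[[`, "gain")       # one out-of-sample value per test fold
lapply(results, `[[`, "selected")   # selected features may differ between folds
```

Because selection is repeated independently in each training set, the folds need not agree on which features are chosen, which is why no single fixed feature set is returned.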

Value

do_cv_direct returns a list of the out-of-sample variance explained in each of the k test sets, and the features selected in each of the k training sets.

References

Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3.1 (2023): vbad038.

Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).

Efron, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.

Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics 6.2 (1978): 461-464.

McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).

Akaike, Hirotugu. "A new look at the statistical model identification." IEEE transactions on automatic control 19.6 (1974): 716-723.

Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.

Examples

## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is an example dataset
## which contains the outcome (binary), sex, age and 18 PCs

## visualise the covariates data
## we will use only the first two covariates: sex and age in the example
head(HTRX::example_data_nosnp)

## visualise the genotype data for the first genome
head(HTRX::example_hap1)

## we perform HTRX on the first 4 SNPs
## we first generate all the haplotype data, as defined by HTRX
HTRX_matrix=make_htrx(HTRX::example_hap1[,1:4],
                      HTRX::example_hap2[,1:4])

## If the data is haploid, please set
## HTRX_matrix=make_htrx(HTRX::example_hap1[,1:4],
##                       HTRX::example_hap1[,1:4])

## next compute the maximum number of independent features
featurecap=htrx_max(nsnp=4,cap=10)
## then perform HTRX using direct cross-validation
## to compute the total variance explained instead,
## set gain=FALSE in the call below

htrx_results <- do_cv_direct(HTRX::example_data_nosnp[,1:3],
                             HTRX_matrix,featurecap=featurecap,
                             usebinary=1,method="stratified",
                             criteria="lasso",gain=TRUE,
                             runparallel=FALSE,verbose=TRUE)


[Package HTRX version 1.2.4 Index]