do_cv_direct {HTRX} | R Documentation |
Direct HTRX: k-fold cross-validation on short haplotypes
Description
Direct k-fold cross-validation used to compute the out-of-sample variance explained by selected features from HTRX. It can be applied to select haplotypes based on HTR, or select single nucleotide polymorphisms (SNPs).
Usage
do_cv_direct(
data_nosnp,
featuredata,
featurecap = dim(featuredata)[2],
usebinary = 1,
method = "simple",
criteria = "BIC",
gain = TRUE,
runparallel = FALSE,
mc.cores = 6,
fold = 10,
kfoldseed = 123,
verbose = FALSE
)
Arguments
data_nosnp |
a data frame containing the outcome (which must be the first column) and any fixed covariates (for example, sex, age and the first 18 PCs), but no SNPs or haplotypes. |
featuredata |
a data frame of the feature data, e.g. haplotype data created by HTRX or SNPs.
These features exclude all the data in |
featurecap |
a positive integer which manually sets the maximum number of independent features.
By default, |
usebinary |
a non-negative number representing different models.
Use linear model if |
method |
the method used for data splitting, either |
criteria |
the criteria for model selection, either |
gain |
logical. If |
runparallel |
logical. Use parallel programming based on |
mc.cores |
an integer giving the number of cores used for parallel programming.
By default, |
fold |
a positive integer specifying how many folds the data should be split into for cross-validation. |
kfoldseed |
a positive integer specifying the seed used to
split data for k-fold cross validation. By default, |
verbose |
logical. If |
Details
Function do_cv_direct directly performs k-fold cross-validation: features are selected from the training set using the criterion given by criteria, and the out-of-sample variance explained by the selected features is computed on the test set.
This function runs faster than do_cv with large sim_times, but it may lose some accuracy and it does not return a fixed set of features.
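The procedure described above can be illustrated with a small base-R sketch on simulated data. This is not the HTRX implementation: it uses a linear model, forward selection by BIC via stats::step as a stand-in for the package's feature selection, and it assumes that the reported gain is the out-of-sample variance explained by covariates plus selected features minus that explained by covariates alone. All object names (dat, feats, oos_r2, results) are made up for the illustration.

```r
## Minimal sketch of direct k-fold cross-validation (illustration only).
set.seed(123)
n <- 300
covar <- data.frame(age = rnorm(n))
feats <- as.data.frame(matrix(rbinom(n * 4, 1, 0.4), n, 4,
                              dimnames = list(NULL, paste0("f", 1:4))))
y <- 0.5 * covar$age + 1.5 * feats$f1 + rnorm(n)  # only f1 is truly associated
dat <- cbind(y = y, covar, feats)

fold <- 5
fold_id <- sample(rep_len(seq_len(fold), n))      # simple random fold labels

## out-of-sample R^2 on held-out data: 1 - SSE / SST
oos_r2 <- function(fit, test) {
  pred <- predict(fit, newdata = test)
  1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)
}

results <- lapply(seq_len(fold), function(k) {
  train <- dat[fold_id != k, ]
  test  <- dat[fold_id == k, ]
  base_fit <- lm(y ~ age, data = train)           # covariates only
  ## select features on the training fold only (forward selection, BIC penalty)
  upper <- reformulate(c("age", names(feats)))
  sel_fit <- step(base_fit, scope = list(lower = ~ age, upper = upper),
                  direction = "forward", k = log(nrow(train)), trace = 0)
  list(selected = setdiff(attr(terms(sel_fit), "term.labels"), "age"),
       gain = oos_r2(sel_fit, test) - oos_r2(base_fit, test))
})

## one set of selected features and one gain per fold
lapply(results, `[[`, "selected")
sapply(results, `[[`, "gain")
```

Because selection is repeated independently in each training fold, the selected features can differ between folds, which is why this style of cross-validation does not yield a single fixed feature set.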
Value
do_cv_direct returns a list containing the out-of-sample variance explained in each test set and the features selected in each of the k training sets.
References
Yang Y, Lawson DJ. HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype. Bioinformatics Advances 3.1 (2023): vbad038.
Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024).
Efron, B. "Bootstrap methods: another look at the jackknife." The Annals of Statistics 7 (1979): 1-26.
Schwarz, Gideon. "Estimating the dimension of a model." The Annals of Statistics (1978): 461-464.
McFadden, Daniel. "Conditional logit analysis of qualitative choice behavior." (1973).
Akaike, Hirotugu. "A new look at the statistical model identification." IEEE transactions on automatic control 19.6 (1974): 716-723.
Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267-288.
Examples
## use dataset "example_hap1", "example_hap2" and "example_data_nosnp"
## "example_hap1" and "example_hap2" are
## both genomes of 8 SNPs for 5,000 individuals (diploid data)
## "example_data_nosnp" is an example dataset
## which contains the outcome (binary), sex, age and 18 PCs
## visualise the covariates data
## we will use only the first two covariates: sex and age in the example
head(HTRX::example_data_nosnp)
## visualise the genotype data for the first genome
head(HTRX::example_hap1)
## we perform HTRX on the first 4 SNPs
## we first generate all the haplotype data, as defined by HTRX
HTRX_matrix <- make_htrx(HTRX::example_hap1[, 1:4],
                         HTRX::example_hap2[, 1:4])
## If the data is haploid, please set
## HTRX_matrix=make_htrx(HTRX::example_hap1[,1:4],
## HTRX::example_hap1[,1:4])
## next compute the maximum number of independent features
featurecap=htrx_max(nsnp=4,cap=10)
## then perform HTRX using direct cross-validation
## If we want to compute the total variance explained
## we can set gain=FALSE in the example below
htrx_results <- do_cv_direct(HTRX::example_data_nosnp[, 1:3],
                             HTRX_matrix, featurecap = featurecap,
                             usebinary = 1, method = "stratified",
                             criteria = "lasso", gain = TRUE,
                             runparallel = FALSE, verbose = TRUE)