R: Bootstrap QTL analysis for accurate effect size estimation

BootstrapQTL {BootstrapQTL}

R Documentation

Bootstrap QTL analysis for accurate effect size estimation

Description

Performs cis-QTL mapping using MatrixEQTL then performs a bootstrap analysis to obtain unbiased effect size estimates for traits with significant evidence of genetic regulation correcting for the "Winner's Curse" arising from lead-SNP selection.

Usage

BootstrapQTL(snps, gene, snpspos, genepos, cvrt = SlicedData$new(),
  n_bootstraps = 200, n_cores = 1, eGene_detection_file_name = NULL,
  bootstrap_file_directory = NULL, cisDist = 1e+06,
  local_correction = "bonferroni", global_correction = "fdr",
  correction_type = "shrinkage", errorCovariance = numeric(),
  useModel = modelLINEAR, eigenMT_tests_per_gene = NULL)

Arguments

`snps`	`SlicedData` object containing genotype information used as input into `Matrix_eQTL_main`.
`gene`	`SlicedData` object containing gene expression information used as input into `Matrix_eQTL_main`.
`snpspos`	`data.frame` object with information about SNP locations. Used in conjunction with `'genespos'` and `'cisDist'` to determine SNPs in cis of each gene. Must have three columns: 'snpid' describing the name of the SNP and corresponding to rows in the 'snps' matrix. 'chr' describing the chromosome for each SNP. 'pos' describing the position of the SNP on the chromosome.
`genepos`	`data.frame` object with information about transcript locations. Used in conjunction with `'snpspos'` and `'cisDist'` to determine SNPs in cis of each gene. Must have four columns: 'geneid' describing the name of the gene and corresponding to rows in the 'gene' matrix. 'chr' describing the chromosome for each SNP. 'left' describing the start position of the transcript. 'right' describing the end position of the transcript. Note that `Matrix_eQTL_main` tests all variants within `cisDist` of the start or end of the gene. If you wish instead to test all variants within `cisDist` of the transcription start site, you should specify this location in both the 'left' and 'right' columns of the `genepos` data.frame. Similarly, when analysing a molecular phenotype that have a single chromosomal position then the 'left' and 'right' columns should both contain the same position.
`cvrt`	`SlicedData` object containing covariate information used as input into `Matrix_eQTL_main`. Argument can be ignored in the case of no covariates.
`n_bootstraps`	number of bootstraps to run.
`n_cores`	number of cores to parallise the bootstrap procedure over.
`eGene_detection_file_name`	`character`, `connection` or `NULL`. File to save local cis associations to in the eGene detection analysis. Corresponds to the `output_file_name.cis` argument in `Matrix_eQTL_main`. If a file with this name exists it is overwritten, if `NULL` output is not saved to file.
`bootstrap_file_directory`	`character` or `NULL`. If not `NULL`, files will be saved in this directory storing local cis associations for the bootstrap eGene detection group (detection_bootstrapnumber.txt) and local cis associations the bootstrap left-out eGene effect size estimation group (estimation_bootstrapnumber.txt). Estimation group files will only be saved where signficant eGenes are also significant in the bootstrap detection group (see Details). Corresponds to the `output_file_name.cis` argument in the respective calls to `Matrix_eQTL_main`. Files in this directory will be overwritten if they already exist.
`cisDist`	`numeric`. Argument to `Matrix_eQTL_main` controlling the maximum distance from a gene to consider tests for eQTL mapping.
`local_correction`	multiple testing correction method to use when correcting p-values across all SNPs at each gene (see EQTL mapping section in Details). Can be a method specified in `p.adjust.methods`, "qvalue" for the `qvalue` package, or "eigenMT" if EigenMT has been used to estimate the number effective independent tests (see `eigenMT_tests_per_gene`).
`global_correction`	multiple testing correction method to use when correcting p-values across all genes after performing local correction (see EQTL mapping section in Details). Must be a method specified in `p.adjust.methods` or "qvalue" for the `qvalue` package.
`correction_type`	`character`. One of "shrinkage", "out_of_sample" or "weighted". Determines which Winner's Curse correction method is used (see Details).
`errorCovariance`	`numeric matrix` argument to `Matrix_eQTL_main` specifying the error covariance.
`useModel`	`integer` argument to `Matrix_eQTL_main` specifying the type of model to fit between each SNP and gene. Should be one of `modelLINEAR`, `modelANOVA`, or `modelLINEAR_CROSS`.
`eigenMT_tests_per_gene`	`data.frame` containing the number of effective independent tests for each gene estimated by the EigenMT (https://github.com/joed3/eigenMT). Ignore unless `'local_correction="eigenMT"'`.

Details

Although the package interface and documentation describe the use of BootstrapQTL for cis-eQTL mapping, the package can be applied to any QTL study of quantitative traits with chromosomal positions, for example cis-QTL mapping of epigenetic modifications. Any matrix of molecular trait data can be provided to the 'gene' argument provided a corresponding 'genepos' 'data.frame' detailing the chromosomal positions of each trait is provided.

Cis-eQTL mapping:

EQTL mapping is performed using the MatrixEQTL package. A three step hieararchical multiple testing correction procedure is used to determine significant eGenes and eSNPs. At the first step, nominal p-values from MatrixEQTL for all cis-SNPs are adjusted for each gene separately using the method specified in the 'local_correction' argument (Bonferroni correction by default). In the second step, the best adjusted p-value is taken for each gene, and this set of locally adjusted p-values is corrected for multiple testing across all genes using the methods pecified in the 'global_correction' argument (FDR correction by default). In the third step, an eSNP significance threshold on the locally corrected p-values is determined as the locally corrected p-value corresponding to the globally corrected p-value threshold of 0.05.

A gene is considered a significant eGene if its globally corrected p-value is < 0.05, and a SNP is considered a significant eSNP for that eGene if its locally corrected p-value < the eSNP significance threshold.

The default settings for 'local_correction' and 'global_correction' were found to best control eGene false discovery rate without sacrificing sensitivity (see citation).

Winner's Curse correction:

EQTL effect sizes of significant eSNPs on significant eGenes are typically overestimated when compared to replication datasets (see citation). BootstrapEQTL removes this overestimation by performing a bootstrap procedure after eQTL mapping.

Three Winner's Curse correction methods are available: the Shrinkage method, the Out of Sample method, and the Weighted Estimator method. All three methods work on the same basic principle of performing repeated sample bootstrapping to partition the dataset into two groups: an eQTL detection group comprising study samples select via random sampling with replacement, and an eQTL effect size estimation group comprising the remaining samples not selected via the random sampling. The default estimator, 'correction_type = "shrinkage"', provided the most accurate corrected effect sizes in our simulation study (see citation).

The shrinkage method ("shrinkage" in 'correction_type') corrects for the Winner's Curse by measuring the average difference between the eQTL effect size in the bootstrap detection group and the bootstrap estimation group, then subtracting this difference from the naive eQTL effect size estimate obtained from the eGene detection analysis prior to the bootstrap procedure.

The out of sample method ("out_of_sample" in 'correction_type') corrects for the Winner's Curse by taking the average eQTL effect size across bootstrap estimation groups as an unbiased effect size estimate.

The weighted estimator method ("weighted" in 'correction_type') corrects for the Winner's Curse by taking a weighted average of the nominal estimate of the eQTL effect size and the average of eQTL effect sizes across the bootstrap estimation groups: 0.368 * naive_estimate + 0.632 * mean(bootstrap estimation group effect sizes).

In all three methods bootstrap effect sizes only contribute to the Winner's Curse correction if the corresponding eSNP is significantly associated with the eGene in the bootstrap detection group (locally corrected bootstrap P-value < eSNP significance threshold determing in the eQTL mapping step).

Note that eQTLs may not remain significant in all bootstraps, so the effective number of bootstraps used to obtain the Winner's Curse estimate will typically be lower than the number of bootstraps specified in 'n_bootstraps'. The number of bootstraps that were significant for each eQTL are reported in the 'correction_boots' column of the returned table.

Winner's Curse corrected effect sizes

The user should be aware that ability to correct for Winner's Curse can vary across significant eQTLs depending on their statistical power (i.e. minor allele frequency, true effect size, and study sample size). Users should be skeptical of corrected effect sizes that are larger than the nominal effect sizes estimated by MatrixEQTL, which likely reflects low power for eQTL detection rather than an underestimated effect size.

Bootstrap warning messages:

It is possible for bootstrap analyses to fail due to the reduced sample sizes of the bootstrap detection and bootstrap estimation groups. For example, the bootstrap resampling may lead to an detection or estimation groups in which all individuals are homozygous for an eSNP or have no variance in their supplied covariates (e.g. the estimation group may comprise individuals all of the same sex). In this case the bootstrap will fail for all eQTLs since MatrixEQTL will be unable to perform the model fitting.

Failed bootstraps are reported after the bootstrap procedure in a series of warning messages indicating the number of bootstrap failures grouped by the reason for the bootstrap failure.

Value

A data.frame (or data.table if the user has the library loaded) containing the results for each significant eQTL:

'eGene':: The eQTL eGene.
'eSNP':: The eQTL eSNP.
'statistic':: The test statistic for the association between the eGene and eSNP.
'nominal_beta':: The eQTL effect size for the eGene-eSNP pair estimated by MatrixEQTL.
'corrected_beta':: The eQTL effect size after adjustment for the winners_curse.
'winners_curse':: The amount of effect size overestimation determined by the bootstrap analysis (See Details).
'correction_boots':: The number of bootstraps that contributed to the estimation of the winners_curse, i.e. the number of bootstraps in which the eSNP remained significantly associated with the eGene (see Details).
'nominal_pval':: The p-value for the eGene-eSNP pair from the MatrixEQTL analysis.
'eSNP_pval':: The locally corrected p-value for the eGene-eSNP pair (see Details).
'eGene_pval':: The globally corrected p-value for the eGene based on its top eSNP (see Details).

Examples

# Locations for example data from the MatrixEQTL package
base.dir = find.package('MatrixEQTL');
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Load the SNP data
snps = SlicedData$new();
snps$fileDelimiter = "\t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);

# Load the data detailing the position of each SNP
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);

# Load the gene expression data
gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);

# Load the data detailing the position of each gene
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);

# Load the covariates data
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name)>0) {
  cvrt$LoadFile(covariates_file_name);
}

# Run the BootstrapQTL analysis
eQTLs <- BootstrapQTL(snps, gene, snpspos, genepos, cvrt, n_bootstraps=10, n_cores=2)

[Package BootstrapQTL version 1.0.5 Index]