rapidopgs_multi {RapidoPGS} | R Documentation |
Compute PGS from GWAS summary statistics using Bayesian sum of single-effect (SuSiE) linear regression using z scores
Description
rapidopgs_multi computes PGS from GWAS summary statistics
using Bayesian sum of single-effect (SuSiE) linear regression on z scores.
Usage
rapidopgs_multi(
data,
reference = NULL,
LDmatrices = NULL,
N = NULL,
build = c("hg19", "hg38"),
trait = c("cc", "quant"),
ncores = 1,
alpha.block = 1e-04,
alpha.snp = 0.01,
sd.prior = NULL,
ancestry = "EUR",
LDblocks = NULL
)
Arguments
data |
a data.table containing GWAS summary statistic dataset with all required information. |
reference |
a string representing the path to the directory containing the reference panel (eg. "../ref-data"). |
LDmatrices |
a string representing the path to the directory containing the pre-computed LD matrices. |
N |
a numeric indicating the number of individuals used to generate input GWAS dataset, or a string indicating the column name containing per-SNP sample size. |
build |
a string indicating the genome build. 'hg19' and 'hg38' are supported. Note that your LD matrices or reference panel should match the build. |
trait |
a string indicating if trait is a case-control ("cc") or quantitative ("quant"). |
ncores |
a numeric specifying the number of cores (CPUs) to be used. If using pre-computed LD matrices, one core is enough for best performance. |
alpha.block |
a numeric threshold for minimum P-value in LD blocks.
Blocks with minimum P above alpha.block will be skipped. |
alpha.snp |
a numeric threshold for P-value pruning within LD block.
SNPs with P above alpha.snp will be removed. |
sd.prior |
the prior specifies that BETA at causal SNPs follows a centred normal distribution with standard deviation sd.prior. If NULL (default) it will be automatically estimated (recommended). |
ancestry |
a string indicating the ancestral population (default: "EUR", European). If using an alternative population, bear in mind that your LD matrices or reference panel must be from the same population. You'll also need to provide matching LD blocks via the LDblocks argument. |
LDblocks |
a string indicating the path to an alternative LD block file in .RData format. Only required for non-European PGS. |
Details
This function will take a GWAS summary statistic dataset as an input,
will assign LD blocks to it, then use user-provided LD matrices or a preset
reference panel in Plink format to compute LD matrices for each block.
The SuSiE method is then used to compute the posterior probability that each variant is causal
and to generate PGS weights by multiplying those posteriors by effect sizes (\beta).
Unlike rapidopgs_single, this approach assumes one or more causal variants.
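The weighting step described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the posterior probabilities below are made-up values rather than SuSiE output, and the column names z, ppi and weight are hypothetical, not taken from the package.

```r
# Toy summary statistics for three SNPs (values are illustrative only).
ss <- data.frame(
  BETA = c(0.10, -0.05, 0.20),
  SE   = c(0.02, 0.05, 0.10)
)

# z scores are the effect sizes divided by their standard errors.
ss$z <- ss$BETA / ss$SE

# Hypothetical posterior probabilities of causality (made up here;
# in the real workflow these would come from the SuSiE fit).
ss$ppi <- c(0.90, 0.05, 0.30)

# PGS weight = posterior probability x effect size.
ss$weight <- ss$ppi * ss$BETA
print(ss)
```

A SNP with a large effect size but a low posterior probability therefore contributes little to the score, which is the intended shrinkage.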
The GWAS summary statistics file to compute PGS using our method must contain the following minimum columns, with these exact column names:
- CHR
Chromosome
- BP
Base position (in GRCh37/hg19).
- REF
Reference, or non-effect allele
- ALT
Alternative, or effect allele, the one \beta refers to
- BETA
\beta (or log(OR)), or effect sizes
- SE
Standard error of \beta
- P
P-value for the association test
In addition, quantitative traits must have the following extra column:
- ALT_FREQ
Minor allele frequency.
Also, for quantitative traits, sample size must be supplied, either as a number, or indicating the column name, for per-SNP sample size datasets (see below). Other columns are allowed, and will be ignored.
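Before calling the function, it can help to confirm that a dataset carries the columns listed above. The following is a minimal sketch in base R; missing_sumstats_cols is a hypothetical helper written for this illustration and is not part of RapidoPGS.

```r
# Hypothetical helper (not part of RapidoPGS): report which required
# columns are missing from a summary statistics dataset.
missing_sumstats_cols <- function(data, trait = c("cc", "quant")) {
  trait <- match.arg(trait)
  required <- c("CHR", "BP", "REF", "ALT", "BETA", "SE", "P")
  # Quantitative traits additionally need allele frequencies.
  if (trait == "quant") {
    required <- c(required, "ALT_FREQ")
  }
  setdiff(required, names(data))
}

# One-row toy dataset with the minimum columns for a case-control trait.
ss <- data.frame(
  CHR = "4", BP = 1479959, REF = "C", ALT = "A",
  BETA = 0.012, SE = 0.0099, P = 0.2237
)

missing_cc    <- missing_sumstats_cols(ss, trait = "cc")
missing_quant <- missing_sumstats_cols(ss, trait = "quant")
print(missing_cc)
print(missing_quant)
```

Because column names must match exactly, a check like this catches mismatches such as lower-case or differently spelled headers before any computation starts.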
Reference panel should be divided by chromosome, in Plink format.
Both reference panel and summary statistic dataset should be in GRCh37/hg19.
For the 1000 Genomes panel, you can use the create_1000G function to set it up automatically.
If you prefer to use LD matrices, you must indicate the path to the directory where they are stored. They must be in RDS format, named LD_chrZ.rds (where Z is the chromosome number, 1-22). If you don't have LD matrices already, we recommend downloading those kindly provided by Privé et al., at https://figshare.com/articles/dataset/European_LD_reference/13034123. These matrices were computed for 1,054,330 HapMap3 variants based on 362,320 European individuals of the UK Biobank.
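The file naming convention above can be verified before a long run. The sketch below, in base R, lists the 22 expected file names and flags which ones exist; check_ld_files is a hypothetical helper and the "ld-demo" directory is created only for the demonstration.

```r
# Hypothetical helper (not part of RapidoPGS): list the LD matrix files
# expected in a directory and flag which ones are present.
check_ld_files <- function(ld_dir, chromosomes = 1:22) {
  files <- sprintf("LD_chr%d.rds", chromosomes)
  data.frame(
    chr    = chromosomes,
    file   = files,
    exists = file.exists(file.path(ld_dir, files)),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a temporary directory holding a single placeholder matrix.
ld_dir <- file.path(tempdir(), "ld-demo")
dir.create(ld_dir, showWarnings = FALSE)
saveRDS(diag(2), file.path(ld_dir, "LD_chr1.rds"))

status <- check_ld_files(ld_dir)
print(head(status, 3))
```

In practice, ld_dir would be the same path passed to the LDmatrices argument.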
Value
a data.table containing the sumstats dataset with computed PGS weights.
Author(s)
Guillermo Reales, Chris Wallace
Examples
## Not run:
library(data.table)
library(RapidoPGS)

# Toy case-control summary statistics with the required columns
ss <- data.table(
  CHR = c("4", "20", "14", "2", "4", "6", "6", "21", "13"),
  BP = c(1479959, 13000913, 29107209, 203573414, 57331393, 11003529, 149256398,
         25630085, 79166661),
  REF = c("C", "C", "C", "T", "G", "C", "C", "G", "T"),
  ALT = c("A", "T", "T", "A", "A", "A", "T", "A", "C"),
  BETA = c(0.012, 0.0079, 0.0224, 0.0033, 0.0153, 0.058, 0.0742, 0.001, -0.0131),
  SE = c(0.0099, 0.0066, 0.0203, 0.0171, 0.0063, 0.0255, 0.043, 0.0188, 0.0074),
  P = c(0.2237, 0.2316, 0.2682, 0.8477, 0.01473, 0.02298, 0.08472, 0.9573, 0.07535)
)

# Compute PGS weights using the reference panel stored in "ref-data/"
PGS <- rapidopgs_multi(ss, reference = "ref-data/", N = 20000, build = "hg19",
                       trait = "cc", ncores = 5)
## End(Not run)