repfdr {repfdr} | R Documentation |
Bayes and local Bayes false discovery rate estimation for replicability analysis
Description
Estimate Bayes and local Bayes false discovery rates (FDRs) from multiple studies, for replicability analysis and for meta-analysis, as presented in Heller and Yekutieli (see reference below).
Usage
repfdr(pdf.binned.z, binned.z.mat,
non.null = c("replication", "meta-analysis",
"user.defined"),
non.null.rows = NULL,Pi.previous.result = NULL,
control = em.control(), clusters = NULL,
clusters.ldr.report=NULL, clusters.verbose=T)
Arguments
pdf.binned.z |
A 3-dimensional array which contains for each study (first dimension), the probabilities of a z-score to fall in the bin (second dimension), under each hypothesis status (third dimension). The third dimension can be of size 2 or 3, depending on the number of association states: if the association can be either null or non-null (e.g. only in one direction), the dimension is 2; if the association can be either null, or positive, or negative, the dimension is 3.
Element |
binned.z.mat |
A matrix of the bin numbers for each of the z-scores (rows) in each study (columns).
Element |
non.null |
Indicates the desired analysis: |
non.null.rows |
Vector of row indices in |
Pi.previous.result |
An optional Vector of probabilities for each association status. If |
control |
List of control parameters to pass to the EM algorithm. See |
clusters |
Used for performing analysis in each cluster, and than aggregating results together.Default value is |
clusters.ldr.report |
Sets whether local fdr values (available through the function |
clusters.verbose |
if set to |
Details
For N
studies, each examining the same M
features, the binned z-scores and the (estimated) probabilities under the null and non-null states in each study are given as input.
These inputs can be produced from the z-scores using the function ztobins
.
The function calls piem
for the computation of the probabilities for each vector of association status. The number of probabilies estimated is x^N
, where x=2,3
is the number of possible association states in each study.
The function calls ldr
for the computation of the conditional probability of each of the vectors of association status in the null set given the binned z-scores. The null set contains the rows in hconfigs(N,x)
that: are excluded from non.null.rows
if non.null
is user.defined
; that are non-zero if non.null
is meta-analysis
; that contain at most one 1 if non.null
is replication
and x=2
; that contain at most one 1 or one -1 if non.null
is replication
and x=3
.
The local Bayes FDR is estimated to be the sum of conditional probabilities in the null set for each feature. The empirical Bayes FDR is the average of all local Bayes FDRs that are at most the value of the local Bayes FDR for each feature. The list of discoveries at level q are all features with empirical Bayes FDR at most q.
If many studies are available, one may not be able to compute RepFDR directly at the original data. If however, different groups of studies are known to be independent (e.g., if a SNP is non null for studies 1,2 is independent of the SNP being non null in studies 3,4) one may Run RepFDR in each cluster seperatly and then aggregate the results. This is done by providing a vector for the clusters
argument, with an integer value stating the cluster membership for each study.
See the values section below for the results returned from this function, when partitioning the data to clusters.
See vignette('RepFDR') for a complete example, on how to run RepFDR in clusters.
Value
mat |
An |
Pi |
Vector of the estimated probabilities for each of the |
repfdr.mat.percluster |
Returned if |
repfdr.Pi.percluster |
Returned if |
ldr |
Returned if |
Author(s)
Ruth Heller, Shachar Kaufman, Shay Yaacoby, Barak Brill, Daniel Yekutieli.
References
Heller, R., & Yekutieli, D. (2014). Replicability analysis for genome-wide association studies. The Annals of Applied Statistics, 8(1), 481-498.
Heller, R., Yaacoby, S., & Yekutieli, D. (2014). repfdr: a tool for replicability analysis for genome-wide association studies. Bioinformatics, btu434.
Examples
#### Example 1: a simulation; each feature in each study has two association states,
#### null and positive, prior is known
#This example generates the Z scores for two studies, with 0.05 probability to have
# non - null signal in each study.
# The prior matrix is being pregenerated to show the optimal values.
# if this matrix was not supplied, the repfdr method would estimate it
# using an EM algorithm. See the next examples for estimating the prior as well using repfdr.
set.seed(1)
n = 2 #two studies
m=10000 # ten thounsand, SNPs
H_Study_1 = rbinom(m,1,prob = 0.05) #signal of 1, for SNPS with association in the first study
H_Study_2 = rbinom(m,1,prob = 0.05) #signal of 1, for SNPS with association in the second study
Zmat = matrix(rnorm(n*m),nrow = m) #generate matrix
#insert signal (mean shift of 3) for the first study
Zmat[which(H_Study_1==1),1] = Zmat[which(H_Study_1==1),1] + 4
#insert signal to the second study
Zmat[which(H_Study_2==1),2] = Zmat[which(H_Study_2==1),2] + 4
#estimate densities via ztobins:
ztobins_res = ztobins(Zmat,n.association.status = 2,plot.diagnostics = FALSE,n.bin= 100)
#writing out the prior explicitly. If this was not supplied,
#the repfdr would try to estimate this prior from the data.
Precomputed_Pi = matrix(NA,ncol = 3,nrow = 4)
Precomputed_Pi[,1] = c(0,1,0,1)
Precomputed_Pi[,2] = c(0,0,1,1)
Precomputed_Pi[,3] = c(0.95^2,0.95*0.05,0.95*0.05,0.05^2)
colnames(Precomputed_Pi) = c('Study 1','Study 2','Pi')
#run repfdr
repfdr_res = repfdr(ztobins_res$pdf.binned.z,
ztobins_res$binned.z.mat,
non.null = 'replication',
Pi.previous.result = Precomputed_Pi)
#The precomputed prior matrix. if this would not
repfdr_res$Pi
#local fdr0 and Fdr for each SNP
head(repfdr_res$mat)
Non_Null = which(H_Study_1 ==1 & H_Study_2 == 1)
Reported = which(repfdr_res$mat[,2] <= 0.05)
TP = length(intersect(Reported, Non_Null))
TP
FP = length(Reported) - TP
FP
FN = length(Non_Null - TP)
FN
#### Example 2: a simulation; each feature in each study has two association states,
#### null and positive, prior is estimated
## Not run:
# a) Replicablity analysis:
data(binned_zmat_sim) # this loads the binned z-scores as well as the (estimated) probabilities
# in each bin for each state
output.rep <- repfdr(pbz_sim, bz_sim, "replication")
BayesFdr.rep <- output.rep$mat[,"Fdr"]
Rej <- (BayesFdr.rep <= 0.05)
sum(Rej)
# which of the tests are true replicability findings? (we know this since the data was simulated)
data(hmat_sim)
true.rep <- apply(hmat_sim,1,function(y){ sum(y==1)>1 })
# Compute the false discovery proportion (FDP) for replicability:
sum(Rej * !true.rep) / sum(true.rep)
# we can use the previously calculated Pi for further computations (e.g meta-analysis):
Pi_sim <- output.rep$Pi
# b) meta-analysis:
output.meta <- repfdr(pbz_sim, bz_sim, "meta-analysis", Pi.previous.result = Pi_sim)
BayesFdr.meta <- output.meta$mat[,"Fdr"]
Rej <- (BayesFdr.meta <= 0.05)
sum(Rej)
# which of the tests are true association findings? (we know this since the data was simulated)
true.assoc <- rowSums(hmat_sim) >= 1
# Compute the false discovery proportion (FDP) for association:
sum(Rej * !true.assoc) / sum(true.assoc)
## End(Not run)
## Not run:
#### Example 3: SNPs data; each SNP in each study has three association states,
#### negative, null, or positive:
# load the bins of the z-scores and their probabilities.
download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData',
destfile = "binned_zmat.RData")
load(file = "binned_zmat.RData")
# can also be generated from SNPlocation - see ztobins documentation.
# load the prior probabilities for each association status vector.
data(Pi)
Pi # the proportions vector was computed using piem()
# with the following command: Pi <- piem(pbz, bz)$last.iteration
# a) replicablity analysis:
output.rep <- repfdr(pbz, bz, "replication",Pi.previous.result=Pi)
BayesFdr.rep <- output.rep$mat[,"Fdr"]
Rej <- sum(BayesFdr.rep <= 0.05)
sum(Rej)
# The posterior probabilities for the first five features with Bayes FDR at most 0.05:
post <- ldr(pbz,bz[order(BayesFdr.rep)[1:5],],Pi)
round(post,4)
# posteriors for a subset of the association status vectors can also be reported:
H <- hconfigs( dim(bz)[2], 3)
h.replicability = apply(H, 1, function(y) {sum(y == 1)> 1 | sum(y == -1) >1})
post <- ldr(pbz,bz[order(BayesFdr.rep)[1:5],],Pi,h.vecs= which(h.replicability==1))
round(post,4)
# b) meta-analysis:
output.meta <- repfdr(pbz, bz, "meta-analysis", Pi.previous.result = Pi)
BayesFdr.meta <- output.meta$mat[,"Fdr"]
Rej <- sum(BayesFdr.meta <= 0.05)
sum(Rej)
## End(Not run)
## manhattan plot (ploting can take a while):
# code for manhattan plot by Stephen Turner (see copyrights at the source code manhattan.r)
## Not run:
data(SNPlocations)
par(mfrow=c(2,1))
# Replication
manhattan(dataframe=cbind(SNPlocations,P=BayesFdr.rep),ymax=10.5,pch=20,
limitchromosomes=1:4,suggestiveline=-log(0.05,10),genomewideline=F,cex=0.25,
annotate=SNPlocations$SNP[BayesFdr.rep<=0.05],main="Replication")
# Association
manhattan(dataframe=cbind(SNPlocations,P=BayesFdr.meta),ymax=10.5,cex=0.25,
limitchromosomes=1:4,suggestiveline=-log(0.05,10),genomewideline=F,pch=20,
annotate=SNPlocations$SNP[BayesFdr.rep<=0.05],main="Meta-analysis")
par(mfrow=c(1,1))
## End(Not run)