RGWAS.multisnp {RAINBOWR} | R Documentation |
Testing multiple SNPs simultaneously for GWAS
Description
This function performs SNP-set GWAS (genome-wide association studies), which tests multiple SNPs (single nucleotide polymorphisms) simultaneously. The model of SNP-set GWAS is
y = X \beta + Q v + Z _ {c} u _ {c} + Z _ {r} u _ {r} + \epsilon,
where y
is the vector of phenotypic values,
X \beta
and Q v
are the terms of fixed effects,
Z _ {c} u _ {c}
and Z _ {c} u _ {c}
are the term of random effects and e
is the vector of residuals.
X \beta
indicates all of the fixed effects other than population structure, and often this term also plays
a role as an intercept. Q v
is the term to correct the effect of population structure.
Z _ {c} u _ {c}
is the term of polygenetic effects, and suppose that u _ {c}
follows the multivariate normal distribution whose variance-covariance
matrix is the genetic covariance matrix. u _ {c} \sim MVN (0, K _ {c} \sigma_{c}^{2})
.
Z _ {r} u _ {r}
is the term of effects for SNP-set of interest, and suppose that u _ {r}
follows the multivariate normal distribution whose variance-covariance
matrix is the Gram matrix (linear, exponential, or gaussian kernel)
calculated from marker genotype which belong to that SNP-set.
Therefore, u _ {r} \sim MVN (0, K _ {r} \sigma_{r}^{2})
.
Finally, the residual term is assumed to identically and independently follow
a normal distribution as shown in the following equation.
e \sim MVN (0, I \sigma_{e}^{2})
.
Usage
RGWAS.multisnp(
pheno,
geno,
ZETA = NULL,
package.MM = "gaston",
covariate = NULL,
covariate.factor = NULL,
structure.matrix = NULL,
n.PC = 0,
min.MAF = 0.02,
test.method = "LR",
n.core = 1,
parallel.method = "mclapply",
kernel.method = "linear",
kernel.h = "tuned",
haplotype = TRUE,
num.hap = NULL,
test.effect = "additive",
window.size.half = 5,
window.slide = 1,
chi0.mixture = 0.5,
gene.set = NULL,
map.gene.set = NULL,
weighting.center = TRUE,
weighting.other = NULL,
sig.level = 0.05,
method.thres = "BH",
plot.qq = TRUE,
plot.Manhattan = TRUE,
plot.method = 1,
plot.col1 = c("dark blue", "cornflowerblue"),
plot.col2 = 1,
plot.type = "p",
plot.pch = 16,
saveName = NULL,
main.qq = NULL,
main.man = NULL,
plot.add.last = FALSE,
return.EMM.res = FALSE,
optimizer = "nlminb",
thres = TRUE,
skip.check = FALSE,
verbose = TRUE,
verbose2 = FALSE,
count = TRUE,
time = TRUE
)
Arguments
pheno |
Data frame where the first column is the line name (gid). The remaining columns should be a phenotype to test. |
geno |
Data frame with the marker names in the first column. The second and third columns contain the chromosome and map position. Columns 4 and higher contain the marker scores for each line, coded as [-1, 0, 1] = [aa, Aa, AA]. |
ZETA |
A list of covariance (relationship) matrix (K: ZETA = list(A = list(Z = Z.A, K = K.A), D = list(Z = Z.D, K = K.D))
For example, K.A is additive relationship matrix for the covariance between lines, and K.D is dominance relationship matrix. |
package.MM |
The package name to be used when solving mixed-effects model. We only offer the following three packages:
"RAINBOWR", "MM4LMM" and "gaston". Default package is 'gaston'.
See more details at |
covariate |
A |
covariate.factor |
A |
structure.matrix |
You can use structure matrix calculated by structure analysis when there are population structure. You should not use this argument with n.PC > 0. |
n.PC |
Number of principal components to include as fixed effects. Default is 0 (equals K model). |
min.MAF |
Specifies the minimum minor allele frequency (MAF). If a marker has a MAF less than min.MAF, it is assigned a zero score. |
test.method |
RGWAS supports two methods to test effects of each SNP-set.
|
n.core |
Setting n.core > 1 will enable parallel execution on a machine with multiple cores. This argument is not valid when 'parallel.method = "furrr"'. |
parallel.method |
Method for parallel computation. We offer three methods, "mclapply", "furrr", and "foreach". When 'parallel.method = "mclapply"', we utilize When 'parallel.method = "furrr"', we utilize When 'parallel.method = "foreach"', we utilize We recommend that you use the option 'parallel.method = "mclapply"', but for Windows users, this parallelization method is not supported. So, if you are Windows user, we recommend that you use the option 'parallel.method = "foreach"'. |
kernel.method |
It determines how to calculate kernel. There are three methods.
So local genomic relation matrix is regarded as kernel. |
kernel.h |
The hyper parameter for gaussian or exponential kernel. If kernel.h = "tuned", this hyper parameter is calculated as the median of off-diagonals of distance matrix of genotype data. |
haplotype |
If the number of lines of your data is large (maybe > 100), you should set haplotype = TRUE. When haplotype = TRUE, haplotype-based kernel will be used for calculating -log10(p). (So the dimension of this gram matrix will be smaller.) The result won't be changed, but the time for the calculation will be shorter. |
num.hap |
When haplotype = TRUE, you can set the number of haplotypes which you expect. Then similar arrays are considered as the same haplotype, and then make kernel(K.SNP) whose dimension is num.hap x num.hap. When num.hap = NULL (default), num.hap will be set as the maximum number which reflects the difference between lines. |
test.effect |
Effect of each marker to test. You can choose "test.effect" from "additive", "dominance" and "additive+dominance". You also can choose more than one effect, for example, test.effect = c("additive", "aditive+dominance") |
window.size.half |
This argument decides how many SNPs (around the SNP you want to test) are used to calculated K.SNP. More precisely, the number of SNPs will be 2 * window.size.half + 1. |
window.slide |
This argument determines how often you test markers. If window.slide = 1, every marker will be tested. If you want to perform SNP set by bins, please set window.slide = 2 * window.size.half + 1. |
chi0.mixture |
RAINBOWR assumes the deviance is considered to follow a x chisq(df = 0) + (1 - a) x chisq(df = r). where r is the degree of freedom. The argument chi0.mixture is a (0 <= a < 1), and default is 0.5. |
gene.set |
If you have information of gene (or haplotype block), you can use it to perform kernel-based GWAS. You should assign your gene information to gene.set in the form of a "data.frame" (whose dimension is (the number of gene) x 2). In the first column, you should assign the gene name. And in the second column, you should assign the names of each marker, which correspond to the marker names of "geno" argument. |
map.gene.set |
Genotype map for 'gene.set' (list of haplotype blocks).
This is a data.frame with the haplotype block (SNP-set, or gene-set) names in the first column.
The second and third columns contain the chromosome and map position for each block.
The forth column contains the cumulative map position for each block, which can be computed by |
weighting.center |
In kernel-based GWAS, weights according to the Gaussian distribution (centered on the tested SNP) are taken into account when calculating the kernel if Rainbow = TRUE. If weighting.center = FALSE, weights are not taken into account. |
weighting.other |
You can set other weights in addition to weighting.center. The length of this argument should be equal to the number of SNPs. For example, you can assign SNP effects from the information of gene annotation. |
sig.level |
Significance level for the threshold. The default is 0.05. |
method.thres |
Method for detemining threshold of significance. "BH" and "Bonferroni are offered. |
plot.qq |
If TRUE, draw qq plot. |
plot.Manhattan |
If TRUE, draw manhattan plot. |
plot.method |
If this argument = 1, the default manhattan plot will be drawn. If this argument = 2, the manhattan plot with axis based on Position (bp) will be drawn. Also, this plot's color is changed by all chromosomes. |
plot.col1 |
This argument determines the color of the manhattan plot. You should substitute this argument as color vector whose length is 2. plot.col1[1] for odd chromosomes and plot.col1[2] for even chromosomes |
plot.col2 |
Color of the manhattan plot. color changes with chromosome and it starts from plot.col2 + 1 (so plot.col2 = 1 means color starts from red.) |
plot.type |
This argument determines the type of the manhattan plot. See the help page of "plot". |
plot.pch |
This argument determines the shape of the dot of the manhattan plot. See the help page of "plot". |
saveName |
When drawing any plot, you can save plots in png format. In saveName, you should substitute the name you want to save. When saveName = NULL, the plot is not saved. |
main.qq |
The title of qq plot. If this argument is NULL, trait name is set as the title. |
main.man |
The title of manhattan plot. If this argument is NULL, trait name is set as the title. |
plot.add.last |
If saveName is not NULL and this argument is TRUE, then you can add lines or dots to manhattan plots. However, you should also write "dev.off()" after adding something. |
return.EMM.res |
When return.EMM.res = TRUE, the results of equation of mixed models are included in the result of RGWAS. |
optimizer |
The function used in the optimization process. We offer "optim", "optimx", and "nlminb" functions. |
thres |
If thres = TRUE, the threshold of the manhattan plot is included in the result of RGWAS. When return.EMM.res or thres is TRUE, the results will be "list" class. |
skip.check |
As default, RAINBOWR checks the type of input data and modifies it into the correct format. However, it will take some time, so if you prepare the correct format of input data, you can skip this procedure by setting 'skip.check = TRUE'. |
verbose |
If this argument is TRUE, messages for the current steps will be shown. |
verbose2 |
If this argument is TRUE, welcome message will be shown. |
count |
When count is TRUE, you can know how far RGWAS has ended with percent display. |
time |
When time is TRUE, you can know how much time it took to perform RGWAS. |
Details
P-value for each SNP-set is calculated by performing the LR test or the score test (Lippert et al., 2014).
In the LR test, first, the function solves the multi-kernel mixed model and calaculates the maximum restricted log likelihood. Then it performs the LR test by using the fact that the deviance
D = 2 \times (LL _ {alt} - LL _ {null})
follows the chi-square distribution.
In the score test, the maximization of the likelihood is only performed for the null model. In other words, the function calculates the score statistic without solving the multi-kernel mixed model for each SNP-set. Then it performs the score test by using the fact that the score statistic follows the chi-square distribution.
Value
- $D
Dataframe which contains the information of the map you input and the results of RGWAS (-log10(p)) which correspond to the map. If there are more than one test.effects, then multiple lists for each test.effect are returned respectively.
- $thres
A vector which contains the information of threshold determined by FDR = 0.05.
- $EMM.res
This output is a list which contains the information about the results of "EMM" perfomed at first in regular GWAS. If you want to know details, see the description for the function "EMM1" or "EMM2".
References
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci. 100(16): 9440-9445.
Yu, J. et al. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 38(2): 203-208.
Kang, H.M. et al. (2008) Efficient Control of Population Structure in Model Organism Association Mapping. Genetics. 178(3): 1709-1723.
Endelman, J.B. (2011) Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome J. 4(3): 250.
Endelman, J.B. and Jannink, J.L. (2012) Shrinkage Estimation of the Realized Relationship Matrix. G3 Genes, Genomes, Genet. 2(11): 1405-1413.
Zhou, X. and Stephens, M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 44(7): 821-824.
Listgarten, J. et al. (2013) A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics. 29(12): 1526-1533.
Lippert, C. et al. (2014) Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics. 30(22): 3206-3214.
Examples
### Import RAINBOWR
require(RAINBOWR)
### Load example datasets
data("Rice_Zhao_etal")
Rice_geno_score <- Rice_Zhao_etal$genoScore
Rice_geno_map <- Rice_Zhao_etal$genoMap
Rice_pheno <- Rice_Zhao_etal$pheno
Rice_haplo_block <- Rice_Zhao_etal$haploBlock
### View each dataset
See(Rice_geno_score)
See(Rice_geno_map)
See(Rice_pheno)
See(Rice_haplo_block)
### Select one trait for example
trait.name <- "Flowering.time.at.Arkansas"
y <- as.matrix(Rice_pheno[, trait.name, drop = FALSE])
### Remove SNPs whose MAF <= 0.05
x.0 <- t(Rice_geno_score)
MAF.cut.res <- MAF.cut(x.0 = x.0, map.0 = Rice_geno_map)
x <- MAF.cut.res$x
map <- MAF.cut.res$map
### Estimate genomic relationship matrix (GRM)
K.A <- calcGRM(genoMat = x)
### Modify data
modify.data.res <- modify.data(pheno.mat = y, geno.mat = x, map = map,
return.ZETA = TRUE, return.GWAS.format = TRUE)
pheno.GWAS <- modify.data.res$pheno.GWAS
geno.GWAS <- modify.data.res$geno.GWAS
ZETA <- modify.data.res$ZETA
### View each data for RAINBOWR
See(pheno.GWAS)
See(geno.GWAS)
str(ZETA)
### Perform SNP-set GWAS (by regarding 21 SNPs as one SNP-set)
SNP_set.res <- RGWAS.multisnp(pheno = pheno.GWAS, geno = geno.GWAS,
ZETA = ZETA, n.PC = 4, test.method = "LR",
kernel.method = "linear", gene.set = NULL,
test.effect = "additive", window.size.half = 10,
window.slide = 21, package.MM = "gaston",
parallel.method = "mclapply",
skip.check = TRUE, n.core = 2)
See(SNP_set.res$D) ### Column 4 contains -log10(p) values for markers
### Perform SNP-set GWAS 2 (by regarding 11 SNPs as one SNP-set with sliding window)
### It will take almost 2 minutes...
SNP_set.res2 <- RGWAS.multisnp(pheno = pheno.GWAS, geno = geno.GWAS,
ZETA = ZETA, n.PC = 4, test.method = "LR",
kernel.method = "linear", gene.set = NULL,
test.effect = "additive", window.size.half = 5,
window.slide = 1, package.MM = "gaston",
parallel.method = "mclapply",
skip.check = TRUE, n.core = 2)
See(SNP_set.res2$D) ### Column 4 contains -log10(p) values for markers
### Perform haplotype-block GWAS (by using the list of haplotype blocks estimated by PLINK)
haplo_block.res <- RGWAS.multisnp(pheno = pheno.GWAS, geno = geno.GWAS,
ZETA = ZETA, n.PC = 4, test.method = "LR",
kernel.method = "linear", gene.set = Rice_haplo_block,
test.effect = "additive", package.MM = "gaston",
parallel.method = "mclapply",
skip.check = TRUE, n.core = 2)
See(haplo_block.res$D) ### Column 4 contains -log10(p) values for markers