R: Gene-Based Segregation Test

GESE {GESE}

R Documentation

Gene-Based Segregation Test

Description

Computes the gene-based segregation information and tests for family-based sequencing data.

Usage

GESE(pednew, variantInformation, dbSize, dataPed, mapInfo,
     threshold = 1e-7, onlySeg = FALSE, familyWeight = NA)

Arguments

`pednew`	A data frame of the complete pedigree information for all families in the dataset. The required column names of this data frame include: FID (family ID), IID (individual ID, must be of class character), faID (father ID, NA if unavailable), moID (mother ID, NA if unavailable), and sex.
`variantInformation`	A data frame containing the information for all the variants satisfying the same filtering criteria in the chosen reference genome. It should include at least three columns with these names: SNP (unique SNP ID), GENE (gene name), and MAF (minor allele frequency of the variant in the reference database for the corresponding population).
`dbSize`	An integer indicating the sample size of the reference database used.
`dataPed`	A data frame in the `raw` file format generated by PLINK. The number of rows equals the number of subjects in the data, and the number of columns equals the number of markers M + 6. The first six columns, with specific column names, are: family ID (FID), individual ID (IID), father ID (PAT), mother ID (MAT), sex (SEX), and affection status (PHENOTYPE). The remaining columns contain the genotypes for the variants listed in the corresponding `mapInfo` file. Make sure that the recoding is with respect to the minor allele in the population. The affection status in this file is used as the phenotype.
`mapInfo`	A data frame that contains at least two columns (required column names): variant ID (SNP) and gene name (GENE). The number of rows equals the number of SNPs/markers to be considered (M).
`threshold`	The precision that must be reached for significant p-values. The default value is 1e-7.
`onlySeg`	TRUE if only the segregation information (number of pedigrees segregating in each gene) is needed; otherwise FALSE (default), in which case the GESE p-values are computed as well.
`familyWeight`	An optional data frame giving the weights for the families. If it is NA, no weighting scheme is used. Otherwise, its dimension is either (number of families) x (number of genes + 1) or (number of families) x 2. The first column should be the family name (column name FID). If the weights for the families are the same for all genes, the second column should hold the weight (column name "weight"); otherwise, the second and subsequent columns should hold the gene-specific weights (column names are the corresponding GENE names).
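
As a rough illustration of the input shapes described above, the following base R sketch builds toy data frames with the required column names. All IDs, gene names, and weights are invented, the 1/2 coding for `sex` is an assumption borrowed from the PLINK convention, and `check_columns` is a hypothetical helper that is not part of GESE.

```r
# Hypothetical helper (not part of GESE): stop if required columns are missing.
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy pedigree: one family with two founders and one child (invented IDs).
pednew_toy <- data.frame(
  FID  = c("F1", "F1", "F1"),
  IID  = c("1", "2", "3"),      # IID must be of class character
  faID = c(NA, NA, "1"),        # NA where the father is unavailable
  moID = c(NA, NA, "2"),        # NA where the mother is unavailable
  sex  = c(1, 2, 1),            # assumed coding: 1 = male, 2 = female
  stringsAsFactors = FALSE
)
check_columns(pednew_toy, c("FID", "IID", "faID", "moID", "sex"))

# Toy marker map: two variants in two invented genes.
mapInfo_toy <- data.frame(
  SNP  = c("rs1", "rs2"),
  GENE = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)
check_columns(mapInfo_toy, c("SNP", "GENE"))

# Same weight for all genes: (number of families) x 2.
weight_common <- data.frame(
  FID    = c("F1", "F2"),
  weight = c(1, 2),
  stringsAsFactors = FALSE
)

# Gene-specific weights: (number of families) x (number of genes + 1).
weight_by_gene <- data.frame(
  FID    = c("F1", "F2"),
  GENE_A = c(1, 2),
  GENE_B = c(0.5, 1),
  stringsAsFactors = FALSE
)
```

Either weight data frame could then be passed as `familyWeight`; checking the column names up front is only a convenience and does not replace the function's own input handling.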

Details

This is the main function in the GESE package. The gene-based segregation test (GESE) described in Qiao et al (2017) is a segregation-based test that extends the work of Bureau et al (2014) by computing the marginal probability of segregation events within a gene. The first step in this function is to trim the families so that only one lineage (the one with the largest possible number of cases) is included; that is, for any subject, only the information from either the paternal pedigree or the maternal pedigree is retained. In addition, if multiple founder cases are present, the function removes the smallest set of founders that are unrelated to most other sequenced subjects. It then computes the gene-based segregation information and p-values for multiple families. If only the segregation information (number of families segregating in each gene) is needed, set onlySeg = TRUE. If different family weights are to be used to boost the power, assign the weights to the familyWeight parameter.
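
The `onlySeg = TRUE` option mentioned above can be sketched as follows, reusing the example data sets and the reference database size from the Examples section. The call is guarded so that it runs only when the GESE package is installed; accessing the returned components with `$` assumes the return value is a named list, which this page implies but does not state.

```r
# Run the sketch only if the GESE package is available.
gese_available <- requireNamespace("GESE", quietly = TRUE)

if (gese_available) {
  library(GESE)
  # Example data sets shipped with the package
  data(pednew)
  data(mapInfo)
  data(dataRaw)
  data(database)

  # Segregation information only; no p-values are computed.
  seg_only <- GESE(pednew, database, 1000000, dataRaw, mapInfo,
                   onlySeg = TRUE)

  # Assumed access pattern: components of a named list.
  print(head(seg_only$segregation))
  print(head(seg_only$varSeg))
}
```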

Value

`segregation`	a data frame containing the information about whether each gene is segregating in each family. The number of columns equals the number of families +3. The last column is the number of families the gene is segregating in. The number of rows equals the number of genes. Only this data frame and `varSeg` will be returned if onlySeg is set to TRUE.
`varSeg`	a data frame containing the information about whether each variant is segregating in each family. The number of columns equals the number of families +3. The last column is the number of families the variant is segregating in. The number of rows equals the number of variants. Only this data frame and `segregation` will be returned if onlySeg is set to TRUE.
`results`	Available when onlySeg = FALSE. This data frame contains the columns: GENE (gene name), obs_prob (the observed segregating probability for the gene), pvalue (gene-based p-value for GESE), numSim (the number of simulations used to compute the p-value if the resampling-based method is used), and N_seg (the number of families that are segregating in the gene). If familyWeight is not NA, obs_weight_stat (the observed weighted test statistic) and pvalue_weighted (the p-value for the weighted test statistic) are also returned.
`condSegProb`	A vector whose length equals the number of families. It gives the conditional probability that at least one variant in the gene is segregating in the family, conditional on at least one variant (among the set of variants to be considered) being present in the family.
`segProbGene`	A matrix of the segregating probability for the gene and for each family. This is a working matrix that could be used in other functions.
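
To make the layout of `results` concrete, the sketch below uses a mock data frame with the column names listed above and shows one possible way to rank genes. Every value is invented for illustration, and the `NA` entries in `numSim` are placeholders rather than documented behavior.

```r
# Mock stand-in for the `results` data frame; all values are invented.
results_mock <- data.frame(
  GENE     = c("GENE_A", "GENE_B", "GENE_C"),
  obs_prob = c(1e-4, 0.2, 0.03),
  pvalue   = c(2e-4, 0.35, 0.04),
  numSim   = c(NA, NA, NA),     # placeholder values
  N_seg    = c(2, 0, 1),
  stringsAsFactors = FALSE
)

# Keep genes segregating in at least one family, smallest p-value first.
seg_genes <- results_mock[results_mock$N_seg >= 1, ]
seg_genes <- seg_genes[order(seg_genes$pvalue), ]
print(seg_genes[, c("GENE", "pvalue", "N_seg")])
```

With real output, the same subsetting would apply to the `results` component returned when onlySeg = FALSE.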

Author(s)

Dandi Qiao

References

Qiao, D., Lange, C., Laird, N.M., Won, S., Hersh, C.P., et al. (2017). Gene-based segregation method for identifying rare variants for family-based sequencing studies. Genet Epidemiol 41(4):309-319. DOI:10.1002/gepi.22037.

http://scholar.harvard.edu/dqiao/gese

Bureau, A., Younkin, S.G., Parker, M.M., Bailey-Wilson, J.E., Marazita, M.L., et al. (2014). Inferring rare disease risk variants based on exact probabilities of sharing by multiple affected relatives. Bioinformatics 30, 2189-2196. DOI:10.1093/bioinformatics/btu198.

Examples

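library(GESE)
# Load the example data sets shipped with the package
data(pednew)
data(mapInfo)
data(dataRaw)
data(database)
# Reference database size of 1,000,000; p-value precision of 1e-3 instead of the default 1e-7
results <- GESE(pednew, database, 1000000, dataRaw, mapInfo, threshold = 1e-3)
results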

[Package GESE version 2.0.1 Index]