R: Gene set testing for single-cell RNA-sequencing data...

twosigmag {twosigma}

R Documentation

Gene set testing for single-cell RNA-sequencing data adjusting for inter-gene correlation.

Description

Gene set testing for single-cell RNA-sequencing data adjusting for inter-gene correlation.

Usage

twosigmag(
  count_matrix,
  index_test,
  index_ref = NULL,
  all_as_ref = FALSE,
  mean_form,
  zi_form,
  mean_form_null = NULL,
  zi_form_null = NULL,
  id,
  statistic,
  lr.df = NULL,
  covar_to_test = NULL,
  contrast_matrix = NULL,
  factor_name = NULL,
  rho = NULL,
  allow_neg_corr = FALSE,
  return_summary_fits = FALSE,
  weights = NULL,
  control = glmmTMBControl(),
  ncores = 1,
  cluster_type = "Fork",
  chunk_size = 10,
  lb = FALSE
)

Arguments

`count_matrix`	Matrix of non-negative integer read counts. It is recommended to make the rownames the gene names for better output. No missing values can be present in the data.
`index_test`	List of indices corresponding to rows of the count matrix that are in the test set. Names of each list element (i.e. Gene Set Names) are carried forward to output if present.
`index_ref`	List of indices corresponding to rows of the count matrix that are in the reference set. If `NULL`, a reference set is randomly selected of the same size as the test size using genes not in the test set (if `all_as_ref=FALSE`) or using all other genes (if `all_as_ref=TRUE`). See `all_as_ref`. Must be either `NULL` or a list with the same length as `index_test`.
`all_as_ref`	Should all genes not in the test set be used as the reference? If `FALSE`, a random subset is taken of size equal to the test size.
`mean_form`	Two-sided model formula for the (conditional) mean model. Formula is passed directly into glmmTMB with random effects specified as in the lme4 package. Users should ensure that the LHS of the formula contains '`count` '.
`zi_form`	One-sided model formula for the zero-inflation model under the alternative. Formula is passed directly into glmmTMB with random effects specified as in the lme4 package.
`mean_form_null`	Two-sided model formula for the (conditional) mean model under the null. Needed if and only if `statistic='LR'`. Syntax is as in `mean_form`. Users should ensure that the LHS of the formula contains '`count` '.
`zi_form_null`	One-sided model formula for the zero-inflation model under the null. Needed if and only if `statistic='LR'`. Syntax is as in `zi_form`.
`id`	Vector of individual-level (sample-level) ID's. Used to estimate inter-gene correlation and random effect prediction (if present) and is currently required.
`statistic`	Which gene-level statistic should be used. Options are Likelihood Ratio ("LR", default), Z-statistic from the mean model ("Z"),the Stouffer's method combined Z-statistic ("Stouffer"), or a contrast of regression parameters ("contrast"). If "Stouffer", covar_to_test must be in both components. If "contrast", covar_to_test is not used and must be `NULL`.
`lr.df`	degrees of freedom for the asymptotic chi-square approximation to the likelihood ratio statistic. Needed if and only if `statistic='LR'`.
`covar_to_test`	Covariate used for reporting direction (as Up or Down) of the test set and for collecting gene-level statistics. Either a string indicating the name of the covariate to use or an integer giving its associated position in the RHS of the mean_form argument. If a string, the name is matched to the predictors of the mean model, so users should ensure such a match would be unique. Not required and should be `NULL` if `statistic='contrast'.`
`contrast_matrix`	Matrix of contrasts of regression parameters from the mean model to be tested. Each row will have separate gene-level and set-level statistics. Rownames of `contrast_matrix` should correspond to a meaningful name of the hypothesis for nicely formatted output. If testing a factor, must have a number of columns exactly equal to the number of levels of the factor. Otherwise, must have one column per parameter in the mean model (including a column for the intercept.)
`factor_name`	Name of the factor being tested by `contrast_matrix`. Needed if and only if `statistic='contrast'` and `contrast_matrix` is testing a factor variable in the mean model.
`rho`	Inter-gene correlation value. If `NULL` (default), estimated using TWO-SIGMA model residuals.
`allow_neg_corr`	Should negative correlation values be allowed? If FALSE, negative correlations are set to zero (leads to conservative inference)..
`return_summary_fits`	If `TRUE`, returns a list containing objects of class `summary.glmmTMB` for each gene.
`weights`	weights, as in `glm`. Defaults to 1 for all observations and no scaling or centering of weights is performed.
`control`	Control parameters for optimization in `glmmTMB`. See `?glmmTMBControl`.
`ncores`	Number of cores used for parallelization. Defaults to 1, meaning no parallelization of any kind is done.
`cluster_type`	Whether to use a "cluster of type "Fork" or "Sock". On Unix systems, "Fork" will likely improve performance. On Windows, only "Sock" will actually result in parallelized computing.
`chunk_size`	Number of genes to be sent to each parallel environment. Parallelization is more efficient, particularly with a large count matrix, when the count matrix is 'chunked' into some common size (e.g. 10, 50, 200). Defaults to 10.
`lb`	Should load balancing be used for parallelization? Users will likely want to set to FALSE for improved performance.

Value

A list with the following elements: ##'

stats_gene_level_all: Gives all gene-level statistics. Order matches the order of the inputted count matrix.
p.vals_gene_level: Gives raw (unadjusted) p-values associated with stats_gene_level_all.
set_p.val: Unadjusted set-level p-values. Order matches the order of inputted test sets.
set_p.val_FDR: FDR-corrected (using the Benjamini-Hochberg procedure) set-level p-values. Order matches the order of inputted test sets.
estimates_gene_level: Gives the average logFC or contrast estimate for each gene.
se_gene_level: Standard error of the gene-level logFC values. Useful to construct gene-level summary statistics.
estimates_set_level: Gives the set-level average of the gene-level logFC or contrast estimates.
direction: Reports whether the test set tends to be Up or Down Regulated based on the sign of estimates_set_level.
corr: Vector of estimated inter-gene correlations for each test set. Order matches the order of inputted test sets.
gene_level_loglik: Vector of log-likelihood values for each gene. Values of NA indicates a model fitting or convergence problem for that gene.
gene_error: Vector indicating whether the particular gene produced an error during model fitting (TRUE) or not (FALSE).
test_sets: Vector of numeric indices corresponding to genes in each test set.
ref_sets: Vector of numeric indices corresponding to the genes in each reference set.
gene_summary_fits: Summary.glmmTMB objects for each gene from the alternative model (if return_summary_fits=TRUE)

Examples

# Set Parameters to Simulate Some Data

nind<-10;ncellsper<-rep(50,nind)
sigma.a<-.5;sigma.b<-.5;phi<-.1
alpha<-c(1,0,-.5,-2);beta<-c(2,0,-.1,.6)
beta2<-c(2,1,-.1,.6)
id.levels<-1:nind;nind<-length(id.levels)
id<-rep(id.levels,times=ncellsper)
sim.seed<-1234

# Simulate individual level covariates

t2d_sim<-rep(rbinom(nind,1,p=.4),times=ncellsper)
cdr_sim<-rbeta(sum(ncellsper),3,6)
age_sim<-rep(sample(c(20:60),size=nind,replace = TRUE),times=ncellsper)

# Construct design matrices

Z<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(Z)<-c("t2d_sim","age_sim","cdr_sim")
X<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(X)<-c("t2d_sim","age_sim","cdr_sim")

# Simulate Data, half under null half under alternative

sim_dat<-matrix(nrow=4,ncol=sum(ncellsper))
for(i in 1:nrow(sim_dat)){
 if(i<2){# Gene Sets Under the Null
   sim_dat[i,]<-simulate_zero_inflated_nb_random_effect_data(ncellsper,X,Z,alpha,beta2
   ,phi,sigma.a,sigma.b,id.levels=NULL)$Y
 }else{# Gene Sets Under the Alternative
   sim_dat[i,]<-simulate_zero_inflated_nb_random_effect_data(ncellsper,X,Z,alpha,beta
   ,phi,sigma.a,sigma.b,id.levels=NULL)$Y
 }
}
rownames(sim_dat)<-paste("Gene",1:4)

# Run twosigmag

twosigmag(sim_dat,index_test = list(c(1,3)),all_as_ref = TRUE,mean_form = count~X
,zi_form = ~0,id=id,covar_to_test  = "t2d_sim",statistic = "Z")

[Package twosigma version 1.0.2 Index]