twosigmag {twosigma}R Documentation

Gene set testing for single-cell RNA-sequencing data adjusting for inter-gene correlation.

Description

Gene set testing for single-cell RNA-sequencing data adjusting for inter-gene correlation.

Usage

twosigmag(
  count_matrix,
  index_test,
  index_ref = NULL,
  all_as_ref = FALSE,
  mean_form,
  zi_form,
  mean_form_null = NULL,
  zi_form_null = NULL,
  id,
  statistic,
  lr.df = NULL,
  covar_to_test = NULL,
  contrast_matrix = NULL,
  factor_name = NULL,
  rho = NULL,
  allow_neg_corr = FALSE,
  return_summary_fits = FALSE,
  weights = NULL,
  control = glmmTMBControl(),
  ncores = 1,
  cluster_type = "Fork",
  chunk_size = 10,
  lb = FALSE
)

Arguments

count_matrix

Matrix of non-negative integer read counts. It is recommended to make the rownames the gene names for better output. No missing values can be present in the data.

index_test

List of indices corresponding to rows of the count matrix that are in the test set. Names of each list element (i.e. Gene Set Names) are carried forward to output if present.

index_ref

List of indices corresponding to rows of the count matrix that are in the reference set. If NULL, a reference set is randomly selected of the same size as the test size using genes not in the test set (if all_as_ref=FALSE) or using all other genes (if all_as_ref=TRUE). See all_as_ref. Must be either NULL or a list with the same length as index_test.

all_as_ref

Should all genes not in the test set be used as the reference? If FALSE, a random subset is taken of size equal to the test size.

mean_form

Two-sided model formula for the (conditional) mean model. Formula is passed directly into glmmTMB with random effects specified as in the lme4 package. Users should ensure that the LHS of the formula contains 'count '.

zi_form

One-sided model formula for the zero-inflation model under the alternative. Formula is passed directly into glmmTMB with random effects specified as in the lme4 package.

mean_form_null

Two-sided model formula for the (conditional) mean model under the null. Needed if and only if statistic='LR'. Syntax is as in mean_form. Users should ensure that the LHS of the formula contains 'count '.

zi_form_null

One-sided model formula for the zero-inflation model under the null. Needed if and only if statistic='LR'. Syntax is as in zi_form.

id

Vector of individual-level (sample-level) ID's. Used to estimate inter-gene correlation and random effect prediction (if present) and is currently required.

statistic

Which gene-level statistic should be used. Options are Likelihood Ratio ("LR", default), Z-statistic from the mean model ("Z"),the Stouffer's method combined Z-statistic ("Stouffer"), or a contrast of regression parameters ("contrast"). If "Stouffer", covar_to_test must be in both components. If "contrast", covar_to_test is not used and must be NULL.

lr.df

degrees of freedom for the asymptotic chi-square approximation to the likelihood ratio statistic. Needed if and only if statistic='LR'.

covar_to_test

Covariate used for reporting direction (as Up or Down) of the test set and for collecting gene-level statistics. Either a string indicating the name of the covariate to use or an integer giving its associated position in the RHS of the mean_form argument. If a string, the name is matched to the predictors of the mean model, so users should ensure such a match would be unique. Not required and should be NULL if statistic='contrast'.

contrast_matrix

Matrix of contrasts of regression parameters from the mean model to be tested. Each row will have separate gene-level and set-level statistics. Rownames of contrast_matrix should correspond to a meaningful name of the hypothesis for nicely formatted output. If testing a factor, must have a number of columns exactly equal to the number of levels of the factor. Otherwise, must have one column per parameter in the mean model (including a column for the intercept.)

factor_name

Name of the factor being tested by contrast_matrix. Needed if and only if statistic='contrast' and contrast_matrix is testing a factor variable in the mean model.

rho

Inter-gene correlation value. If NULL (default), estimated using TWO-SIGMA model residuals.

allow_neg_corr

Should negative correlation values be allowed? If FALSE, negative correlations are set to zero (leads to conservative inference)..

return_summary_fits

If TRUE, returns a list containing objects of class summary.glmmTMB for each gene.

weights

weights, as in glm. Defaults to 1 for all observations and no scaling or centering of weights is performed.

control

Control parameters for optimization in glmmTMB. See ?glmmTMBControl.

ncores

Number of cores used for parallelization. Defaults to 1, meaning no parallelization of any kind is done.

cluster_type

Whether to use a "cluster of type "Fork" or "Sock". On Unix systems, "Fork" will likely improve performance. On Windows, only "Sock" will actually result in parallelized computing.

chunk_size

Number of genes to be sent to each parallel environment. Parallelization is more efficient, particularly with a large count matrix, when the count matrix is 'chunked' into some common size (e.g. 10, 50, 200). Defaults to 10.

lb

Should load balancing be used for parallelization? Users will likely want to set to FALSE for improved performance.

Value

A list with the following elements: ##'

Examples

# Set Parameters to Simulate Some Data

nind<-10;ncellsper<-rep(50,nind)
sigma.a<-.5;sigma.b<-.5;phi<-.1
alpha<-c(1,0,-.5,-2);beta<-c(2,0,-.1,.6)
beta2<-c(2,1,-.1,.6)
id.levels<-1:nind;nind<-length(id.levels)
id<-rep(id.levels,times=ncellsper)
sim.seed<-1234

# Simulate individual level covariates

t2d_sim<-rep(rbinom(nind,1,p=.4),times=ncellsper)
cdr_sim<-rbeta(sum(ncellsper),3,6)
age_sim<-rep(sample(c(20:60),size=nind,replace = TRUE),times=ncellsper)

# Construct design matrices

Z<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(Z)<-c("t2d_sim","age_sim","cdr_sim")
X<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(X)<-c("t2d_sim","age_sim","cdr_sim")

# Simulate Data, half under null half under alternative

sim_dat<-matrix(nrow=4,ncol=sum(ncellsper))
for(i in 1:nrow(sim_dat)){
 if(i<2){# Gene Sets Under the Null
   sim_dat[i,]<-simulate_zero_inflated_nb_random_effect_data(ncellsper,X,Z,alpha,beta2
   ,phi,sigma.a,sigma.b,id.levels=NULL)$Y
 }else{# Gene Sets Under the Alternative
   sim_dat[i,]<-simulate_zero_inflated_nb_random_effect_data(ncellsper,X,Z,alpha,beta
   ,phi,sigma.a,sigma.b,id.levels=NULL)$Y
 }
}
rownames(sim_dat)<-paste("Gene",1:4)

# Run twosigmag

twosigmag(sim_dat,index_test = list(c(1,3)),all_as_ref = TRUE,mean_form = count~X
,zi_form = ~0,id=id,covar_to_test  = "t2d_sim",statistic = "Z")

[Package twosigma version 1.0.2 Index]