twosigmag {twosigma} | R Documentation |
Gene set testing for single-cell RNA-sequencing data adjusting for inter-gene correlation.
Description
Gene set testing for single-cell RNA-sequencing data adjusting for inter-gene correlation.
Usage
twosigmag(
count_matrix,
index_test,
index_ref = NULL,
all_as_ref = FALSE,
mean_form,
zi_form,
mean_form_null = NULL,
zi_form_null = NULL,
id,
statistic,
lr.df = NULL,
covar_to_test = NULL,
contrast_matrix = NULL,
factor_name = NULL,
rho = NULL,
allow_neg_corr = FALSE,
return_summary_fits = FALSE,
weights = NULL,
control = glmmTMBControl(),
ncores = 1,
cluster_type = "Fork",
chunk_size = 10,
lb = FALSE
)
Arguments
count_matrix |
Matrix of non-negative integer read counts. It is recommended to make the rownames the gene names for better output. No missing values can be present in the data. |
index_test |
List of indices corresponding to rows of the count matrix that are in the test set. Names of each list element (i.e. Gene Set Names) are carried forward to output if present. |
index_ref |
List of indices corresponding to rows of the count matrix that are in the reference set. If |
all_as_ref |
Should all genes not in the test set be used as the reference? If |
mean_form |
Two-sided model formula for the (conditional) mean model. Formula is passed directly into glmmTMB with random effects specified as in the lme4 package. Users should ensure that the LHS of the formula contains ' |
zi_form |
One-sided model formula for the zero-inflation model under the alternative. Formula is passed directly into glmmTMB with random effects specified as in the lme4 package. |
mean_form_null |
Two-sided model formula for the (conditional) mean model under the null. Needed if and only if |
zi_form_null |
One-sided model formula for the zero-inflation model under the null. Needed if and only if |
id |
Vector of individual-level (sample-level) ID's. Used to estimate inter-gene correlation and random effect prediction (if present) and is currently required. |
statistic |
Which gene-level statistic should be used. Options are Likelihood Ratio ("LR", default), Z-statistic from the mean model ("Z"),the Stouffer's method combined Z-statistic ("Stouffer"), or a contrast of regression parameters ("contrast"). If "Stouffer", covar_to_test must be in both components. If "contrast", covar_to_test is not used and must be |
lr.df |
degrees of freedom for the asymptotic chi-square approximation to the likelihood ratio statistic. Needed if and only if |
covar_to_test |
Covariate used for reporting direction (as Up or Down) of the test set and for collecting gene-level statistics. Either a string indicating the name of the covariate to use or an integer giving its associated position in the RHS of the mean_form argument. If a string, the name is matched to the predictors of the mean model, so users should ensure such a match would be unique. Not required and should be |
contrast_matrix |
Matrix of contrasts of regression parameters from the mean model to be tested. Each row will have separate gene-level and set-level statistics. Rownames of |
factor_name |
Name of the factor being tested by |
rho |
Inter-gene correlation value. If |
allow_neg_corr |
Should negative correlation values be allowed? If FALSE, negative correlations are set to zero (leads to conservative inference).. |
return_summary_fits |
If |
weights |
weights, as in |
control |
Control parameters for optimization in |
ncores |
Number of cores used for parallelization. Defaults to 1, meaning no parallelization of any kind is done. |
cluster_type |
Whether to use a "cluster of type "Fork" or "Sock". On Unix systems, "Fork" will likely improve performance. On Windows, only "Sock" will actually result in parallelized computing. |
chunk_size |
Number of genes to be sent to each parallel environment. Parallelization is more efficient, particularly with a large count matrix, when the count matrix is 'chunked' into some common size (e.g. 10, 50, 200). Defaults to 10. |
lb |
Should load balancing be used for parallelization? Users will likely want to set to FALSE for improved performance. |
Value
A list with the following elements: ##'
stats_gene_level_all:
Gives all gene-level statistics. Order matches the order of the inputted count matrix.p.vals_gene_level:
Gives raw (unadjusted) p-values associated withstats_gene_level_all
.set_p.val:
Unadjusted set-level p-values. Order matches the order of inputted test sets.set_p.val_FDR:
FDR-corrected (using the Benjamini-Hochberg procedure) set-level p-values. Order matches the order of inputted test sets.estimates_gene_level:
Gives the average logFC or contrast estimate for each gene.se_gene_level:
Standard error of the gene-level logFC values. Useful to construct gene-level summary statistics.estimates_set_level:
Gives the set-level average of the gene-level logFC or contrast estimates.direction:
Reports whether the test set tends to be Up or Down Regulated based on the sign ofestimates_set_level
.corr:
Vector of estimated inter-gene correlations for each test set. Order matches the order of inputted test sets.gene_level_loglik:
Vector of log-likelihood values for each gene. Values of NA indicates a model fitting or convergence problem for that gene.gene_error:
Vector indicating whether the particular gene produced an error during model fitting (TRUE) or not (FALSE).test_sets:
Vector of numeric indices corresponding to genes in each test set.ref_sets:
Vector of numeric indices corresponding to the genes in each reference set.gene_summary_fits:
Summary.glmmTMB objects for each gene from the alternative model (if return_summary_fits=TRUE)
Examples
# Set Parameters to Simulate Some Data
nind<-10;ncellsper<-rep(50,nind)
sigma.a<-.5;sigma.b<-.5;phi<-.1
alpha<-c(1,0,-.5,-2);beta<-c(2,0,-.1,.6)
beta2<-c(2,1,-.1,.6)
id.levels<-1:nind;nind<-length(id.levels)
id<-rep(id.levels,times=ncellsper)
sim.seed<-1234
# Simulate individual level covariates
t2d_sim<-rep(rbinom(nind,1,p=.4),times=ncellsper)
cdr_sim<-rbeta(sum(ncellsper),3,6)
age_sim<-rep(sample(c(20:60),size=nind,replace = TRUE),times=ncellsper)
# Construct design matrices
Z<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(Z)<-c("t2d_sim","age_sim","cdr_sim")
X<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(X)<-c("t2d_sim","age_sim","cdr_sim")
# Simulate Data, half under null half under alternative
sim_dat<-matrix(nrow=4,ncol=sum(ncellsper))
for(i in 1:nrow(sim_dat)){
if(i<2){# Gene Sets Under the Null
sim_dat[i,]<-simulate_zero_inflated_nb_random_effect_data(ncellsper,X,Z,alpha,beta2
,phi,sigma.a,sigma.b,id.levels=NULL)$Y
}else{# Gene Sets Under the Alternative
sim_dat[i,]<-simulate_zero_inflated_nb_random_effect_data(ncellsper,X,Z,alpha,beta
,phi,sigma.a,sigma.b,id.levels=NULL)$Y
}
}
rownames(sim_dat)<-paste("Gene",1:4)
# Run twosigmag
twosigmag(sim_dat,index_test = list(c(1,3)),all_as_ref = TRUE,mean_form = count~X
,zi_form = ~0,id=id,covar_to_test = "t2d_sim",statistic = "Z")