
getLowLcpmCutoff {CpmERCCutoff}

R Documentation

Function to empirically determine a log2 CPM cutoff based on ERCC RNA spike-in

Description

This function uses spike-in ERCC data (known control RNA probes) and paired samples to fit a third-order polynomial and determine an expression cutoff that meets the specified correlation between expected and observed fold changes. The obs data frame is the input for the observed expression of the 92 ERCC RNA spike-ins; it stores the coverage-normalized log2 counts per million (LCPM) of the reads that mapped to the respective ERCC sequences. Typically, prior to LCPM calculation, the read count data are normalized for any systematic differences in read coverage between samples, for example, by using the TMM normalization method as implemented in the edgeR package.
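To make the LCPM scale concrete, the following is a minimal base R sketch of a log2 counts-per-million calculation on a made-up count matrix. It is not the package's own preprocessing: the matrix, the row and column names, and the prior count of 0.5 are illustrative assumptions, and no TMM step is performed. In practice, edgeR's `calcNormFactors()` followed by `cpm(..., log = TRUE)` would typically be used instead.

```r
# Toy count matrix: rows are transcripts, columns are samples (made-up values).
counts <- matrix(
  c(10, 200, 3000, 5,
    20, 400, 6000, 12,
    15, 250, 2800, 0),
  nrow = 4,
  dimnames = list(paste0("tx", 1:4), c("s1", "s2", "s3"))
)

# Library size per sample. With TMM, these would be multiplied by the
# per-sample normalization factors to give effective library sizes.
lib_size <- colSums(counts)

# A small prior count avoids log2(0); 0.5 is an arbitrary choice here.
prior <- 0.5

# Divide each sample's counts by its library size, scale to per-million,
# and take log2. The result has the same layout as `counts`.
lcpm <- log2(t((t(counts) + prior) / lib_size) * 1e6)
```

The resulting matrix has one row per transcript and one column per sample, which is the layout expected for obs once converted to a data frame.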

For each bootstrap replicate, the paired samples are subsampled with replacement. The mean LCPM of each ERCC transcript is determined by first calculating the average LCPM value for each sample pair, and then taking the mean of those averages. The ERCC transcripts are sorted based on these means and are then grouped into n.bins ERCC bins. Next, the Spearman correlation metric is used to calculate the association between the empirical and expected log fold change (LFC) of the ERCCs in each bin for each sample. Additionally, the average LCPM for the ERCCs in each bin is calculated for each sample. This leads to a pair of values, the average LCPM and the association value, for each sample and each ERCC bin. Within each ERCC bin, points exceeding 1.5 IQR are identified as outliers and removed. A third-order polynomial is fit with the explanatory variable being the average LCPM and the response variable being the Spearman correlation value between expected and observed log2 fold changes. The fitted curve is used to identify the average LCPM value with a Spearman correlation of cor.value. The results are output as an "empLCPM" object as described below. The summary.empLCPM function can be used to extract a summary of the results, and the plot.empLCPM function can be used to plot the results for visualization.
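The core of the procedure above (a Spearman correlation per bin, a third-order polynomial fit, and a lookup of the LCPM at the target correlation) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: all data values are invented, the object names are hypothetical, and the use of `uniroot()` to invert the fitted curve is one possible choice.

```r
# Step 1: Spearman correlation for one bin of one sample pair.
# Expected and observed log2 fold changes for four ERCCs (made-up values).
expected_lfc <- c(2, 0, -0.58, -1)
observed_lfc <- c(1.8, 0.1, -0.4, -1.2)
bin_cor <- cor(expected_lfc, observed_lfc, method = "spearman")

# Step 2: one (average LCPM, correlation) point per bin and sample.
# These eight points are invented to mimic correlation rising with abundance.
pts <- data.frame(
  avg_lcpm = c(-1, 0, 1, 2, 3, 4, 5, 6),
  spearman = c(0.20, 0.45, 0.62, 0.75, 0.85, 0.92, 0.96, 0.98)
)

# Step 3: fit a third-order polynomial of correlation on average LCPM.
fit <- lm(spearman ~ poly(avg_lcpm, 3, raw = TRUE), data = pts)

# Step 4: find the average LCPM at which the fitted curve equals the target
# correlation (the role played by cor.value).
cor_value <- 0.9
gap <- function(x) {
  predict(fit, newdata = data.frame(avg_lcpm = x)) - cor_value
}
cutoff <- uniroot(gap, interval = range(pts$avg_lcpm))$root
```

Root finding with `uniroot()` needs the fitted curve to cross the target within the search interval; if the data never reach cor.value, no cutoff can be read off the curve in this way.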

Usage

getLowLcpmCutoff(
  obs,
  exp,
  pairs,
  n.bins = 7,
  rep = 1000,
  ci = 0.95,
  cor.value = 0.9,
  remove.outliers = TRUE,
  seed = 20220719
)

Arguments

`obs`	A data frame of observed spike-in ERCC data. Each row is an ERCC transcript, and each column is a sample. Data are read coverage-normalized log2 counts per million (LCPM).
`exp`	A data frame of expected ERCC Mix 1 and Mix 2 ratios with a column titled 'expected_lfc_ratio' containing the expected log2 fold-change ratios. These data can be obtained from the 'ERCC Controls Analysis' manual located on Thermo Fisher's ERCC RNA Spike-In Mix product [page](https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt). The 'exp_input' data frame mirrors the fields shown in the ERCC manual. For the LCPM cutoff calculation, the last column, which contains the expected log2 fold-change ratios, is used. Ensure that this column is titled "expected_lfc_ratio". See the example code below for formatting the data.
`pairs`	A 2-column data frame where each row indicates a sample pair with the first column indicating the sample that received ERCC spike-ins from Mix 1 and the second column indicating the sample receiving Mix 2.
`n.bins`	Integer. The number of abundance bins to create. Default is 7.
`rep`	Integer. The number of bootstrap replicates. Default is 1000.
`ci`	Numeric. The confidence level of the interval reported for the cutoff. Default is 0.95.
`cor.value`	Numeric. The desired Spearman correlation between the expected and observed log2 fold changes across the ERCC transcripts. Default is 0.9.
`remove.outliers`	Logical. If TRUE (default), outliers are identified as exceeding 1.5 IQR and are removed prior to fitting the polynomial. Set to FALSE to keep all points.
`seed`	Integer. The reproducibility seed. Default is 20220719.
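
As an illustration of the kind of filtering that remove.outliers controls, the sketch below applies the conventional Tukey rule, flagging values more than 1.5 IQR below the first quartile or above the third quartile. Whether the package uses exactly this rule is an assumption; the correlation values and object names are made up.

```r
# Made-up Spearman correlations for one ERCC bin across samples.
bin_cor <- c(0.91, 0.93, 0.90, 0.94, 0.92, 0.35)

# First and third quartiles and the interquartile range.
q <- quantile(bin_cor, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Flag points beyond 1.5 IQR from the quartiles (assumed Tukey fences).
is_outlier <- bin_cor < q[1] - 1.5 * iqr | bin_cor > q[2] + 1.5 * iqr

# Points kept for the polynomial fit.
kept <- bin_cor[!is_outlier]
```

Here the low value 0.35 is the only point flagged, so five points remain for the fit.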

Value

An "empLCPM" object is returned, which contains the following named elements:

`cutoff`	a vector containing 3 values: the threshold value, the upper confidence limit,
	and the lower confidence limit.
`args`	a key: value list of arguments that were provided.
`res`	a list containing the main results and other information from the input.
	The `summary.empLCPM` function should be used to extract a summary table.
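
To show how a three-value cutoff vector could arise from bootstrap replicates, here is a hedged sketch using a percentile interval. The percentile method, the use of the median as the threshold, the element names, and the simulated replicate values are all assumptions made for illustration; the package may compute these quantities differently.

```r
set.seed(20220719)

# Stand-in for the per-replicate cutoffs. In a real run, each value would come
# from refitting the curve on sample pairs resampled with replacement.
boot_cutoffs <- rnorm(1000, mean = 3.5, sd = 0.2)

# Percentile limits for a confidence level of `ci`.
ci <- 0.95
alpha <- 1 - ci
limits <- quantile(boot_cutoffs, probs = c(alpha / 2, 1 - alpha / 2),
                   names = FALSE)

# Same three-value order as described for `cutoff`: threshold, upper, lower.
cutoff <- c(threshold = median(boot_cutoffs),
            upper = limits[2],
            lower = limits[1])
```

A larger number of replicates gives more stable limits, which is why a small rep is only suitable for quick demonstrations.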

Examples

library(CpmERCCutoff)
##############################
# Load and wrangle input data:
##############################
# Load observed ERCC data:
data("obs_input")

# Set ERCC Ids to rownames
rownames(obs_input) = obs_input$X

# Load expected ERCC data:
data("exp_input")

# Order rows by ERCC ID.
exp_input = exp_input[order(exp_input$ercc_id), ]
rownames(exp_input) = exp_input$ercc_id

# Load metadata:
data("mta_dta")

# Pair samples that received ERCC Mix 1 with samples that received ERCC Mix 2.
# The resulting 2-column data frame is used for the 'pairs' argument.
# Note: the code here will depend on the details of the given experiment. In
#       this example, the post-vaccination samples (which received Mix 2)
#       for each subject are paired to their pre-vaccination samples (which
#       received Mix 1).
pairs_input = cbind(
  mta_dta[mta_dta$spike == 2, 'samid'],
  mta_dta[match(mta_dta[mta_dta$spike == 2, 'subid'],
                mta_dta[mta_dta$spike == 1,'subid']), 'samid'])
# Put Mix 1 in the first column and Mix 2 in the second.
pairs_input = pairs_input[, c(2, 1)]

###############################
# Run getLowLcpmCutoff Function:
###############################
# Note: Here we use `rep = 10` for only 10 bootstrap replicates
#       to decrease the run time for this example; a larger number
#       should be used in practice (default = 1000).
res = getLowLcpmCutoff(obs = obs_input,
                       exp = exp_input,
                       pairs = pairs_input,
                       n.bins = 7,
                       rep = 10,
                       cor.value = 0.9,
                       remove.outliers = TRUE,
                       seed = 20220719)

# Print a short summary of the results:
res

# Extract a summary table of the results:
summary(res)

# Create a plot of the results:
plot(x = res,
     main = "Determination of Empirical Minimum Expression Cutoffs using ERCCs",
     col.trend = "blue",
     col.outlier = c("black", "red"))