getLowLcpmCutoff {CpmERCCutoff} | R Documentation |
Function to empirically determine a log2 CPM cutoff based on ERCC RNA spike-in
Description
This function uses spike-in ERCC data, known control RNA probes,
and paired samples to fit a 3rd order polynomial to determine an expression
cutoff that meets the specified correlation between expected and observed
fold changes. The obs data frame is used as input for the observed
expression of the 92 ERCC RNA spike-ins and stores the coverage-normalized
log2 counts per million (LCPM) of reads that mapped to the respective ERCC
sequences. Typically, prior to LCPM calculation, the read count data are
normalized for any systematic differences in read coverage between samples,
for example, by using the TMM normalization method as implemented in
the edgeR package.
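As an illustration of the normalization step, the following is a minimal sketch of how a TMM-normalized LCPM table could be produced with edgeR. The count matrix, the ERCC row names, and the sample names are simulated placeholders, not data from this package. In practice the normalization factors would normally be estimated from the full count matrix (endogenous genes plus ERCCs) before the ERCC rows are extracted.

```r
library(edgeR)

set.seed(1)
# Simulated raw counts: 92 rows standing in for ERCC transcripts, 4 samples.
# Row and column names are hypothetical placeholders.
counts <- matrix(
  rnbinom(92 * 4, mu = 50, size = 2),
  nrow = 92,
  dimnames = list(sprintf("ERCC_%02d", 1:92), paste0("S", 1:4))
)

# Estimate TMM normalization factors, then compute log2 CPM values.
y <- DGEList(counts = counts)
y <- calcNormFactors(y, method = "TMM")
lcpm <- cpm(y, log = TRUE)

# One row per transcript, one column per sample, as expected for 'obs'.
obs_sketch <- as.data.frame(lcpm)
```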
For each bootstrap replicate, the paired samples are subsampled with
replacement. The mean LCPM of each ERCC transcript is determined by
first calculating the average LCPM value for each paired sample, and
then taking the mean of those averages. The ERCC transcripts are sorted
based on these means and are then grouped into n.bins ERCC bins.
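The resampling and binning step described above can be sketched roughly as follows. This is not the package's internal code. It assumes that the "average LCPM value for each paired sample" is the mean of the two samples in a pair and that bins hold roughly equal numbers of transcripts. All object names and the simulated data are hypothetical.

```r
set.seed(1)
n_ercc <- 12
# Simulated LCPM values for two sample pairs (A1/A2 and B1/B2).
obs_sim <- data.frame(
  A1 = rnorm(n_ercc, 5, 3), A2 = rnorm(n_ercc, 5, 3),
  B1 = rnorm(n_ercc, 5, 3), B2 = rnorm(n_ercc, 5, 3)
)
rownames(obs_sim) <- paste0("ERCC_", seq_len(n_ercc))
pairs_sim <- data.frame(mix1 = c("A1", "B1"), mix2 = c("A2", "B2"),
                        stringsAsFactors = FALSE)

# One bootstrap replicate: resample the pairs with replacement.
boot_pairs <- pairs_sim[sample(nrow(pairs_sim), replace = TRUE), ]

# Average LCPM within each resampled pair (one column per pair) ...
pair_avg <- sapply(seq_len(nrow(boot_pairs)), function(i) {
  (obs_sim[[boot_pairs$mix1[i]]] + obs_sim[[boot_pairs$mix2[i]]]) / 2
})
# ... then the mean of those averages for each transcript.
ercc_mean <- rowMeans(pair_avg)
names(ercc_mean) <- rownames(obs_sim)

# Sort transcripts by mean LCPM and split the ranks into n_bins groups.
n_bins <- 3
ord <- order(ercc_mean)
bin <- integer(n_ercc)
bin[ord] <- cut(seq_along(ord), breaks = n_bins, labels = FALSE)
```

With this layout, bin 1 holds the least abundant transcripts and bin `n_bins` the most abundant.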
Next, the Spearman correlation metric is used to calculate the association
between the empirical and expected log fold change (LFC) of the ERCCs in
each bin for each sample.
Additionally, the average LCPM of the ERCCs in each bin is calculated
for each sample. This yields a pair of values (the average LCPM and the
association value) for each sample and each ERCC bin. Outliers within
each ERCC bin are identified as values exceeding 1.5 IQR and are removed.
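A rough sketch of the per-bin association and the outlier screen is given below. It is illustrative only: the observed LFC values are simulated, the bin assignment is made up, and the outlier rule is written as the common Tukey fence (more than 1.5 IQR below the first quartile or above the third), which is an assumption about how ">1.5 IQR" is applied. The four expected ratios (4, 1, 0.67, 0.5) are the nominal Mix 1 to Mix 2 ratios of the ERCC subgroups.

```r
set.seed(2)
n_ercc <- 21
# Expected log2 fold changes cycling through the four ERCC subgroup ratios.
expected_lfc <- rep(log2(c(4, 1, 0.67, 0.5)), length.out = n_ercc)
# Hypothetical bin assignment: bin 1 = low abundance, bin 3 = high abundance.
bin <- rep(1:3, each = 7)
# Simulated observed LFC for one sample pair, noisier at low abundance.
obs_lfc <- expected_lfc + rnorm(n_ercc, sd = c(2, 0.5, 0.1)[bin])

# Spearman correlation between observed and expected LFC within each bin.
rho <- tapply(seq_len(n_ercc), bin, function(i) {
  cor(obs_lfc[i], expected_lfc[i], method = "spearman")
})

# Flag values lying more than 1.5 IQR outside the quartiles (assumed rule).
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Example: correlation values from several samples within one bin.
bin_rho <- c(0.90, 0.92, 0.88, 0.91, 0.20)
kept <- bin_rho[!is_outlier(bin_rho)]
```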
A 3rd order polynomial is fit with the explanatory variable being the
average LCPM and the response variable being the Spearman correlation
value between expected and observed log2 fold changes.
The fitted curve is used to identify the average LCPM value with a Spearman
correlation of cor.value. The results are output as an "empLCPM"
object as described below. The summary.empLCPM function can
be used to extract a summary of the results, and the plot.empLCPM
function to plot the results for visualization.
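The curve-fitting step can be illustrated with base R. The sketch below fits a cubic polynomial to simulated (average LCPM, Spearman correlation) points and solves for the LCPM at which the fitted curve reaches the target correlation. The data, the variable names, and the use of `uniroot` for the inversion are assumptions made for illustration and may differ from what the package does internally.

```r
set.seed(3)
# Simulated points: correlation rises with abundance and then saturates.
avg_lcpm <- seq(-2, 10, length.out = 60)
rho <- 1 - exp(-0.4 * (avg_lcpm + 3)) + rnorm(60, sd = 0.02)

# 3rd order polynomial: correlation as a function of average LCPM.
fit <- lm(rho ~ poly(avg_lcpm, 3, raw = TRUE))

# Solve fitted(x) = target for x within the observed LCPM range.
target <- 0.9
gap <- function(x) predict(fit, newdata = data.frame(avg_lcpm = x)) - target
cutoff <- uniroot(gap, interval = range(avg_lcpm))$root
cutoff
```

Repeating this over bootstrap replicates yields a distribution of cutoffs, from which an interval at the level given by `ci` could be taken, for example with `quantile()`.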
Usage
getLowLcpmCutoff(
obs,
exp,
pairs,
n.bins = 7,
rep = 1000,
ci = 0.95,
cor.value = 0.9,
remove.outliers = TRUE,
seed = 20220719
)
Arguments
obs |
A data frame of observed spike-in ERCC data. Each row is an ERCC transcript, and each column is a sample. Data are read coverage-normalized log2 counts per million (LCPM). |
exp |
A data frame of expected ERCC Mix 1 and Mix 2 ratios with a column titled 'expected_lfc_ratio' containing the expected log2 fold-change ratios. These data can be obtained from the 'ERCC Controls Analysis' manual located on Thermo Fisher's ERCC RNA Spike-In Mix product [page](https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt). The 'exp_input' data frame mirrors the fields shown in the ERCC manual. For the LCPM cutoff calculation, the last column, which contains the log2 expected fold change ratios, is used. Ensure that this column is titled "expected_lfc_ratio". See the example code below for formatting the data. |
pairs |
A 2-column data frame where each row indicates a sample pair with the first column indicating the sample that received ERCC spike-ins from Mix 1 and the second column indicating the sample receiving Mix 2. |
n.bins |
Integer. The number of abundance bins to create. Default is 7. |
rep |
Integer. The number of bootstrap replicates. Default is 1000. |
ci |
Numeric. The confidence level of the interval reported around the cutoff. Default is 0.95. |
cor.value |
Numeric. The desired Spearman correlation between the expected and empirical log2 fold changes across the ERCC transcripts. Default is 0.9. |
remove.outliers |
Logical. If TRUE (default), outliers are identified as values exceeding 1.5 IQR and are removed prior to fitting the polynomial. Set to FALSE to keep all points. |
seed |
Integer. The reproducibility seed. Default is 20220719. |
Value
An "empLCPM" object is returned, which contains the following named elements:
cutoff | a vector containing 3 values: the threshold value, the upper confidence interval value, and the lower confidence interval value. |
args | a key: value list of arguments that were provided. |
res | a list containing the main results and other information from the input. The summary.empLCPM function should be used to extract a summary table. |
See Also
summary.empLCPM, plot.empLCPM, print.empLCPM
Examples
library(CpmERCCutoff)
##############################
# Load and wrangle input data:
##############################
# Load observed read counts
data("obs_input")
# Set ERCC Ids to rownames
rownames(obs_input) = obs_input$X
# Load expected ERCC data:
data("exp_input")
# Order rows by ERCC ID.
exp_input = exp_input[order(exp_input$ercc_id), ]
rownames(exp_input) = exp_input$ercc_id
# Load metadata:
data("mta_dta")
# Pair samples that received ERCC Mix 1 with samples that received ERCC Mix 2.
# The resulting 2-column data frame is used for the 'pairs' argument.
# Note: the code here will depend on the details of the given experiment. In
# this example, the post-vaccination samples (which received Mix 2)
# for each subject are paired to their pre-vaccination samples (which
# received Mix 1).
mix1 = mta_dta[mta_dta$spike == 1, ]
mix2 = mta_dta[mta_dta$spike == 2, ]
# Match each Mix 2 sample to the Mix 1 sample from the same subject. Indexing
# into the Mix 1 subset keeps the match() positions aligned with its rows.
# Mix 1 goes in the first column and Mix 2 in the second.
pairs_input = cbind(
mix1[match(mix2$subid, mix1$subid), 'samid'],
mix2[, 'samid'])
###############################
# Run getLowLcpmCutoff Function:
###############################
# Note: Here we use `rep = 10` for only 10 bootstrap replicates
# to decrease the run time for this example; a larger number
# should be used in practice (default = 1000).
res = getLowLcpmCutoff(obs = obs_input,
exp = exp_input,
pairs = pairs_input,
n.bins = 7,
rep = 10,
cor.value = 0.9,
remove.outliers = TRUE,
seed = 20220719)
# Print a short summary of the results:
res
# Extract a summary table of the results:
summary(res)
# Create a plot of the results:
plot(x = res,
main = "Determination of Empirical Minimum Expression Cutoffs using ERCCs",
col.trend = "blue",
col.outlier = c("black", "red"))