fitMeanVarCurve {MAnorm2} | R Documentation |
Fit a Mean-Variance Curve
Description
Given a set of bioCond
objects, fitMeanVarCurve
robustly fits a curve capturing the mean-variance dependence across
the genomic intervals contained in them, by iteratively detecting outliers
and removing them from a regression procedure.
Usage
fitMeanVarCurve(
conds,
ratio.var = estimateVarRatio(conds),
method = c("parametric fit", "local regression"),
occupy.only = TRUE,
range.residual = c(1e-04, 15),
max.iter = 50,
init.coef = NULL,
args.lp = list(nn = 0.7),
args.locfit = list(),
verbose = TRUE
)
Arguments
conds |
A list of |
ratio.var |
A vector giving the initial variance ratio factors of the
|
method |
A character string indicating the method to be used for
fitting the curve. Either |
occupy.only |
A logical value. If set to |
range.residual |
A length-two vector specifying the range of residuals of non-outliers. |
max.iter |
Maximum number of iteration times allowed during the fitting procedure. |
init.coef |
An optional length-two vector specifying the initial
coefficients for applying the parametric fitting scheme. Only used when
In practice, chances are that |
args.lp |
A named list of extra arguments to |
args.locfit |
A named list of extra arguments to
|
verbose |
Whether to print processing messages during fitting the mean-variance curve? |
Details
This function performs a regression of the variance of ChIP-seq signal
intensity across replicate samples, using the mean intensity as the only
predictor. Each genomic interval contained in each of the supplied
bioCond
s that consists of two or more ChIP-seq samples serves
as an observation for the regression (the sample mean and sample variance of
the interval's signal intensities in the bioCond
are used as the
predictor value and response, respectively).
Note that bioCond
objects must be normalized to the same level before
a mean-variance curve could be fitted for them. You can choose to either
normalize the involved ChIP-seq samples all together (see
normalize
) or perform the normalization at the level of
bioCond
objects (see normBioCond
and also "Examples"
below).
Value
fitMeanVarCurve
returns the argument list of
bioCond
objects, each of which has an added (updated)
fit.info
field describing its mean-variance dependence. The field
is itself a list consisting of the following components:
calls
The two function calls for associating a mean variance curve with this
bioCond
and estimating the related parameters, respectively. The latter is only present if you have made an explicit call to some function (e.g.,estimatePriorDf
) for performing the parameter estimation.method
Method used for fitting the mean-variance curve.
predict
A function representing the fitted curve, which accepts a vector of means and returns the predicted variances.
mvcID
ID of the fitted mean-variance curve.
df.prior
Number of prior degrees of freedom assessing the goodness of fit of the mean-variance curve.
ratio.var
Variance ratio factor of this
bioCond
.
Each bioCond
object in the returned list has the same values of
all these components but the ratio.var
.
mvcID
is used to label each fitted mean-variance curve. Each
call to fitMeanVarCurve
results in a unique ID. Thus we assert
that different bioCond
objects are associated with the same
mean-variance curve if and only if they have the same mvcID
. This
is useful if you are to
call differential intervals between two conditions
via diffTest
, which requires the two
bioCond
objects being compared are associated with the same
mean-variance curve.
Besides, if there exist bioCond
objects that contain only one
ChIP-seq sample, an attribute named "no.rep.rv"
will be added to
the returned list, recording the variance ratio factor of no-replicate
conditions. Note that the method for estimating the variance ratio
factor of no-replicate conditions is specifically designed (see
estimatePriorDf
for details).
Variance Ratio Factor
fitMeanVarCurve
applies a regression
process to the observed means and variances of signal intensities of
genomic intervals. The regression result serves as a model capturing the
mean-variance trend across intervals.
Notably, each genomic interval in each
bioCond
object that contains replicate samples serves as
an observation point for the regression.
Variance ratio factor is designed to account for the global difference
in variation level of signal intensities between conditions. Each
bioCond
has its own variance ratio factor, and method has been
developed to robustly estimate the relative (scaled) variance ratio
factors of a given set of bioCond
s
(see estimateVarRatio
for details).
Technically, observed variances from each bioCond
are
scaled based on the corresponding (relative) variance ratio factor, so
that the scaled variances from different bioCond
s are comparable
to each other. Finally, the scaled variances from all the provided
bioCond
s are pooled together constituting the vector of responses
for the regression process. Note that the variance ratio factors will be
adjusted afterwards, according to the fitted mean-variance curve and its
goodness of fit (see "Assessing Goodness of Fit" below).
Methods for Fitting a Mean-Variance Curve
There are currently two
candidate methods for performing the regression:
"parametric fit"
(default) and "local regression"
.
Besides, the argument occupy.only
controls whether to use all
genomic intervals or only the occupied ones for the regression process.
Typically, ChIP-seq signal intensities at
non-occupied intervals are much
lower than those at occupied ones. Accordingly, variation levels of the
former are significantly higher than the latter (provided that a log
transformation has been applied to raw read counts before performing
the normalization, which is the default setting of
normalize
). This is because,
for the genomic intervals having
a low-level abundance of ChIP-seq reads, only a little fluctuation of
read count could give rise to a dramatic fold change. If a mean-variance
scatter plot is drawn mapping all genomic intervals to a plane, the
points corresponding to non-occupied intervals will be largely separated
from those of occupied intervals.
In practice, the ChIP-seq signals located in
non-occupied intervals result
primarily from background noise and therefore have much lower
signal-to-noise ratios than those in occupied intervals. As a result,
signals observed in the two types of intervals
almost always have distinct
data characteristics from one another. In particular, the mean-variance
dependence associated with non-occupied intervals is not as regular as
observed from occupied intervals. In light of these observations, the
recommended setting of occupy.only
may be different across calls
of fitMeanVarCurve
depending on the exact method
chosen
for performing the regression. See the following for details.
For the method of "parametric fit"
, it adopts the parametric form
of var = c1 + c2 / (2 ^ mean)
, where c1
and c2
are
coefficients to be estimated. More specifically, it fits a gamma-family
generalized linear model with the identity link. The form is
deduced by assuming a quadratic mean-variance relationship for raw read
counts and applying the delta method to log2 transformation (see also
"References"). When using this method, one typically sets
occupy.only
to TRUE
(the default). Otherwise, the
GLM fitting procedure may fail to estimate the coefficients, or the
estimation results may be significantly biased towards the
characteristics of ChIP-seq signals at non-occupied intervals (which is
undesired since these signals are mostly background noises). Note also
that applying this method is most recommended when ChIP-seq samples
within each single bioCond
are associated with a low level
of signal variation (e.g., when these samples are biological replicates
of a cell line; see also "Examples" below),
since in such cases ChIP-seq data should be of high regularity
and, thus, the parametric form could be safely expected to hold.
Moreover, as the variation level across ChIP-seq samples increases,
the possibility becomes higher that the GLM fitting procedure fails.
For the method of "local regression"
, it directly passes the
observed means and scaled variances to the locfit
function, specifying the family
to be "gamma"
. When using
this method, setting occupy.only
to TRUE
almost certainly
leads to an exaggerated variance prediction for small signal intensities
(due to the underlying algorithm for extrapolation)
and, thus, a reduction in statistical power for detecting differential
intervals between conditions. On the other hand,
involving non-occupied intervals
in the fitting process might result in an underestimated number of prior
degrees of freedom (see "Assessing Goodness of Fit" below). This is
because the ChIP-seq signals located in non-occupied intervals generally
have low signal-to-noise ratios, and are therefore associated with less
data regularity than the signals in occupied intervals. One way to
compensate that is to re-estimate the prior df using
only the occupied intervals after fitting the mean-variance curve (see
estimatePriorDf
and "Examples" below), which is also the
most recommended strategy for employing a local regression. Note also
that smoothness of the resulting curve could be adjusted by modifying
the nn
variable in args.lp
(see also
lp
). By default, nn=0.7
is adopted, which
is also the default setting of lp
at the time of
developing this package.
Iteration Scheme for a Robust Regression
Whichever method is
selected, fitMeanVarCurve
adopts an iteration scheme to enhance
the robustness of fitting the mean-variance trend. More specifically,
it iteratively detects and removes outliers from a regression procedure.
The process converges as soon as the set of outliers fixes.
Residual of each observation is calculated as the ratio of its observed
variance to the fitted one, and those observations with a residual
falling outside range.residual
shall be considered as outliers.
The default value of range.residual
works well for chi-squared
distributions with a broad range of numbers of degrees of freedom (see
also "References").
Assessing Goodness of Fit
Each fitted mean-variance curve is
associated with a quantity assessing its goodness of fit, which is the
number of prior degrees of freedom. Roughly speaking, the closer the
observed mean-variance points are to the curve, the larger the resulting
prior df of the curve, and we get more confidence in the curve.
To be noted, the initial variance ratio factors for scaling the sample
variances from different bioCond
objects will be adjusted
according to the estimated prior df (based on the
underlying distributional theory).
These adjusted variance ratio factors are
exactly the ones stored in the returned bioCond
objects.
See estimatePriorDf
for details about estimating prior
df and accordingly adjusting variance ratio factors.
Note also that fitMeanVarCurve
uses exactly the set of intervals that
are utilized for fitting the mean-variance curve to estimate the prior
df and adjust the variance ratio factors (the set is controlled by the
argument occupy.only
; see also
"Methods for Fitting a Mean-Variance Curve" above).
Prior df is primarily used for the following
differential analysis. We call a variance read from a mean-variance
curve a prior one. In cases where you use
diffTest
to call differential intervals
between two bioCond
s, the final variance estimate associated
with each individual interval is obtained by averaging its observed
and prior variances, weighted by their respective numbers of degrees
of freedom.
Extending the Application Scope of a Mean-Variance Curve
With a
set of bioCond
objects at hand, you might want to use only
part of them to fit the mean-variance curve.
For example, suppose ChIP-seq samples
stored in some bioCond
objects are associated with a low data
regularity (due to, e.g., bad sample qualities), and you don't
want to include these samples when fitting the curve. One way to work
around it is to exclude the bioCond
objects from the fitting
process, extend the application scope of the fitted curve (via
extendMeanVarCurve
) so that it applies to the excluded
bioCond
s as well, and (optionally) re-assess the overall goodness
of fit via estimatePriorDf
(see also the "Examples"
given for extendMeanVarCurve
).
There is another scenario where extending a mean-variance curve could be
useful. In practice, chances are that only one ChIP-seq sample is
available for each of two conditions to be compared. To make the
analysis possible, one way is to treat the two samples as replicates and
fit a mean-variance curve accordingly. The fitted curve can then be
associated with the two conditions each containing a single sample (via
extendMeanVarCurve
),
and differential intervals between them
can be subsequently called following a regular routine (see "Examples"
provided in extendMeanVarCurve
).
To be noted, this method is most suited when the two
conditions being compared are close. Otherwise, the method may lead to
an over-conserved p-value calculation.
References
Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3.
Anders, S. and W. Huber, Differential expression analysis for sequence count data. Genome Biol, 2010. 11(10): p. R106.
Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
Tu, S., et al., MAnorm2 for quantitatively comparing groups of ChIP-seq samples. Genome Res, 2021. 31(1): p. 131-145.
See Also
bioCond
for creating a bioCond
object from a
set of ChIP-seq samples; normalize
for performing an MA
normalization on ChIP-seq samples; normalizeBySizeFactors
for normalizing ChIP-seq samples based on their size factors;
normBioCond
for performing an MA normalization on
bioCond
objects; normBioCondBySizeFactors
for
normalizing bioCond
objects based on their size factors.
estimateVarRatio
for estimating the relative variance
ratio factors of a set of bioCond
s; varRatio
for a
formal description of variance ratio factor;
estimatePriorDf
for estimating the number of prior degrees
of freedom as well as adjusting variance ratio factors accordingly;
estimatePriorDfRobust
for a robust version of
estimatePriorDf
.
setMeanVarCurve
for setting the mean-variance curve of a
set of bioCond
objects; extendMeanVarCurve
for
extending the application scope of a fitted mean-variance curve to the
bioCond
s not used to fit it; plotMeanVarCurve
for
plotting a mean-variance curve.
distBioCond
for robustly measuring the distances between
ChIP-seq samples in a bioCond
by considering its mean-variance
trend; vstBioCond
for applying a variance-stabilizing
transformation to signal intensities of samples in a bioCond
.
diffTest
for calling differential
intervals between two bioCond
objects; aovBioCond
for calling differential intervals across multiple bioCond
s;
varTestBioCond
for calling hypervariable and invariant
intervals across ChIP-seq samples contained in a bioCond
.
Examples
data(H3K27Ac, package = "MAnorm2")
attr(H3K27Ac, "metaInfo")
## Fit a mean-variance curve treating each cell line (i.e., individual) as a
## biological condition.
# Perform the MA normalization and construct bioConds to represent cell
# lines.
norm <- normalize(H3K27Ac, 4, 9)
norm <- normalize(norm, 5:6, 10:11)
norm <- normalize(norm, 7:8, 12:13)
conds <- list(GM12890 = bioCond(norm[4], norm[9], name = "GM12890"),
GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)
# Variations in ChIP-seq signals across biological replicates of a cell line
# are generally of a low level, and their relationship with the mean signal
# intensities is expected to be well modeled by the presumed parametric
# form.
conds <- fitMeanVarCurve(conds, method = "parametric", occupy.only = TRUE)
summary(conds[[1]])
plotMeanVarCurve(conds, subset = "occupied")
## Not run:
# Sometimes the parametric fitting algorithm cannot automatically deduce
# proper starting values for estimating the coefficients.
fitMeanVarCurve(conds[3], method = "parametric", occupy.only = TRUE)
## End(Not run)
# In such cases, explicitly specify the initial values of the coefficients.
fitMeanVarCurve(conds[3], method = "parametric", occupy.only = TRUE,
init.coef = c(0.1, 10))
## Fit a mean-variance curve treating each gender as a biological condition,
## and each individual a replicate.
# Group individuals into bioConds based on their genders.
female <- cmbBioCond(conds[c(1, 3)], name = "female")
male <- cmbBioCond(conds[2], name = "male")
# The dependence of variance of ChIP-seq signal intensity across individuals
# on the mean signal intensity is not as regular as in the case for modeling
# biological replicates of cell lines. Better use the local regression to
# adaptively capture the mean-variance trend.
genders <- list(female = female, male = male)
genders1 <- fitMeanVarCurve(genders, method = "local", occupy.only = TRUE)
genders2 <- fitMeanVarCurve(genders, method = "local", occupy.only = FALSE)
# Suppose the local regression is performed using only the occupied genomic
# intervals as input. Good chances are that the extrapolation algorithm
# implemented in the regression method will produce over-estimated variances
# for the non-occupied intervals.
plotMeanVarCurve(genders1, subset = "all")
plotMeanVarCurve(genders2, subset = "all")
plotMeanVarCurve(genders1, subset = "non-occupied")
plotMeanVarCurve(genders2, subset = "non-occupied")
# On the other hand, applying the local regression on all genomic intervals
# may considerably reduce the estimated number of prior degrees of freedom
# associated with the fitted mean-variance curve, as ChIP-seq signals in the
# non-occupied intervals are generally of less data regularity compared with
# those in the occupied intervals.
summary(genders1$female)
summary(genders2$female)
# To split the difference, fit the mean-variance curve on all genomic
# intervals and re-estimate the number of prior degrees of freedom using
# only the occupied intervals, which is also the most recommended strategy
# in practice.
genders3 <- estimatePriorDf(genders2, occupy.only = TRUE)
plotMeanVarCurve(genders3, subset = "all")
plotMeanVarCurve(genders3, subset = "non-occupied")
summary(genders3$female)