R: Adjust for batch effects

adjust_batch {batchtma}

R Documentation

Adjust for batch effects

Description

`adjust_batch` generates biomarker levels for the variable(s) `markers` in the dataset `data` that are corrected (adjusted) for batch effects, i.e., differential measurement error between levels of `batch`.

Usage

adjust_batch(
  data,
  markers,
  batch,
  method = c("simple", "standardize", "ipw", "quantreg", "quantnorm"),
  confounders = NULL,
  suffix = "_adjX",
  ipw_truncate = c(0.025, 0.975),
  quantreg_tau = c(0.25, 0.75),
  quantreg_method = "fn"
)

Arguments

`data`	Data set
`markers`	Variable name(s) to batch-adjust. Select multiple variables with tidy evaluation, e.g., `markers = starts_with("biomarker")`.
`batch`	Categorical variable indicating batch.
`method`	Method for batch effect correction:
	`simple`: Simple means per batch will be subtracted. No adjustment for confounders.
	`standardize`: Means per batch after standardization for confounders in linear models will be subtracted. If no `confounders` are supplied, `method = simple` is equivalent and will be used.
	`ipw`: Means per batch after inverse-probability weighting for assignment to a specific batch in multinomial models, conditional on confounders, will be subtracted. Stabilized weights are used, truncated at quantiles defined by the `ipw_truncate` parameters. If no `confounders` are supplied, `method = simple` is equivalent and will be used.
	`quantreg`: Lower quantiles (default: 25th percentile) and ranges between a lower and an upper quantile (default: 75th percentile) will be unified between batches, allowing for differences in both parameters due to confounders. Set the two quantiles using the `quantreg_tau` parameters.
	`quantnorm`: Quantile normalization between batches. No adjustment for confounders.
`confounders`	Optional: Confounders, i.e. determinants of biomarker levels that differ between batches. Only used if `method = standardize`, `method = ipw`, or `method = quantreg`, i.e. methods that attempt to retain some of these "true" between-batch differences. Select multiple confounders with tidy evaluation, e.g., `confounders = c(age, age_squared, sex)`.
`suffix`	Optional: What string to append to variable names after batch adjustment. Defaults to `"_adjX"`, with `X` depending on `method`:
	`_adj2` from `method = simple`
	`_adj3` from `method = standardize`
	`_adj4` from `method = ipw`
	`_adj5` from `method = quantreg`
	`_adj6` from `method = quantnorm`
`ipw_truncate`	Optional and used for `method = ipw` only: Lower and upper quantiles for truncation of stabilized weights. Defaults to `c(0.025, 0.975)`.
`quantreg_tau`	Optional and used for `method = quantreg` only: Quantiles to scale. Defaults to `c(0.25, 0.75)`.
`quantreg_method`	Optional and used for `method = quantreg` only: Algorithmic method to fit quantile regression. Defaults to `"fn"`. See parameter `method` of `rq`.

Details

If no true differences between batches are expected, because samples have been randomized to batches, then a method that returns adjusted values with equal means (`method = simple`) or with equal rank values (`method = quantnorm`) for all batches is appropriate.
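The mean-only idea can be illustrated in base R. This is a minimal sketch, not the package's own code: it assumes that the batch effect is estimated as each batch's mean minus the overall mean, and the column name `biomarker_adj` is made up for the illustration.

```r
# Base-R sketch of mean-only batch adjustment (illustration only).
set.seed(1)
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(n = 20, max = 5)
)

# Estimated batch effect: batch mean minus overall mean
# (centering on the overall mean is an assumption of this sketch).
batch_means <- tapply(df$biomarker, df$tma, mean, na.rm = TRUE)
batch_effect <- batch_means - mean(df$biomarker, na.rm = TRUE)

# Subtract the batch effect of each observation's own batch.
df$biomarker_adj <- df$biomarker - unname(batch_effect[as.character(df$tma)])

# After adjustment, both batches have the same mean.
tapply(df$biomarker_adj, df$tma, mean)
```

Because only a constant per batch is subtracted, the spread of values within each batch is unchanged.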

If the distribution of determinants of biomarker values (confounders) differs between batches, then a method that retains these "true" differences between batches while adjusting for batch effects may be appropriate: `method = standardize` and `method = ipw` address means; `method = quantreg` addresses lower values and dynamic range separately.
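A rough base-R sketch of the standardization idea follows. It assumes marginal standardization over a linear model with batch and confounder as predictors; the simulated data, the effect sizes, and the name `biomarker_adj` are hypothetical, and the package's implementation may differ in detail.

```r
# Base-R sketch of confounder-standardized batch adjustment (illustration only).
set.seed(2)
n <- 200
df <- data.frame(tma = factor(rep(1:2, each = n / 2)))
# Batch 2 has higher confounder values (a "true" difference) ...
df$confounder <- rnorm(n, mean = ifelse(df$tma == "2", 1, 0))
# ... plus an artificial batch effect of +3 on the biomarker.
df$biomarker <- 2 * df$confounder + 3 * (df$tma == "2") + rnorm(n)

fit <- lm(biomarker ~ tma + confounder, data = df)

# Marginal standardization: predict for all observations as if they were
# in batch b, keeping each observation's own confounder value.
std_means <- sapply(levels(df$tma), function(b) {
  newdata <- df
  newdata$tma <- factor(b, levels = levels(df$tma))
  mean(predict(fit, newdata = newdata))
})
batch_effect <- std_means - mean(std_means)

df$biomarker_adj <- df$biomarker - unname(batch_effect[as.character(df$tma)])

# The batch coefficient is now approximately zero ...
coef(lm(biomarker_adj ~ tma + confounder, data = df))["tma2"]
# ... while batch means still differ because of the confounder.
tapply(df$biomarker_adj, df$tma, mean)
```

In this sketch the remaining difference in batch means reflects only the confounder, which is the "true" difference the method is meant to retain.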

Which method to choose depends on the properties of batch effects (affecting means or also variance?) and the presence and strength of confounding. For the two mean-only confounder-adjusted methods, the choice may depend on whether the confounder–batch association (`method = ipw`) or the confounder–biomarker association (`method = standardize`) can be modeled better. Generally, if batch effects are present, any adjustment method tends to perform better than no adjustment in reducing bias and increasing between-study reproducibility. See references.
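To show what modeling the confounder–batch association means, here is a simplified base-R sketch of inverse-probability weighting. It is an assumption-laden illustration: it uses only two batches, so a logistic model stands in for the multinomial model, and it truncates the weights across all observations at the default `ipw_truncate` quantiles. The simulated data and the names `in_batch2` and `biomarker_adj` are hypothetical.

```r
# Base-R sketch of IPW-based batch adjustment for two batches (illustration only).
set.seed(3)
n <- 200
df <- data.frame(tma = rep(1:2, each = n / 2))
df$confounder <- rnorm(n, mean = ifelse(df$tma == 2, 1, 0))
df$biomarker <- 2 * df$confounder + 3 * (df$tma == 2) + rnorm(n)

# Model the probability of batch membership given the confounder.
df$in_batch2 <- as.integer(df$tma == 2)
ps_fit <- glm(in_batch2 ~ confounder, family = binomial, data = df)
p_batch2 <- unname(predict(ps_fit, type = "response"))

# Probability of the batch actually observed, conditional and marginal.
p_own <- ifelse(df$tma == 2, p_batch2, 1 - p_batch2)
p_marginal <- ifelse(df$tma == 2, mean(df$tma == 2), mean(df$tma == 1))

# Stabilized weights, truncated at the 2.5th and 97.5th percentiles.
w <- p_marginal / p_own
lim <- unname(quantile(w, probs = c(0.025, 0.975)))
w <- pmin(pmax(w, lim[1]), lim[2])

# Weighted batch means, centered to obtain the estimated batch effects.
ipw_means <- sapply(split(seq_len(n), df$tma), function(i) {
  weighted.mean(df$biomarker[i], w[i])
})
batch_effect <- ipw_means - mean(ipw_means)

df$biomarker_adj <- df$biomarker - unname(batch_effect[as.character(df$tma)])
```

The weighted difference between batches, `diff(ipw_means)`, should be closer to the simulated batch effect than the unweighted difference in batch means, because weighting balances the confounder across batches.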

All adjustment approaches except `method = quantnorm` are based on linear models. It is recommended that variables for `markers` and `confounders` first be transformed as necessary (e.g., log transformations or splines). Scaling or mean centering are not necessary, and adjusted values are returned on the original scale. Parameters `markers`, `batch`, and `confounders` support tidy evaluation.
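As an illustration of transforming variables first, the following sketch log-transforms a right-skewed marker and adds a squared confounder term before adjustment. The data and the variable names `biomarker_log`, `age`, and `age_squared` are made up; the call to `adjust_batch` only runs if the batchtma package is installed.

```r
# Transform marker and confounders before batch adjustment (illustration only).
set.seed(4)
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = exp(rep(1:2, times = 10) + rnorm(20)),  # right-skewed marker
  age = runif(20, min = 40, max = 80)
)

df$biomarker_log <- log(df$biomarker)  # log transformation of the marker
df$age_squared <- df$age^2             # simple non-linear confounder term

if (requireNamespace("batchtma", quietly = TRUE)) {
  df_adj <- batchtma::adjust_batch(
    data = df,
    markers = biomarker_log,
    batch = tma,
    method = standardize,
    confounders = c(age, age_squared)
  )
  # Adjusted values are on the scale that was passed in (here: the log scale);
  # back-transform if values on the original measurement scale are needed.
  df_adj$biomarker_adj <- exp(df_adj$biomarker_log_adj3)
}
```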

Observations with missing values for the `markers` or `confounders` are ignored when adjustment parameters are estimated, as are empty batches. Batch effect-adjusted values for observations with existing marker values but missing confounders are based on adjustment parameters derived from the other observations in the same batch that have non-missing confounders.

Value

The `data` dataset with batch effect-adjusted variable(s) added at the end. Model diagnostics, using the attribute `.batchtma` of this dataset, are available via the `diagnose_models` function.
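A short sketch of inspecting the returned object, assuming the batchtma package is installed and that `diagnose_models` accepts the adjusted data set as its first argument (check its own help page for the exact interface):

```r
# Inspect the return value of adjust_batch (illustration only).
set.seed(5)
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(n = 20, max = 5)
)

if (requireNamespace("batchtma", quietly = TRUE)) {
  df_adj <- batchtma::adjust_batch(
    data = df,
    markers = biomarker,
    batch = tma,
    method = simple
  )
  # The adjusted variable is appended after the original columns.
  print(names(df_adj))
  # Assumed interface: pass the adjusted data set to diagnose_models().
  diagnostics <- batchtma::diagnose_models(df_adj)
  str(diagnostics, max.level = 1)
}
```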

Author(s)

Konrad H. Stopsack

References

Stopsack KH, Tyekucheva S, Wang M, Gerke TA, Vaselkiv JB, Penney KL, Kantoff PW, Finn SP, Fiorentino M, Loda M, Lotan TL, Parmigiani G+, Mucci LA+ (+ equal contribution). Extent, impact, and mitigation of batch effects in tumor biomarker studies using tissue microarrays. bioRxiv 2021.06.29.450369; doi: https://doi.org/10.1101/2021.06.29.450369 (This R package, all methods descriptions, and further recommendations.)

Rosner B, Cook N, Portman R, Daniels S, Falkner B. Determination of blood pressure percentiles in normal-weight children: some methodological issues. Am J Epidemiol 2008;167(6):653-66. (Basis for method = standardize)

Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003;19:185–193. (method = quantnorm)

Examples

# Data frame with two batches
# Batch 2 has higher values of biomarker and confounder
set.seed(123)  # make the random draws reproducible
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) +
    runif(max = 5, n = 20),
  confounder = rep(0:1, times = 10) +
    runif(max = 10, n = 20)
)

# Adjust for batch effects
# Using simple means, ignoring the confounder:
adjust_batch(
  data = df,
  markers = biomarker,
  batch = tma,
  method = simple
)
# Returns data set with new variable "biomarker_adj2"

# Use quantile regression, include the confounder,
# change suffix of returned variable:
adjust_batch(
  data = df,
  markers = biomarker,
  batch = tma,
  method = quantreg,
  confounders = confounder,
  suffix = "_batchadjusted"
)
# Returns data set with new variable "biomarker_batchadjusted"

[Package batchtma version 0.1.6 Index]