calculate_diff_abundance {protti}    R Documentation

Calculate differential abundance between conditions

Description

Performs differential abundance calculations and statistical hypothesis tests on data frames with protein, peptide or precursor data. Different methods for statistical testing are available.

Usage

calculate_diff_abundance(
  data,
  sample,
  condition,
  grouping,
  intensity_log2,
  missingness = missingness,
  comparison = comparison,
  mean = NULL,
  sd = NULL,
  n_samples = NULL,
  ref_condition = "all",
  filter_NA_missingness = TRUE,
  method = c("moderated_t-test", "t-test", "t-test_mean_sd", "proDA"),
  p_adj_method = "BH",
  retain_columns = NULL
)

Arguments

data

a data frame containing at least the input variables that are required for the selected method. Ideally the output of assign_missingness or impute is used.

sample

a character column in the data data frame that contains the sample name. Is not required if method = "t-test_mean_sd".

condition

a character or numeric column in the data data frame that contains the conditions.

grouping

a character column in the data data frame that contains precursor, peptide or protein identifiers.

intensity_log2

a numeric column in the data data frame that contains intensity values. The intensity values need to be log2 transformed. Is not required if method = "t-test_mean_sd".
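As an illustration of why log2-transformed intensities are expected, the sketch below transforms hypothetical raw intensities and takes the difference of the condition means, which is a log2 fold change. The data frame, its column names and the "treated minus control" direction are assumptions made for this example only; it uses base R and is not the function's internal code.

```r
# Hypothetical raw intensities for one peptide in two conditions
raw <- data.frame(
  condition = rep(c("treated", "control"), each = 3),
  intensity = c(4000, 4200, 3900, 1000, 1100, 950)
)

# Log2 transform before any differential abundance calculation
raw$intensity_log2 <- log2(raw$intensity)

# On the log2 scale, a difference of means is a log2 fold change
# (direction treated minus control is an assumption of this sketch)
means <- tapply(raw$intensity_log2, raw$condition, mean)
log2_fc <- unname(means["treated"] - means["control"])
log2_fc
```

A roughly four-fold higher raw intensity in the treated condition therefore shows up as a difference close to 2.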

missingness

a character column in the data data frame that contains missingness information. Can be obtained by calling assign_missingness(). Is not required if method = "t-test_mean_sd". The type of missingness assigned to a comparison does not have any influence on the statistical test. However, if filter_NA_missingness = TRUE and method = "proDA", then comparisons with missingness NA are filtered out prior to p-value adjustment.

comparison

a character column in the data data frame that contains information on treatment/reference condition pairs. Can be obtained by calling assign_missingness(). Comparisons need to be in the form condition1_vs_condition2, meaning the two compared conditions are separated by "_vs_". This column determines for which condition pairs differential abundances are calculated. Is not required if method = "t-test_mean_sd"; in that case, provide a reference condition with the ref_condition argument.
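The condition1_vs_condition2 format can be illustrated with base R. The sketch below builds all pairwise labels for three hypothetical condition names and splits them back into their two parts. How assign_missingness() orders the conditions within a pair is not shown here, so the ordering below is an assumption.

```r
# Hypothetical condition names
conditions <- c("treated", "control_a", "control_b")

# Build all pairwise labels in the form condition1_vs_condition2
pairs <- t(combn(conditions, 2))
comparison <- paste0(pairs[, 1], "_vs_", pairs[, 2])
comparison

# Split the labels back into the two compared conditions
parts <- do.call(rbind, strsplit(comparison, "_vs_", fixed = TRUE))
colnames(parts) <- c("condition1", "condition2")
parts
```

Because "_vs_" acts as the separator, condition names that themselves contain "_vs_" would make such labels ambiguous.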

mean

a numeric column in the data data frame that contains mean values for two conditions. Is only required if method = "t-test_mean_sd".

sd

a numeric column in the data data frame that contains standard deviations for two conditions. Is only required if method = "t-test_mean_sd".

n_samples

a numeric column in the data data frame that contains the number of samples per condition for two conditions. Is only required if method = "t-test_mean_sd".

ref_condition

optional, character value providing the condition that is used as a reference for differential abundance calculation. Only required for method = "t-test_mean_sd". Instead of providing one reference condition, "all" can be supplied, which will create all pairwise condition pairs. By default ref_condition = "all".

filter_NA_missingness

a logical value, default is TRUE. For all methods except "t-test_mean_sd", missingness information has to be provided. This information can be obtained, for example, by calling assign_missingness(). If a reference/treatment pair has too few samples to be considered robust based on user-defined cutoffs, it is annotated with NA as missingness by the assign_missingness() function. If this argument is TRUE, these NA reference/treatment pairs are filtered out. For method = "proDA" this is done before the p-value adjustment.

method

a character value, specifies the method used for statistical hypothesis testing. Methods include a Welch test ("t-test"), a Welch test on means, standard deviations and number of replicates ("t-test_mean_sd") and a moderated t-test based on the limma package ("moderated_t-test"). More information on the moderated t-test can be found in the limma documentation. Furthermore, the proDA package specific method ("proDA") can be used to infer means across samples based on a probabilistic dropout model. This eliminates the need for data imputation since missing values are inferred from the model. More information can be found in the proDA documentation. We do not recommend using the "moderated_t-test" or "proDA" method if the data was filtered for low CVs or if imputation was performed. Default is method = "moderated_t-test".
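To make the idea of a Welch test on means, standard deviations and number of replicates concrete, the following base R sketch computes the test from summary statistics alone and compares it with stats::t.test() on the raw values. The helper welch_from_summary() and its output names are hypothetical and written for this illustration; this is the textbook Welch formula, not necessarily the exact code used by the package.

```r
# Hypothetical helper: Welch test from summary statistics only
welch_from_summary <- function(mean1, sd1, n1, mean2, sd2, n2) {
  v1 <- sd1^2 / n1 # squared standard error of mean 1
  v2 <- sd2^2 / n2 # squared standard error of mean 2
  std_error <- sqrt(v1 + v2)
  t_statistic <- (mean1 - mean2) / std_error
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  pval <- 2 * pt(-abs(t_statistic), df)
  list(
    diff = mean1 - mean2, std_error = std_error,
    t_statistic = t_statistic, df = df, pval = pval
  )
}

# Simulated log2 intensities for one peptide in two conditions
set.seed(1)
treated <- rnorm(4, mean = 12, sd = 0.3)
control <- rnorm(4, mean = 11, sd = 0.3)

from_summary <- welch_from_summary(
  mean(treated), sd(treated), length(treated),
  mean(control), sd(control), length(control)
)
from_raw <- t.test(treated, control, var.equal = FALSE)

# Both routes should agree up to numerical tolerance
all.equal(from_summary$pval, from_raw$p.value)
```

This shows why means, standard deviations and sample counts are sufficient inputs for a Welch test when the raw replicate values are not available.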

p_adj_method

a character value, specifies the p-value correction method. Possible methods are c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default method is "BH".
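The listed method names match those accepted by stats::p.adjust(). As a small sketch of how the choice of method changes the result, the example below adjusts a vector of made-up p-values with three of them. It assumes nothing about how the package calls p.adjust() internally.

```r
# Made-up p-values for five comparisons
pvals <- c(0.001, 0.008, 0.02, 0.04, 0.3)

# Adjust the same p-values with three of the available methods
adjusted <- sapply(
  c("BH", "bonferroni", "holm"),
  function(m) p.adjust(pvals, method = m)
)
round(adjusted, 3)
```

"BH" controls the false discovery rate and is less conservative than "bonferroni" or "holm", which control the family-wise error rate.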

retain_columns

a vector indicating if certain columns should be retained from the input data frame. By default no additional columns are retained (retain_columns = NULL). Specific columns can be retained by providing their names (not in quotation marks, just like other column names, but in a vector). Please note that if you retain columns that have multiple rows per grouped variable there will be duplicated rows in the output.

Value

A data frame that contains differential abundances (diff), p-values (pval) and adjusted p-values (adj_pval) for each protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair. Depending on the method, the data frame contains additional columns.

For all methods except "proDA", the p-value adjustment is performed only on the portion of the data that contains a p-value that is not NA. For "proDA" the p-value adjustment is either performed on the subset of the dataset with missingness that is not NA (filter_NA_missingness = TRUE) or on the complete dataset (filter_NA_missingness = FALSE).
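Whether comparisons are removed before or after the adjustment changes the adjusted p-values, because the correction depends on the number of tests. The base R sketch below demonstrates this on a made-up result table. The column names, missingness labels and values are hypothetical, and the code only illustrates the principle rather than the package's internal procedure.

```r
# Made-up test results; NA missingness marks comparisons with too few samples
res <- data.frame(
  peptide = paste0("pep_", 1:6),
  missingness = c("complete", "complete", NA, "MAR", NA, "complete"),
  pval = c(0.001, 0.01, 0.02, 0.03, 0.04, 0.5)
)

# Option 1: adjust across all comparisons
res$adj_all <- p.adjust(res$pval, method = "BH")

# Option 2: drop NA missingness first, then adjust the remaining p-values
keep <- !is.na(res$missingness)
res$adj_filtered <- NA_real_
res$adj_filtered[keep] <- p.adjust(res$pval[keep], method = "BH")

res
```

In this example, filtering first leaves fewer tests to correct for, so the retained comparisons receive adjusted p-values that are no larger than when all rows are adjusted together.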

Examples

set.seed(123) # Makes example reproducible

# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, change_peptide)
)

# Calculate differential abundances
# Using "moderated_t-test" and "proDA" improves
# true positive recovery progressively
diff <- calculate_diff_abundance(
  data = data_missing,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  missingness = missingness,
  comparison = comparison,
  method = "t-test",
  retain_columns = c(protein, change_peptide)
)

head(diff, n = 10)

[Package protti version 0.8.0 Index]