scale_features_lm {SIPmg}    R Documentation

Scale feature coverage values to estimate their absolute abundance

Description

Calculates global scaling factors for features (contigs or bins), based on linear regression of sequin coverage. Options include log-transformation of coverage values and filtering of features based on the limit of detection. This function must be called first, before the feature abundance table, feature detection table, and plots can be retrieved.

Usage

scale_features_lm(
  f_tibble,
  sequin_meta,
  seq_dilution,
  log_trans = TRUE,
  coe_of_variation = 250,
  lod_limit = 0,
  save_plots = TRUE,
  plot_dir = tempdir(),
  cook_filtering = TRUE
)

Arguments

f_tibble

Can be either (1) a tibble whose first column, "Feature", contains bin IDs and whose remaining columns represent samples with the bins' coverage values, or (2) a tibble as output by the "checkm coverage" program from the tool CheckM. See the CheckM documentation (https://github.com/Ecogenomics/CheckM) for usage of the "checkm coverage" program.

sequin_meta

tibble containing sequin names ("Feature" column) and concentrations in attomoles/uL ("Concentration" column).

seq_dilution

tibble with first column "Sample" with same sample names as in f_tibble, and a second column "Dilution" showing ratio of sequins added to final sample volume (e.g. a value of 0.01 for a dilution of 1 volume sequin to 99 volumes sample)
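The dilution arithmetic described above can be sketched in base R. This is only an illustration: the sample names and volumes are hypothetical, and a plain data frame is used so the snippet has no dependencies (tibble::as_tibble() would convert it to a tibble if needed).

```r
# Hypothetical volumes (in uL) of sequin mix and sample; names are made up.
sequin_volume <- c(sample_1 = 1, sample_2 = 2)
sample_volume <- c(sample_1 = 99, sample_2 = 98)

# Dilution = sequin volume / final volume (sequin + sample).
seq_dilution <- data.frame(
  Sample = names(sequin_volume),
  Dilution = unname(sequin_volume / (sequin_volume + sample_volume))
)

print(seq_dilution)
```

The values in the "Sample" column should match the sample column names used in f_tibble.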

log_trans

Boolean (TRUE or FALSE). Should coverages and sequin concentrations be log-scaled? Default = TRUE

coe_of_variation

Acceptable coefficient of variation for coverage and detection (e.g. 20 for a 20% coefficient of variation threshold). Coverages above the threshold value will be flagged in the plots. Default = 250
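As a rough illustration of what a coefficient of variation threshold means, the sketch below computes the CV (standard deviation divided by mean, as a percentage) for two hypothetical groups of coverage values and flags those above a threshold of 20. How the package itself groups values before computing the CV is not stated here, so the grouping is an assumption.

```r
# Hypothetical coverage values for two groups (group labels are made up).
coverage <- data.frame(
  group = rep(c("A", "B"), each = 3),
  cov = c(10, 11, 9, 5, 20, 50)
)

# Coefficient of variation per group, expressed as a percentage.
cv_percent <- tapply(coverage$cov, coverage$group,
                     function(x) 100 * sd(x) / mean(x))

# Flag groups whose CV exceeds the chosen threshold.
threshold <- 20
flagged <- cv_percent > threshold

print(cv_percent)
print(flagged)
```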

lod_limit

Decimal in the range 0-1. Threshold for the minimum fraction of sequins detected per concentration group. Default = 0

save_plots

Boolean (TRUE or FALSE). Should sequin scaling plots be saved? Default = TRUE

plot_dir

Directory where plots are to be saved. Will create a directory "sequin_scaling_plots_lm" if it does not exist.

cook_filtering

Boolean (TRUE or FALSE). Should data points be filtered based on Cook's distance? Cook's distance is useful for detecting influential outliers in an ordinary least squares regression model, which can distort the fit. The threshold used is 4/n (where n is the sample size), a typical choice for detecting outliers: any data point with Cook's distance > 4/n is filtered out. Default = TRUE
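The 4/n rule can be sketched with base R's lm() and cooks.distance(). This is not the package's internal code: the simulated data, the variable names, and the direction of the regression (concentration on coverage) are all assumptions made for illustration.

```r
set.seed(1)

# Simulated sequin data: 5 concentration levels, 3 replicates each.
log_conc <- log10(rep(c(0.1, 1, 10, 100, 1000), each = 3))
log_cov <- 0.9 * log_conc + 0.5 + rnorm(length(log_conc), sd = 0.05)

# Introduce one artificial outlier (coverage far lower than expected).
log_cov[15] <- log_cov[15] - 2

# Ordinary least squares fit on log-scaled values.
fit <- lm(log_conc ~ log_cov)

# Cook's distance for every point, compared against the 4/n threshold.
cd <- cooks.distance(fit)
n <- length(cd)
keep <- cd <= 4 / n

# Refit using only the points that pass the filter.
refit <- lm(log_conc ~ log_cov, subset = keep)

print(which(!keep))
print(coef(refit))
```

Filtering and refitting once, as here, is the simplest design; whether any further passes are applied is not specified above.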

Value

a list of tibbles containing

Examples

data(f_tibble, sequins, seq_dil)

### scaling sequins from coverage values
scaled_features_lm = scale_features_lm(f_tibble, sequin_meta, seq_dil)



[Package SIPmg version 1.4.1 Index]