R: Filter Data Rows and Subset Data Columns

filterInputData {NAIR}

R Documentation

Filter Data Rows and Subset Data Columns

Description

Given a data frame with a column containing receptor sequences, filter data rows by sequence length and sequence content. Keep all data columns or choose which columns to keep.

Usage

filterInputData(
  data,
  seq_col,
  min_seq_length = NULL,
  drop_matches = NULL,
  subset_cols = NULL,
  count_col = deprecated(),
  verbose = FALSE
)

Arguments

`data`	A data frame.
`seq_col`	Specifies the column(s) of `data` containing the receptor sequences. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. Each column specified will be coerced to a character vector. Data rows containing a value of `NA` in any of the specified columns will be dropped.
`min_seq_length`	Observations whose receptor sequences have fewer than `min_seq_length` characters are dropped. The default `NULL` applies no length filter.
`drop_matches`	Accepts a character string containing a regular expression (see `regex`). Checks values in the receptor sequence column for a pattern match using `grep()`. Rows in which a match is found are dropped.
`subset_cols`	Specifies which columns of the AIRR-Seq data are included in the output. Accepts a character vector of column names or a numeric vector of column indices. The default `NULL` includes all columns. The receptor sequence column is always included regardless of this argument's value.
`count_col`	Deprecated. Does nothing.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
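
The filtering rules described by the arguments above can be illustrated with a base R sketch. This is not the package's implementation; it only mirrors the documented behavior (drop `NA` sequences, drop short sequences, drop pattern matches, then subset columns). The data frame `toy`, its column names, and the threshold and pattern values are hypothetical and chosen purely for illustration. The column order of the package's actual output is not specified in this documentation, so the order shown here is an assumption.

```r
# Hypothetical toy data; names and values are illustrative assumptions.
toy <- data.frame(
  CloneSeq   = c("CASSLGGGGTEAFF", "CASSL", NA, "CASSPDRGNTEAFF"),
  CloneCount = c(10, 3, 7, 5),
  SampleID   = c("S1", "S1", "S2", "S2"),
  stringsAsFactors = FALSE
)

seq_col        <- "CloneSeq"
min_seq_length <- 6        # assumed threshold for this sketch
drop_matches   <- "GGGG"   # regular expression; any match drops the row
subset_cols    <- "SampleID"

# Sequences are coerced to character, as the seq_col argument describes.
seqs <- as.character(toy[[seq_col]])

keep <- !is.na(seqs)                          # rows with NA sequences are dropped
keep <- keep & nchar(seqs) >= min_seq_length  # rows with short sequences are dropped
keep <- keep & !grepl(drop_matches, seqs)     # rows matching the pattern are dropped

# The sequence column is always retained, whatever subset_cols says.
cols     <- union(seq_col, subset_cols)
filtered <- toy[keep, cols, drop = FALSE]
print(filtered)
```

Because `drop_matches` is a regular expression, a pattern such as `"GGGG|[*_]"` would drop rows whose sequence contains either the subsequence or one of the listed special characters.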

Value

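A data frame containing the rows of `data` that remain after filtering, restricted to the columns selected by `subset_cols` together with the receptor sequence column.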

Author(s)

Brian Neal (Brian.Neal@ucsf.edu)

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Examples

set.seed(42)
raw_data <- simulateToyData()

# Drop rows whose sequence is shorter than 13 characters,
# as well as rows whose sequence contains the subsequence "GGGG".
# Keep only the columns for clone sequence, clone frequency and sample ID.
filterInputData(
  raw_data,
  seq_col = "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGGG",
  subset_cols =
    c("CloneSeq", "CloneFrequency", "SampleID"),
  verbose = TRUE
)
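
As a follow-up to the example above, the sketch below stores the result and checks that the documented filters hold in the output. It assumes the NAIR package is installed and that the toy data contains the columns named in the example; the `requireNamespace()` guard and the `checks` vector are additions for illustration, not part of the package.

```r
# Run only if NAIR is available; otherwise the block does nothing.
if (requireNamespace("NAIR", quietly = TRUE)) {
  set.seed(42)
  raw_data <- NAIR::simulateToyData()

  filtered <- NAIR::filterInputData(
    raw_data,
    seq_col = "CloneSeq",
    min_seq_length = 13,
    drop_matches = "GGGG",
    subset_cols = c("CloneSeq", "CloneFrequency", "SampleID")
  )

  seqs <- as.character(filtered$CloneSeq)

  # Each entry should be TRUE if the documented filters were applied.
  checks <- c(
    min_length_ok = all(nchar(seqs) >= 13),
    no_match_left = !any(grepl("GGGG", seqs)),
    no_missing    = !anyNA(seqs)
  )
  print(checks)
}
```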

[Package NAIR version 1.0.4 Index]