R: Find Public Clusters Among RepSeq Samples

findPublicClusters {NAIR}

R Documentation

Find Public Clusters Among RepSeq Samples

Description

Part of the workflow Searching for Public TCR/BCR Clusters.

Given multiple samples of bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, construct the repertoire network for each sample. Within each sample's network, perform cluster analysis and filter the clusters based on node count and aggregate clone count.

Usage

findPublicClusters(

  ## Input ##
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args,
  sample_ids =
    paste0("Sample", 1:length(file_list)),
  seq_col,
  count_col = NULL,

  ## Search Criteria ##
  min_seq_length = 3,
  drop_matches = "[*|_]",
  top_n_clusters = 20,
  min_node_count = 10,
  min_clone_count = 100,

  ## Optional Visualization ##
  plots = FALSE,
  print_plots = FALSE,
  plot_title = "auto",
  color_nodes_by = "cluster_id",

  ## Output ##
  output_dir,
  output_type = "rds",

  ## Optional Output ##
  output_dir_unfiltered = NULL,
  output_type_unfiltered = "rds",

  verbose = FALSE,

  ...

)

Arguments

`file_list`	A character vector of file paths, or a list containing `connections` and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to `loadDataFromFileList()`.
`input_type`	A character string specifying the file format of the sample data files. Options are `"table"`, `"txt"`, `"tsv"`, `"csv"`, `"rds"` and `"rda"`. Passed to `loadDataFromFileList()`.
`data_symbols`	Used when `input_type = "rda"`. Specifies the name of each sample's data frame within its respective Rdata file. Passed to `loadDataFromFileList()`.
`header`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `header` argument to `read.table()`, `read.csv()`, etc.
`sep`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify a non-default value of the `sep` argument to `read.table()`, `read.csv()`, etc.
`read.args`	For values of `input_type` other than `"rds"` and `"rda"`, this argument can be used to specify non-default values of optional arguments to `read.table()`, `read.csv()`, etc. Accepts a named list of argument values. Values of `header` and `sep` in this list take precedence over values specified via the `header` and `sep` arguments.
`sample_ids`	A character or numeric vector of sample IDs, whose length matches that of `file_list`. The values should be valid for use as filenames and should avoid using the forward slash or backslash characters (`/` or `\`).
`seq_col`	Specifies the column of each sample's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
`count_col`	Specifies the column of each sample's data frame containing the clone count (measure of clonal abundance). Accepts a character string containing the column name or a numeric scalar containing the column index. If `NULL`, the clusters in each sample's network will be selected solely based upon node count.
`min_seq_length`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`drop_matches`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample. Accepts a character string containing a regular expression (see `regex`). Checks TCR/BCR sequences for a pattern match using `grep()`. Those returning a match are dropped. By default, sequences containing any of the characters `*`, `|` or `_` are dropped.
`top_n_clusters`	The number of clusters from each sample to be automatically included among the filtered clusters, based on greatest node count.
`min_node_count`	Clusters with at least this many nodes will be included among the filtered clusters.
`min_clone_count`	Clusters with an aggregate clone count of at least this value will be included among the filtered clusters. A value of `NULL` ignores this criterion and does not select additional clusters based on clone count.
`plots`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`print_plots`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`plot_title`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`color_nodes_by`	Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`output_dir`	The file path of the directory for saving the output. The directory will be created if it does not already exist.
`output_type`	A character string specifying the file format to use for saving the output. Valid options include `"csv"`, `"rds"` and `"rda"`.
`output_dir_unfiltered`	An optional directory for saving the unfiltered network data for each sample. By default, only the filtered results are saved.
`output_type_unfiltered`	A character string specifying the file format to use for saving the unfiltered network data for each sample. Only applicable if `output_dir_unfiltered` is non-null. Passed to `buildRepSeqNetwork()` when constructing the network for each sample.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.
`...`	Other arguments to `buildRepSeqNetwork()` when constructing the network for each sample, not including `node_stats`, `stats_to_include`, `cluster_stats`, `cluster_id_name` or `output_name` (see Details).
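
The search criteria described above can be sketched in base R. This is only an illustration of the documented behavior, not the package's internal code. It assumes that `min_seq_length` drops sequences shorter than the given length (the argument is passed to `buildRepSeqNetwork()`, so that function's documentation is authoritative), and it reads the three cluster criteria as a union: a cluster is kept if it meets any one of them. The sequences, the `clusters` table and its column names are hypothetical.

```r
# Toy sequences; the values are made up for illustration.
seqs <- c("CASSLGQ", "CA", "CASS*GQ", "CAS_LGQ", "CAS|LGQ", "CASSYEQ")

# Step 1: sequence-level filtering, applied before the network is built.
min_seq_length <- 3
drop_matches <- "[*|_]"            # default pattern: any of *, | or _
too_short <- nchar(seqs) < min_seq_length
flagged <- grepl(drop_matches, seqs)
kept_seqs <- seqs[!too_short & !flagged]
print(kept_seqs)

# Step 2: cluster-level selection within one sample.
# Hypothetical per-cluster summary (names are assumptions, not NAIR's).
clusters <- data.frame(
  cluster_id = 1:6,
  node_count = c(12, 9, 4, 3, 2, 1),
  agg_count  = c(500, 80, 150, 20, 10, 5)
)
top_n_clusters <- 2
min_node_count <- 10
min_clone_count <- 100

# Criterion A: among the top n clusters by node count.
in_top_n <- rank(-clusters$node_count, ties.method = "first") <= top_n_clusters
# Criterion B: at least min_node_count nodes.
enough_nodes <- clusters$node_count >= min_node_count
# Criterion C: aggregate clone count threshold, skipped when NULL.
enough_count <- if (is.null(min_clone_count)) {
  rep(FALSE, nrow(clusters))
} else {
  clusters$agg_count >= min_clone_count
}

# A cluster is retained if it satisfies any one criterion.
selected <- clusters[in_top_n | enough_nodes | enough_count, ]
print(selected)
```

In this toy table, cluster 2 is retained only because of its rank by node count, and cluster 3 only because of its aggregate clone count, which shows why the criteria are best read as additive rather than as successive restrictions.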

Details

Each sample's network is constructed using an individual call to buildRepSeqNetwork() with node_stats = TRUE, stats_to_include = "all", cluster_stats = TRUE and cluster_id_name = "ClusterIDInSample". The node-level properties are renamed to reflect their correspondence to the sample-level network. Specifically, the properties are named:

SampleLevelNetworkDegree
SampleLevelTransitivity
SampleLevelCloseness
SampleLevelCentralityByCloseness
SampleLevelCentralityByEigen
SampleLevelEigenCentrality
SampleLevelBetweenness
SampleLevelCentralityByBetweenness
SampleLevelAuthorityScore
SampleLevelCoreness
SampleLevelPageRank

A variable SampleID is added to both the node-level and cluster-level metadata for each sample.

After the clusters in each sample are filtered, the node-level and cluster-level metadata are saved in the respective subdirectories node_meta_data and cluster_meta_data of the output directory specified by output_dir.

The unfiltered network results for each sample can also be saved by supplying a directory to output_dir_unfiltered, if these results are desired for downstream analysis. Each sample's unfiltered network results will then be saved to its own subdirectory created within this directory.

The files containing the node-level metadata for the filtered clusters can be supplied to buildPublicClusterNetwork() in order to construct a global network of public clusters. If the full global network is too large to construct in practice, the files containing the cluster-level metadata for the filtered clusters can instead be supplied to buildPublicClusterNetworkByRepresentative(), which builds a global network using only a single representative sequence from each cluster. Prominent public clusters can still be identified this way.
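
As a rough sketch of the hand-off described above, the file paths for the downstream functions can be collected with `list.files()`. The directory layout is simulated here so that the snippet runs without NAIR installed. The placeholder file names and contents are assumptions; only the subdirectory names `node_meta_data` and `cluster_meta_data` come from this page.

```r
# Stand-in for the directory that was passed as `output_dir`.
output_dir <- file.path(tempdir(), "public_clusters_demo")
dir_node <- file.path(output_dir, "node_meta_data")
dir_cluster <- file.path(output_dir, "cluster_meta_data")
dir.create(dir_node, recursive = TRUE, showWarnings = FALSE)
dir.create(dir_cluster, recursive = TRUE, showWarnings = FALSE)

# Placeholder files standing in for the saved per-sample results
# (hypothetical names and contents).
for (id in c("Sample1", "Sample2")) {
  saveRDS(data.frame(SampleID = id), file.path(dir_node, paste0(id, ".rds")))
  saveRDS(data.frame(SampleID = id), file.path(dir_cluster, paste0(id, ".rds")))
}

# Full paths, one file per sample, in each subdirectory.
node_files <- list.files(dir_node, full.names = TRUE)
cluster_files <- list.files(dir_cluster, full.names = TRUE)

# node_files    : candidate input for buildPublicClusterNetwork()
# cluster_files : candidate input for buildPublicClusterNetworkByRepresentative()
print(basename(node_files))
print(basename(cluster_files))
```

Listing the directory rather than rebuilding the paths from `sample_ids` avoids depending on the exact file naming scheme.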

See the Searching for Public TCR/BCR Clusters article on the package website.

Value

Returns TRUE, invisibly.
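
Because the return value is only an invisible `TRUE`, the results have to be read back from the saved files rather than captured from the call. The following toy function, a hypothetical stand-in rather than NAIR code, mimics that pattern:

```r
# Hypothetical stand-in that mimics the return behaviour only:
# write results to disk, then return TRUE invisibly.
write_results <- function(path) {
  saveRDS(data.frame(x = 1), path)
  invisible(TRUE)
}

out_file <- tempfile(fileext = ".rds")
status <- write_results(out_file)  # nothing is printed at the console
print(isTRUE(status))              # the captured value is just a status flag
result <- readRDS(out_file)        # the actual results come from the file
print(result)
```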

Author(s)

Brian Neal (Brian.Neal@ucsf.edu)

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Public TCR/BCR Clusters vignette

Examples

set.seed(42)

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# Relative generation probabilities
# (one row per sample, one column per base sequence)
pgen <- cbind(
  stats::toeplitz(0.6^(0:(samples - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples)
)
simulateToyData(
  samples = samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)

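## Locate the simulated sample files written to tempdir() ##
sample_files <- file.path(tempdir(), paste0("Sample", 1:samples, ".rds"))

## Build each sample's network and save the filtered cluster data ##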
findPublicClusters(
  file_list = sample_files,
  input_type = "rds",
  seq_col = "CloneSeq",
  count_col = "CloneCount",
  min_seq_length = NULL,
  drop_matches = NULL,
  top_n_clusters = 3,
  min_node_count = 5,
  min_clone_count = 15000,
  output_dir = tempdir()
)

[Package NAIR version 1.0.4 Index]