R: Network Analysis of Immune Repertoire

buildRepSeqNetwork {NAIR}

R Documentation

Network Analysis of Immune Repertoire

Description

Given Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, builds the network graph for the immune repertoire based on sequence similarity, computes specified network properties and generates customized visualizations.

buildNet() is identical to buildRepSeqNetwork(), existing as an alias for convenience.

Usage

buildRepSeqNetwork(

  ## Input ##
  data,
  seq_col,
  count_col = NULL,
  subset_cols = NULL,
  min_seq_length = 3,
  drop_matches = NULL,

  ## Network ##
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  node_stats = FALSE,
  stats_to_include = chooseNodeStats(),
  cluster_stats = FALSE,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",

  ## Visualization ##
  plots = TRUE,
  print_plots = FALSE,
  plot_title = "auto",
  plot_subtitle = "auto",
  color_nodes_by = "auto",
  ...,

  ## Output ##
  output_dir = NULL,
  output_type = "rds",
  output_name = "MyRepSeqNetwork",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE

)

# Alias for buildRepSeqNetwork()
buildNet(
  data,
  seq_col,
  count_col = NULL,
  subset_cols = NULL,
  min_seq_length = 3,
  drop_matches = NULL,
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  node_stats = FALSE,
  stats_to_include = chooseNodeStats(),
  cluster_stats = FALSE,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",
  plots = TRUE,
  print_plots = FALSE,
  plot_title = "auto",
  plot_subtitle = "auto",
  color_nodes_by = "auto",
  ...,
  output_dir = NULL,
  output_type = "rds",
  output_name = "MyRepSeqNetwork",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE

)

Arguments

`data`	A data frame containing the AIRR-Seq data, with variables indexed by column and observations (e.g., clones or cells) indexed by row.
`seq_col`	Specifies the column(s) of `data` containing the receptor sequences to be used as the basis of similarity between rows. Accepts a character string containing the column name or a numeric scalar containing the column index. Also accepts a vector of length 2 specifying distinct sequence columns (e.g., alpha chain and beta chain), in which case similarity between rows depends on similarity in both sequence columns (see details).
`count_col`	Optional. Specifies the column of `data` containing a measure of abundance, e.g., clone count or unique molecular identifier (UMI) count. Accepts either the column name or column index. Passed to `addClusterStats()`; only relevant if `cluster_stats = TRUE`.
`subset_cols`	Specifies which columns of the AIRR-Seq data are included in the output. Accepts a vector of column names or a vector of column indices. The default `NULL` includes all columns. The receptor sequence column is always included regardless of this argument's value. Passed to `filterInputData()`.
`min_seq_length`	A numeric scalar, or `NULL`. Observations whose receptor sequences have fewer than `min_seq_length` characters are removed prior to network analysis.
`drop_matches`	Optional. Passed to `filterInputData()`. Accepts a character string containing a regular expression (see `regex`). Checks receptor sequences for a pattern match using `grep()`. Those returning a match are removed prior to network analysis.
`dist_type`	Specifies the function used to quantify the similarity between sequences. The similarity between two sequences determines the pairwise distance between their respective nodes in the network graph, with greater similarity corresponding to shorter distance. Valid options are `"hamming"` (the default), which uses `hamDistBounded()`, and `"levenshtein"`, which uses `levDistBounded()`.
`dist_cutoff`	A nonnegative scalar. Specifies the maximum pairwise distance (based on `dist_type`) for an edge connection to exist between two nodes. Pairs of nodes whose distance is less than or equal to this value will be joined by an edge connection in the network graph. Controls the stringency of the network construction and affects the number and density of edges in the network. A lower cutoff value requires greater similarity between sequences in order for their respective nodes to be joined by an edge connection. A value of `0` requires two sequences to be identical in order for their nodes to be joined by an edge.
`drop_isolated_nodes`	A logical scalar. When `TRUE`, removes each node that is not joined by an edge connection to any other node in the network graph.
`node_stats`	A logical scalar. Specifies whether node-level network properties are computed.
`stats_to_include`	A named logical vector returned by `chooseNodeStats()` or `exclusiveNodeStats()`. Specifies the node-level network properties to compute. Also accepts the value `"all"`. Only relevant if `node_stats = TRUE`.
`cluster_stats`	A logical scalar. Specifies whether to compute cluster-level network properties.
`cluster_fun`	Passed to `addClusterMembership()`. Specifies the clustering algorithm used when cluster analysis is performed. Cluster analysis is performed when `cluster_stats = TRUE` or when `node_stats = TRUE` with the `cluster_id` property enabled via the `stats_to_include` argument.
`cluster_id_name`	Passed to `addClusterMembership()`. Specifies the name of the cluster membership variable added to the node metadata when cluster analysis is performed (see `cluster_fun`).
`plots`	A logical scalar. Specifies whether to generate plots of the network graph.
`print_plots`	A logical scalar. If `plots = TRUE`, specifies whether the plots should be printed to the R plotting window.
`plot_title`	A character string or `NULL`. If `plots = TRUE`, this is the title used for each plot. The default value `"auto"` generates the title based on the value of the `output_name` argument.
`plot_subtitle`	A character string or `NULL`. If `plots = TRUE`, this is the subtitle used for each plot. The default value `"auto"` generates a subtitle based on the values of the `dist_type` and `dist_cutoff` arguments.
`color_nodes_by`	Optional. Specifies a variable to be used as metadata for coloring the nodes in the network graph plot. Accepts a character string. This can be a column name of `data` or (if `node_stats = TRUE`) the name of a computed node-level network property (based on `stats_to_include`). Also accepts a character vector specifying multiple variables, in which case one plot will be generated for each variable. The default value `"auto"` attempts to use one of several potential variables to color the nodes, depending on what is available. A value of `NULL` leaves the nodes uncolored.
`...`	Other named arguments to `addPlots()`.
`output_dir`	A file path specifying the directory for saving the output. The directory will be created if it does not exist. If `NULL`, output will be returned but not saved.
`output_type`	A character string specifying the file format to use when saving the output. The default value `"individual"` saves each element of the returned list as an individual uncompressed file, with data frames saved in csv format. For better compression, the values `"rda"` and `"rds"` save the returned list as a single file using the rda and rds format, respectively (in the former case, the list will be named `net` within the rda file). Regardless of the argument value, any plots generated will saved to a pdf file containing one plot per page.
`output_name`	A character string. All files saved will have file names beginning with this value.
`pdf_width`	Sets the width of each plot when writing to pdf. Passed to `saveNetwork()`.
`pdf_height`	Sets the height of each plot when writing to pdf. Passed to `saveNetwork()`.
`verbose`	Logical. If `TRUE`, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to `stderr()`.

Details

To construct the immune repertoire network, each TCR/BCR clone (bulk data) or cell (single-cell data) is modeled as a node in the network graph, corresponding to a single row of the AIRR-Seq data. For each node, the corresponding receptor sequence is considered. Both nucleotide and amino acid sequences are supported for this purpose. The receptor sequence is used as the basis of similarity and distance between nodes in the network.

Similarity between sequences is measured using either the Hamming distance or Levenshtein (edit) distance. The similarity determines the pairwise distance between nodes in the network graph. The more similar two sequences are, the shorter the distance between their respective nodes. Two nodes in the graph are joined by an edge if the distance between them is sufficiently small, i.e., if their receptor sequences are sufficiently similar.

For single-cell data, edge connections between nodes can be based on similarity in both the alpha chain and beta chain sequences. This is done by providing a vector of length 2 to seq_cols specifying the two sequence columns in data. The distance between two nodes is then the greater of the two distances between sequences in corresponding chains. Two nodes will be joined by an edge if their alpha chain sequences are sufficiently similar and their beta chain sequences are sufficiently similar.

See the buildRepSeqNetwork package vignette for more details. The vignette can be accessed offline using vignette("buildRepSeqNetwork").

Value

If the constructed network contains no nodes, the function will return NULL, invisibly, with a warning. Otherwise, the function invisibly returns a list containing the following items:

`details`	A list containing information about the network and the settings used during its construction.
`igraph`	An object of class `igraph` containing the list of nodes and edges for the network graph.
`adjacency_matrix`	The network graph adjacency matrix, stored as a sparse matrix of class `dgCMatrix` from the `Matrix` package. See `dgCMatrix-class`.
`node_data`	A data frame containing containing metadata for the network nodes, where each row corresponds to a node in the network graph. This data frame contains all variables from `data` (unless otherwise specified via `subset_cols`) in addition to the computed node-level network properties if `node_stats = TRUE`. Each row's name is the name of the corresponding row from `data`.
`cluster_data`	A data frame containing network properties for the clusters, where each row corresponds to a cluster in the network graph. Only included if `cluster_stats = TRUE`.
`plots`	A list containing one element for each plot generated as well as an additional element for the matrix that specifies the graph layout. Each plot is an object of class `ggraph`. Only included if `plots = TRUE`.

Author(s)

Brian Neal (Brian.Neal@ucsf.edu)

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

buildRepSeqNetwork vignette

Examples

set.seed(42)
toy_data <- simulateToyData()

# Simple call
network = buildNet(
  toy_data,
  seq_col = "CloneSeq",
  print_plots = TRUE
)

# Customized:
network <- buildNet(
  toy_data, "CloneSeq",
  dist_type = "levenshtein",
  node_stats = TRUE,
  cluster_stats = TRUE,
  cluster_fun = "louvain",
  cluster_id_name = "cluster_membership",
  count_col = "CloneCount",
  color_nodes_by = c("SampleID", "cluster_membership", "coreness"),
  color_scheme = c("default", "Viridis", "plasma-1"),
  size_nodes_by = "degree",
  node_size_limits = c(0.1, 1.5),
  plot_title = NULL,
  plot_subtitle = NULL,
  print_plots = TRUE,
  verbose = TRUE
)

typeof(network)

names(network)

network$details

head(network$node_data)

head(network$cluster_data)

[Package NAIR version 1.0.4 Index]