SeqId {SomaDataIO}    R Documentation

Working with SomaLogic SeqIds

Description

The SeqId is the cornerstone used to uniquely identify SomaLogic analytes. SeqIds follow the format <Pool>-<Clone>_<Version>; for example, "1234-56_7" can be represented as:

Pool    Clone    Version
1234    56       7

See Details below for the definition of each sub-unit. The <Pool>-<Clone> combination is sufficient to uniquely identify a specific analyte, and therefore versions are no longer provided (though they may be present in legacy ADATs). The tools below enable users to extract, test, identify, compare, and manipulate SeqIds across assay runs and/or versions.
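To make the format concrete, the following base R sketch splits a "naked" SeqId into its sub-units. The helper split_seqid() and its regular expressions are hypothetical and written for illustration only; they are not part of SomaDataIO and are not claimed to match the pattern returned by regexSeqId().

```r
# Hypothetical helper (NOT part of SomaDataIO): split "naked" SeqIds
# of the assumed form <Pool>-<Clone> or <Pool>-<Clone>_<Version>.
split_seqid <- function(x) {
  stopifnot(is.character(x))
  # Assumed pattern: digits, hyphen, digits, optional "_<digits>" version
  ok <- grepl("^[0-9]+-[0-9]+(_[0-9]+)?$", x)
  pool    <- ifelse(ok, sub("-.*$", "", x), NA_character_)
  clone   <- ifelse(ok, sub("^[0-9]+-([0-9]+).*$", "\\1", x), NA_character_)
  # Version is optional; NA when no "_<Version>" suffix is present
  version <- ifelse(ok & grepl("_", x), sub("^.*_", "", x), NA_character_)
  data.frame(SeqId = x, Pool = pool, Clone = clone, Version = version,
             stringsAsFactors = FALSE)
}

res <- split_seqid(c("1234-56_7", "1234-56", "AGR2"))
res
```

Strings that do not look like a SeqId (here "AGR2") yield NA in every sub-unit column, and an unversioned SeqId yields NA only for Version.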

Usage

getSeqId(x, trim.version = FALSE)

regexSeqId()

locateSeqId(x, trailing = TRUE)

seqid2apt(x)

apt2seqid(x)

is.apt(x)

is.SeqId(x)

matchSeqIds(x, y, order.by.x = TRUE)

getSeqIdMatches(x, y, show = FALSE)

Arguments

x

Character. A vector of strings, usually analyte/feature column names, AptNames, or SeqIds. For seqid2apt(), a vector of SeqIds. For apt2seqid(), a character vector containing SeqIds. For matchSeqIds(), a vector of pattern matches containing SeqIds. Can be AptNames with GeneIDs, the seq.XXXX format, or even "naked" SeqIds.

trim.version

Logical. Whether to remove the version number, i.e. "1234-56_7" -> "1234-56". Primarily for legacy ADATs.

trailing

Logical. Should the regular expression be anchored so that the SeqId pattern matches only at the end of the string, i.e. "regex$"? This is the most common case and the default.

y

Character. A second vector of AptNames containing SeqIds to match against those contained in x. For matchSeqIds(), these values are returned if there are matching elements.

order.by.x

Logical. Order the returned character vector by the x (first) argument?

show

Logical. Return the data frame visibly?
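The effect of the trailing argument can be illustrated with a small base R sketch. The pattern pat below is an assumed, simplified stand-in chosen for this example; the pattern actually used by the package is the one returned by regexSeqId() and may differ.

```r
# Assumed, simplified SeqId-like pattern for illustration only;
# the pattern actually used by the package comes from regexSeqId().
pat <- "[0-9]{4,5}[-.][0-9]{1,3}"

x <- c("seq.3948.48", "3948.48.extra")

# Trailing match: the pattern must end at the end of the string ("regex$")
anchored <- grepl(paste0(pat, "$"), x)

# Unanchored match: the pattern may occur anywhere in the string
unanchored <- grepl(pat, x)

data.frame(x, anchored, unanchored)
```

With the trailing anchor, only strings that end in the SeqId-like pattern are matched; without it, a match anywhere in the string counts.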

Details

Pool: ties back to the original well during SELEX
Clone: ties to the specific sequence within a pool
Version: refers to custom modifications (optional/defunct)
AptName: a SeqId combined with a string, usually a GeneId- or seq.-prefix, for convenient, human-readable manipulation from within R.

Value

getSeqId(): a character vector of SeqIds captured from a string.

regexSeqId(): a regular expression (regex) string pre-defined to match the SomaLogic SeqId pattern.

locateSeqId(): a data frame containing the start and stop integer positions for SeqId matches at each value of x.

seqid2apt(): a character vector with the seq.* prefix, i.e. the inverse of getSeqId().

apt2seqid(): a character vector of SeqIds. is.SeqId() will return TRUE for all elements.

is.apt(), is.SeqId(): Logical. TRUE or FALSE.

matchSeqIds(): a character vector of the values in y whose SeqIds are in the intersect of x and y. If no matches are found, character(0).

getSeqIdMatches(): an n x 2 data frame, where n is the length of the intersect of the matching SeqIds. The data frame is named by the passed arguments, x and y.
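The matching semantics described for matchSeqIds() and its order.by.x argument can be approximated in base R as a rough sketch. The helper to_seqid() and its pattern are hypothetical, assume unversioned SeqIds at the end of each string, and are not the package's implementation.

```r
# Hypothetical helper (NOT part of SomaDataIO): pull a trailing
# "<Pool>-<Clone>" from names such as "seq.4554.56" or "4554-56".
# Assumes no legacy version suffix is present.
to_seqid <- function(v) {
  pat <- "([0-9]{4,5})[.-]([0-9]{1,3})$"
  hit <- regexpr(pat, v)
  out <- rep(NA_character_, length(v))
  # Normalize the separator to a hyphen for matched elements only
  out[hit > 0] <- sub("\\.", "-", regmatches(v, hit))
  out
}

x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")

sx <- to_seqid(x)
sy <- to_seqid(y)

# SeqIds common to both vectors, in the order they appear in x
common <- intersect(sx, sy)
common <- common[!is.na(common)]

by_x <- y[match(common, sy)]  # values of y, ordered by x
by_y <- y[sy %in% common]     # values of y, in their original order
```

Elements without a SeqId (for example "PlateId" or "Group") drop out of the intersect, so only values of y that share a SeqId with x remain.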

Author(s)

Stu Field

See Also

intersect()

Examples

x <- c("ABDC.3948.48.2", "3948.88",
       "3948.48.2", "3948-48_2", "3948.48.2",
       "3948-48_2", "3948-88",
       "My.Favorite.Apt.3948.88.9")

tibble::tibble(orig       = x,
               SeqId      = getSeqId(x),
               SeqId_trim = getSeqId(x, TRUE),
               AptName    = seqid2apt(SeqId))

# Logical Matching
is.apt("AGR2.4959.2") # TRUE
is.apt("seq.4959.2")  # TRUE
is.apt("4959-2")      # TRUE
is.apt("AGR2")        # FALSE


# SeqId Matching
x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")
matchSeqIds(x, y)
matchSeqIds(x, y, order.by.x = FALSE)

# vector of features
feats <- getAnalytes(example_data)

match_df <- getSeqIdMatches(feats[1:100], feats[90:500])  # 11 overlapping
match_df

a <- utils::head(feats, 15)
b <- withr::with_seed(99, sample(getSeqId(a)))   # => SeqId & shuffle
(getSeqIdMatches(a, b))                          # sorted by first vector "a"

[Package SomaDataIO version 6.1.0 Index]