| SeqId {SomaDataIO} | R Documentation |
Working with SomaLogic SeqIds
Description
The SeqId is the cornerstone used to uniquely identify
SomaLogic analytes.
SeqIds follow the format <Pool>-<Clone>_<Version>, for example
"1234-56_7" can be represented as:
| Pool | Clone | Version |
| 1234 | 56 | 7 |
See Details below for the definition of each sub-unit.
The <Pool>-<Clone> combination is sufficient to uniquely identify a
specific analyte and therefore versions are no longer provided (though
they may be present in legacy ADATs).
The tools below enable users to extract, test, identify, compare,
and manipulate SeqIds across assay runs and/or versions.
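The <Pool>-<Clone>_<Version> layout described above can be illustrated with a small base R sketch. The helper `parse_seqid()` and its regular expression are hypothetical, written only for this illustration; they are not part of SomaDataIO and may be stricter or looser than the pattern the package uses.

```r
# Hypothetical helper (not a SomaDataIO function): split SeqId strings
# into Pool, Clone, and optional Version sub-units using base R only.
parse_seqid <- function(x) {
  # Assumed pattern: digits, a hyphen, digits, then an optional "_<digits>"
  pattern <- "^([0-9]+)-([0-9]+)(_([0-9]+))?$"
  m <- regmatches(x, regexec(pattern, x))
  rows <- lapply(m, function(p) {
    # No match at all: return missing values for every sub-unit
    if (length(p) == 0L) return(rep(NA_character_, 3L))
    # Capture groups 1, 2, and 4 hold Pool, Clone, and Version
    p[c(2L, 3L, 5L)]
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("Pool", "Clone", "Version")
  # An absent version is captured as an empty string; report it as NA
  out$Version[!is.na(out$Version) & out$Version == ""] <- NA_character_
  out
}

res <- parse_seqid(c("1234-56_7", "1234-56"))
res
```

The second input has no version, which mirrors the newer, version-free SeqIds; the sketch reports its Version as missing rather than failing.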
Usage
getSeqId(x, trim.version = FALSE)
regexSeqId()
locateSeqId(x, trailing = TRUE)
seqid2apt(x)
apt2seqid(x)
is.apt(x)
is.SeqId(x)
matchSeqIds(x, y, order.by.x = TRUE)
getSeqIdMatches(x, y, show = FALSE)
Arguments
x |
Character. A vector of strings, usually analyte/feature column
names, |
trim.version |
Logical. Whether to remove the version number, i.e. "1234-56_7" -> "1234-56". Primarily for legacy ADATs. |
trailing |
Logical. Should the regular expression explicitly specify
trailing |
y |
Character. A second vector of |
order.by.x |
Logical. Order the returned character string by
the |
show |
Logical. Return the data frame visibly? |
Details
| Pool: | ties back to the original well during SELEX |
| Clone: | ties to the specific sequence within a pool |
| Version: | refers to custom modifications (optional/defunct) |
| AptName: | a SeqId combined with a string, usually a GeneId- or seq.-prefix, for convenient, human-readable manipulation from within R |
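As a rough illustration of how a SeqId relates to a seq.-prefixed AptName, the following base R sketch converts in both directions. The helpers `seqid_to_aptname()` and `aptname_to_seqid()` are hypothetical stand-ins written for this example; they are not the package's seqid2apt() or apt2seqid(), and they assume well-formed input.

```r
# Hypothetical base R helpers (not the SomaDataIO implementation).
seqid_to_aptname <- function(seqid) {
  # Drop an optional "_<Version>" suffix, then swap "-" for "."
  trimmed <- sub("_[0-9]+$", "", seqid)
  paste0("seq.", gsub("-", ".", trimmed, fixed = TRUE))
}

aptname_to_seqid <- function(aptname) {
  # Assumes the form "seq.<Pool>.<Clone>" with no version suffix
  sub("^seq\\.([0-9]+)\\.([0-9]+)$", "\\1-\\2", aptname)
}

apt <- seqid_to_aptname("1234-56_7")    # "seq.1234.56"
sid <- aptname_to_seqid("seq.1234.56")  # "1234-56"
```

The period-delimited form is a syntactically valid R name, which is presumably why it is convenient for column names, whereas a bare SeqId such as 1234-56 would need backticks.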
Value
getSeqId(): a character vector of SeqIds captured from a string.
regexSeqId(): a regular expression (regex) string
pre-defined to match the SomaLogic SeqId pattern.
locateSeqId(): a data frame containing the start and stop
integer positions for SeqId matches at each value of x.
seqid2apt(): a character vector with the seq.* prefix, i.e.
the inverse of getSeqId().
apt2seqid(): a character vector of SeqIds. is.SeqId() will
return TRUE for all elements.
is.apt(), is.SeqId(): Logical. TRUE or FALSE.
matchSeqIds(): a character vector of the values in y whose
SeqIds are in the intersect of x and y. If no matches are
found, character(0).
getSeqIdMatches(): an n x 2 data frame, where n is the
number of SeqIds in the intersect of x and y.
The data frame is named by the passed arguments, x and y.
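To make the start/stop return shape of locateSeqId() more concrete, here is a minimal base R sketch that builds a similar table with regexpr() and then extracts the matches with substr(). The pattern is an assumption chosen for this example (a trailing, version-free SeqId); it is not the string returned by regexSeqId(), and the column names are illustrative.

```r
x <- c("ABDC.3948.48", "seq.4554.56", "PlateId")

# Assumed pattern: 4-5 digits, "-" or ".", then 1-3 digits at end of string
pattern <- "[0-9]{4,5}[-.][0-9]{1,3}$"

m     <- regexpr(pattern, x)
start <- as.integer(m)              # -1 where there is no match
len   <- attr(m, "match.length")
start[start == -1L] <- NA_integer_  # report non-matches as NA

pos <- data.frame(
  x     = x,
  start = start,
  stop  = start + len - 1L,
  stringsAsFactors = FALSE
)

# Positions feed directly into substr(); non-matches stay NA
pos$match <- substr(pos$x, pos$start, pos$stop)
pos
```

Keeping one row per input element, with NA for non-matches, preserves alignment with the original vector, which is what makes position-based extraction with substr() straightforward.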
Functions
- getSeqId(): extracts/captures the SeqId match from an analyte column identifier, i.e. column name of an ADAT loaded with read_adat(). Assumes the SeqId pattern occurs at the end of the string, which for the vast majority of cases will be true. For edge cases, see the trailing argument to locateSeqId().
- regexSeqId(): generates a pre-formatted regular expression for matching of SeqIds. Note the trailing match, which is most commonly required, but locateSeqId() offers an alternative to match anywhere in a string. Used internally in many utility functions.
- locateSeqId(): generates a data frame of the positional SeqId matches. Specifically designed to facilitate SeqId extraction via substr(). Similar to stringr::str_locate().
- seqid2apt(): converts a SeqId into anonymous-AptName format, i.e. 1234-56 -> seq.1234.56. Version numbers (1234-56_ver) are always trimmed when present.
- apt2seqid(): converts an anonymous-AptName into SeqId format, i.e. seq.1234.56 -> 1234-56. Version numbers (seq.1234.56.ver) are always trimmed when present.
- is.apt(): regular expression match to determine if a string contains a SeqId, and thus is probably an AptName format string. Both legacy EntrezGeneSymbol-SeqId combinations and newer so-called "anonymous-AptNames" formats (seq.1234.45) are matched.
- is.SeqId(): tests for SeqId format, i.e. values returned from getSeqId() will always return TRUE.
- matchSeqIds(): matches two character vectors on the basis of their intersecting SeqIds. Note that elements in y not containing a SeqId regular expression are silently dropped.
- getSeqIdMatches(): matches two character vectors on the basis of their intersecting SeqIds only (irrespective of the GeneID-prefix). This produces a two-column data frame which can then be used to map between the two sets. The final order of the matches/rows is by the input corresponding to the first argument (x). By default the data frame is invisibly returned to avoid dumping excess output to the console (see the show = argument).
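The idea of matching two vectors on their intersecting SeqIds, regardless of prefix or delimiter, can be sketched in base R as follows. The helper `extract_key()` and its pattern are hypothetical and only approximate the behavior described above; this is not how matchSeqIds() or getSeqIdMatches() are implemented, and version suffixes are not handled.

```r
# Hypothetical helper: reduce each string to a "<Pool>-<Clone>" key,
# or NA when no trailing SeqId-like pattern is present.
extract_key <- function(v) {
  # Assumed pattern: any prefix, then 4-5 digits, "-" or ".", 1-3 digits
  pattern <- "^.*?([0-9]{4,5})[-.]([0-9]{1,3})$"
  ifelse(grepl(pattern, v, perl = TRUE),
         sub(pattern, "\\1-\\2", v, perl = TRUE),
         NA_character_)
}

x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")

kx <- extract_key(x)
ky <- extract_key(y)

# Look up each x key in y; skip elements of x that carry no SeqId,
# since match() would otherwise pair NA with NA
idx  <- match(kx, ky)
keep <- !is.na(kx) & !is.na(idx)

# Two-column map between the sets, ordered by the first vector
matches <- data.frame(x = x[keep], y = y[idx[keep]],
                      stringsAsFactors = FALSE)
matches
```

Normalizing both sides to a common key before calling match() is what allows "seq.4554.56" and "4554-56" to pair up, while elements without a SeqId, such as "PlateId" and "Group", drop out of the result.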
Author(s)
Stu Field
See Also
Examples
x <- c("ABDC.3948.48.2", "3948.88",
"3948.48.2", "3948-48_2", "3948.48.2",
"3948-48_2", "3948-88",
"My.Favorite.Apt.3948.88.9")
tibble::tibble(orig = x,
SeqId = getSeqId(x),
SeqId_trim = getSeqId(x, TRUE),
AptName = seqid2apt(SeqId))
# Logical Matching
is.apt("AGR2.4959.2") # TRUE
is.apt("seq.4959.2") # TRUE
is.apt("4959-2") # TRUE
is.apt("AGR2") # FALSE
# SeqId Matching
x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")
matchSeqIds(x, y)
matchSeqIds(x, y, order.by.x = FALSE)
# vector of features
feats <- getAnalytes(example_data)
match_df <- getSeqIdMatches(feats[1:100], feats[90:500]) # 11 overlapping
match_df
a <- utils::head(feats, 15)
b <- withr::with_seed(99, sample(getSeqId(a))) # => SeqId & shuffle
(getSeqIdMatches(a, b)) # sorted by first vector "a"