SeqId {SomaDataIO} | R Documentation |
Working with SomaLogic SeqIds
Description
The SeqId is the cornerstone used to uniquely identify SomaLogic analytes.
SeqIds follow the format <Pool>-<Clone>_<Version>; for example, "1234-56_7"
can be represented as:
Pool | Clone | Version |
1234 | 56 | 7 |
See Details below for the definition of each sub-unit.
The <Pool>-<Clone> combination is sufficient to uniquely identify a
specific analyte, and therefore versions are no longer provided (though
they may be present in legacy ADATs).
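The format can be illustrated with a minimal base-R sketch that splits SeqId strings into their sub-units. The helper name split_seqid() and its regular expressions are assumptions made for illustration; they are not part of SomaDataIO and only cover the hyphen/underscore form shown above.

```r
# Hypothetical helper (not part of SomaDataIO): split "<Pool>-<Clone>_<Version>"
# strings into their sub-units. The version suffix is treated as optional.
split_seqid <- function(x) {
  stopifnot(is.character(x))
  # TRUE only for strings that are entirely a SeqId in hyphen/underscore form
  ok <- grepl("^[0-9]+-[0-9]+(_[0-9]+)?$", x)
  pool    <- sub("^([0-9]+)-.*$", "\\1", x)
  clone   <- sub("^[0-9]+-([0-9]+).*$", "\\1", x)
  version <- ifelse(grepl("_", x, fixed = TRUE),
                    sub("^.*_", "", x), NA_character_)
  out <- data.frame(Pool = pool, Clone = clone, Version = version,
                    stringsAsFactors = FALSE)
  out[!ok, ] <- NA_character_   # blank out rows that are not SeqIds
  out
}

res <- split_seqid(c("1234-56_7", "1234-56", "PlateId"))
res
```

The second element has no version, so its Version is NA, and the non-SeqId string yields a row of NA values.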
The tools below enable users to extract, test, identify, compare,
and manipulate SeqIds across assay runs and/or versions.
Usage
getSeqId(x, trim.version = FALSE)
regexSeqId()
locateSeqId(x, trailing = TRUE)
seqid2apt(x)
apt2seqid(x)
is.apt(x)
is.SeqId(x)
matchSeqIds(x, y, order.by.x = TRUE)
getSeqIdMatches(x, y, show = FALSE)
Arguments
x |
Character. A vector of strings, usually analyte/feature column
names, |
trim.version |
Logical. Whether to remove the version number, i.e. "1234-56_7" -> "1234-56". Primarily for legacy ADATs. |
trailing |
Logical. Should the regular expression explicitly specify
trailing |
y |
Character. A second vector of |
order.by.x |
Logical. Order the returned character string by
the |
show |
Logical. Return the data frame visibly? |
Details
Pool: | ties back to the original well during SELEX |
Clone: | ties to the specific sequence within a pool |
Version: | refers to custom modifications (optional/defunct) |
AptName: | a SeqId combined with a string, usually a GeneId- or seq.-prefix, for convenient, human-readable manipulation from within R |
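The relationship between a SeqId and an anonymous AptName can be sketched in base R as follows. The helpers seqid_to_aptname() and aptname_to_seqid() are hypothetical names used only for illustration; within the package, seqid2apt() and apt2seqid() are the documented tools, and their internals may differ from this sketch.

```r
# Hypothetical helpers (not part of SomaDataIO), shown only to illustrate
# the SeqId <-> anonymous-AptName naming convention.
seqid_to_aptname <- function(seqid) {
  trimmed <- sub("_[0-9]+$", "", seqid)                 # drop optional version
  paste0("seq.", sub("-", ".", trimmed, fixed = TRUE))  # "-" becomes "."
}

# Assumption: only un-versioned "seq.<Pool>.<Clone>" names are handled;
# any other input is returned unchanged.
aptname_to_seqid <- function(apt) {
  sub("^seq\\.([0-9]+)\\.([0-9]+)$", "\\1-\\2", apt)
}

apts <- seqid_to_aptname(c("1234-56_7", "1234-56"))
ids  <- aptname_to_seqid("seq.1234.56")
```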
Value
getSeqId(): a character vector of SeqIds captured from a string.
regexSeqId(): a regular expression (regex) string pre-defined to match
the SomaLogic SeqId pattern.
locateSeqId(): a data frame containing the start and stop integer
positions for SeqId matches at each value of x.
seqid2apt(): a character vector with the seq.* prefix, i.e. the inverse
of getSeqId().
apt2seqid(): a character vector of SeqIds. is.SeqId() will return TRUE
for all elements.
is.apt(), is.SeqId(): Logical. TRUE or FALSE.
matchSeqIds(): a character string corresponding to values in y of the
intersect of x and y. If no matches are found, character(0).
getSeqIdMatches(): an n x 2 data frame, where n is the length of the
intersect of the matching SeqIds. The data frame is named by the passed
arguments, x and y.
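The idea of matching two vectors on their intersecting SeqIds can be approximated in base R as below. This is a sketch under stated assumptions, not the package's implementation: seqid_key() is a hypothetical helper, and its simplified pattern only recognizes a trailing <Pool>-<Clone> (or <Pool>.<Clone>) without a version.

```r
# Hypothetical helper: reduce each string to a "<Pool>-<Clone>" key,
# or NA when no trailing SeqId-like pattern is present.
seqid_key <- function(v) {
  pat   <- "[0-9]+[.-][0-9]+$"   # simplified, un-versioned pattern (assumption)
  found <- grepl(pat, v)
  key   <- rep(NA_character_, length(v))
  key[found] <- chartr(".", "-", regmatches(v, regexpr(pat, v)))
  key
}

x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")

kx <- seqid_key(x)
ky <- seqid_key(y)

# intersect() keeps the order of its first argument, so rows follow x
common  <- intersect(kx[!is.na(kx)], ky[!is.na(ky)])
matches <- data.frame(x = x[match(common, kx)],
                      y = y[match(common, ky)],
                      stringsAsFactors = FALSE)
matches
```

Elements without a SeqId ("PlateId", "Group", "Assay") drop out, leaving a two-row mapping between the two naming styles.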
Functions
- getSeqId(): extracts/captures the SeqId match from an analyte column identifier, i.e. the column name of an ADAT loaded with read_adat(). Assumes the SeqId pattern occurs at the end of the string, which is true in the vast majority of cases. For edge cases, see the trailing argument to locateSeqId().
- regexSeqId(): generates a pre-formatted regular expression for matching SeqIds. Note the trailing match, which is most commonly required; locateSeqId() offers an alternative to match anywhere in a string. Used internally in many utility functions.
- locateSeqId(): generates a data frame of the positional SeqId matches. Specifically designed to facilitate SeqId extraction via substr(). Similar to stringr::str_locate().
- seqid2apt(): converts a SeqId into anonymous-AptName format, i.e. 1234-56 -> seq.1234.56. Version numbers (1234-56_ver) are always trimmed when present.
- apt2seqid(): converts an anonymous-AptName into SeqId format, i.e. seq.1234.56 -> 1234-56. Version numbers (seq.1234.56.ver) are always trimmed when present.
- is.apt(): regular expression match to determine if a string contains a SeqId, and thus is probably an AptName format string. Both legacy EntrezGeneSymbol-SeqId combinations and the newer so-called "anonymous-AptNames" formats (seq.1234.45) are matched.
- is.SeqId(): tests for SeqId format, i.e. values returned from getSeqId() will always return TRUE.
- matchSeqIds(): matches two character vectors on the basis of their intersecting SeqIds. Note that elements in y not containing a SeqId regular expression are silently dropped.
- getSeqIdMatches(): matches two character vectors on the basis of their intersecting SeqIds only (irrespective of the GeneID-prefix). This produces a two-column data frame which can then be used to map between the two sets. The final order of the matches/rows is by the input corresponding to the first argument (x). By default the data frame is invisibly returned to avoid dumping excess output to the console (see the show = argument).
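The locate-then-substr() workflow mentioned for locateSeqId() can be sketched with base R's regexpr(). The function locate_trailing_seqid() and its pattern are assumptions for illustration only; the regular expression actually returned by regexSeqId() may differ.

```r
# Hypothetical stand-in for the locate + substr() idea (not SomaDataIO code).
locate_trailing_seqid <- function(x) {
  # Assumed pattern: trailing <Pool>[.-]<Clone> with an optional version
  pat <- "[0-9]+[.-][0-9]+([._][0-9]+)?$"
  m   <- regexpr(pat, x)
  start    <- as.integer(m)                       # -1 where nothing matched
  stop_pos <- start + attr(m, "match.length") - 1L
  miss <- start == -1L
  start[miss]    <- NA_integer_
  stop_pos[miss] <- NA_integer_
  data.frame(start = start, stop = stop_pos)
}

x   <- c("ABDC.3948.48.2", "seq.3714.49", "PlateId")
pos <- locate_trailing_seqid(x)
ids <- substr(x, pos$start, pos$stop)   # NA where no match was located
ids
```

Returning positions rather than the matched text keeps the extraction step separate, so the same positions can be reused to strip or replace the matched portion.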
Author(s)
Stu Field
See Also
Examples
library(SomaDataIO)   # provides getSeqId(), seqid2apt(), example_data, etc.
x <- c("ABDC.3948.48.2", "3948.88",
"3948.48.2", "3948-48_2", "3948.48.2",
"3948-48_2", "3948-88",
"My.Favorite.Apt.3948.88.9")
tibble::tibble(orig = x,
SeqId = getSeqId(x),
SeqId_trim = getSeqId(x, TRUE),
AptName = seqid2apt(SeqId))
# Logical Matching
is.apt("AGR2.4959.2") # TRUE
is.apt("seq.4959.2") # TRUE
is.apt("4959-2") # TRUE
is.apt("AGR2") # FALSE
# SeqId Matching
x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")
matchSeqIds(x, y)
matchSeqIds(x, y, order.by.x = FALSE)
# vector of features
feats <- getAnalytes(example_data)
match_df <- getSeqIdMatches(feats[1:100], feats[90:500]) # 11 overlapping
match_df
a <- utils::head(feats, 15)
b <- withr::with_seed(99, sample(getSeqId(a))) # => SeqId & shuffle
(getSeqIdMatches(a, b)) # sorted by first vector "a"