capture_first_vec {nc}    R Documentation
Capture first match in each character vector element
Description
Use a regular expression (regex) with capture groups to extract
the first matching text from each of several subject strings. For
all matches in one multi-line text file or string, use
capture_all_str. For the first match in every row of a data.frame,
using a different regex for each column, use capture_first_df. For
reading regularly named files, use capture_first_glob. For
matching column names in a wide data frame and then
melting/reshaping those columns to a taller/longer data frame, see
capture_melt_single and capture_melt_multiple. To simplify the
definition of the regex, you can use field, quantifier, and
alternatives.
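As a hedged illustration of the helper functions named above, the following sketch uses nc::quantifier to make a group optional and nc::field to capture the value of a labelled field. The subject strings are invented for illustration, and the code assumes the nc package is installed.

```r
# Minimal sketch; the subjects are invented, and the nc package is assumed
# to be installed (the block does nothing otherwise).
if (requireNamespace("nc", quietly = TRUE)) {
  pos.vec <- c("chr10:213,054,000-213,055,000", "chrM:111,000")
  # quantifier(..., "?") groups its patterns and appends "?", so the
  # "-chromEnd" part of the subject may be absent.
  opt.dt <- nc::capture_first_vec(
    pos.vec,
    chrom = "chr.*?",
    ":",
    chromStart = "[0-9,]+",
    nc::quantifier("-", chromEnd = "[0-9,]+", "?"))
  print(opt.dt)
  # field("Alignment", " ", "[0-9]+") matches the text "Alignment 107"
  # and captures only the value, in a column named Alignment.
  field.dt <- nc::capture_first_vec(
    "Alignment 107",
    nc::field("Alignment", " ", "[0-9]+"))
  print(field.dt)
}
```

Writing the optional part with quantifier keeps the group and its "?" together in one call, instead of relying on a separate "?" string placed after an un-named list.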
Usage
capture_first_vec(...,
  nomatch.error = getOption("nc.nomatch.error", TRUE),
  engine = getOption("nc.engine", "PCRE"))
Arguments
...
  subject, name1=pattern1, fun1, etc. The first argument must be a
  character vector of length>0 (subject strings to parse with a
  regex). Arguments after the first specify the regex/conversion and
  must be string/list/function. All character strings are pasted
  together to obtain the final regex used for matching. Each
  string/list with a named argument in R becomes a capture group.
nomatch.error
  if TRUE (default), stop with an error if any subject does not match; otherwise, subjects that do not match are reported as missing/NA rows of the result.
engine
  character string, one of PCRE, ICU, RE2; the default is PCRE.
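The effect of nomatch.error can be sketched as follows. The subject strings and variable names are made up for illustration, and the nc package is assumed to be installed. The error is caught with tryCatch only so that both behaviors can be shown in one run.

```r
# Minimal sketch; subjects and variable names are illustrative only.
if (requireNamespace("nc", quietly = TRUE)) {
  subject.vec <- c("chr1:110", "this will not match")
  # Default nomatch.error=TRUE: the non-matching subject stops with an
  # error, which is caught here so its message can be inspected.
  err.msg <- tryCatch({
    nc::capture_first_vec(
      subject.vec, chrom = "chr.*?", ":", chromStart = "[0-9,]+")
    NA_character_ # only reached if no error was raised.
  }, error = function(e) conditionMessage(e))
  print(err.msg)
  # nomatch.error=FALSE: the non-matching subject becomes an NA row.
  na.dt <- nc::capture_first_vec(
    subject.vec, chrom = "chr.*?", ":", chromStart = "[0-9,]+",
    nomatch.error = FALSE)
  print(na.dt)
  # The default is read from the nc.nomatch.error option (see Usage),
  # so it can also be changed for several calls at once.
  old.opt <- options(nc.nomatch.error = FALSE)
  opt.na.dt <- nc::capture_first_vec(
    subject.vec, chrom = "chr.*?", ":", chromStart = "[0-9,]+")
  options(old.opt) # restore the previous setting.
}
```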
Value
data.table with one row for each subject, and one column for each
capture group.
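To make the shape of the result concrete, here is a rough base-R sketch of the idea: named patterns are wrapped in parentheses, all pieces are pasted into one regex, and the first match yields one row per subject and one column per capture group. This is an illustration under simplifying assumptions (no conversion functions, no nested lists, every subject matches) and is not the nc implementation; all variable names are hypothetical.

```r
# Hypothetical base-R illustration, not the nc implementation.
subject.vec <- c("chr10:213,054,000", "chrM:111,000")
pattern.list <- list(chrom = "chr.*?", ":", chromStart = "[0-9,]+")
arg.names <- names(pattern.list)
pattern.vec <- unlist(pattern.list, use.names = FALSE)
# Named patterns get parentheses (capture groups); un-named ones do not.
piece.vec <- ifelse(
  arg.names == "", pattern.vec, paste0("(", pattern.vec, ")"))
regex <- paste(piece.vec, collapse = "")
print(regex)
# regexec gives the full first match followed by one entry per group.
match.list <- regmatches(
  subject.vec, regexec(regex, subject.vec, perl = TRUE))
# Drop the full match, keeping one column per capture group.
match.mat <- do.call(rbind, match.list)[, -1, drop = FALSE]
colnames(match.mat) <- arg.names[arg.names != ""]
print(match.mat)
```

The matrix has one row per subject and takes its column names from the argument names, which mirrors the layout described for the returned data.table.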
Author(s)
Toby Hocking <toby.hocking@r-project.org> [aut, cre]
Examples
chr.pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222") # two possible matches.
## Find the first match in each element of the subject character
## vector. Named argument values are used to create capture groups
## in the generated regex, and argument names become column names in
## the result.
(dt.chr.cols <- nc::capture_first_vec(
  chr.pos.vec,
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+"))
## Even when no type conversion functions are specified, the result
## is always a data.table:
str(dt.chr.cols)
## Conversion functions are used to convert the previously named
## group, and patterns may be saved in lists for re-use.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
int.pattern <- list("[0-9,]+", keep.digits)
range.pattern <- list(
  chrom="chr.*?",
  ":",
  chromStart=int.pattern,
  list( # un-named list becomes non-capturing group.
    "-",
    chromEnd=int.pattern
  ), "?") # chromEnd is optional.
(dt.int.cols <- nc::capture_first_vec(
  chr.pos.vec, range.pattern))
## Conversion functions used to create non-char columns.
str(dt.int.cols)
## NA used to indicate no match or missing subject.
na.vec <- c(
  "this will not match",
  NA, # neither will this.
  chr.pos.vec)
nc::capture_first_vec(na.vec, range.pattern, nomatch.error=FALSE)