capture_first_df {nc} | R Documentation |
Capture first match in columns of a data frame
Description
Capture first matching text from one or more character columns of a data frame, using a different regular expression for each column.
Usage
capture_first_df(...,
  nomatch.error = getOption("nc.nomatch.error", TRUE),
  existing.error = getOption("nc.existing.error", TRUE),
  engine = getOption("nc.engine", "PCRE"))
Arguments
... |
subject data frame, colName1=list(groupName1=pattern1, fun1, etc),
colName2=list(etc), etc. First argument must be a data frame with
one or more character columns of subjects for matching. If the
first argument is a data table then it will be modified using
|
nomatch.error |
if TRUE (default), stop with an error if any subject does not match; otherwise subjects that do not match are reported as missing/NA rows of the result. |
existing.error |
if TRUE (the default, to avoid data loss), stop with an error if any capture group has the same name as an existing column of subject. |
engine |
character string, one of PCRE, ICU, RE2. This |
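The following is a minimal sketch of how nomatch.error and existing.error behave, assuming the nc package is installed. The data frame jobs.df and its column names are made up for illustration and are not part of the package.

```r
## Hypothetical subject: only the first JobID contains a "-" range end.
jobs.df <- data.frame(
  JobID = c("14022192_[1-3]", "13937810_25"),
  task = c("a", "b"),
  stringsAsFactors = FALSE)
end.pattern <- list("-", task.end = "[0-9]+", as.integer)

## Default nomatch.error=TRUE: the second subject does not match, so
## the call stops with an error, captured here by tryCatch.
strict <- tryCatch(
  nc::capture_first_df(jobs.df, JobID = end.pattern),
  error = function(e) e)
inherits(strict, "error")

## nomatch.error=FALSE: subjects that do not match are reported as NA.
lenient <- nc::capture_first_df(
  jobs.df, JobID = end.pattern, nomatch.error = FALSE)
lenient$task.end

## Default existing.error=TRUE: a group named "task" would clash with
## the existing "task" column, so the call stops with an error.
clash <- tryCatch(
  nc::capture_first_df(jobs.df, JobID = list("_", task = ".*")),
  error = function(e) e)
inherits(clash, "error")
```

Wrapping the calls in tryCatch is only a convenience for inspecting the errors in a script; in interactive use the error message is printed directly.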
Value
A data.table with the same number of rows as subject, plus one additional
column for each named capture group
specified in ...
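As a small illustration of the return value, the sketch below assumes the nc and data.table packages are installed. The data frame subject.df and the group names chrom and pos are hypothetical.

```r
## Hypothetical subject data frame with one character column.
subject.df <- data.frame(
  chrom.pos = c("chr10:213054000", "chrM:111000"),
  stringsAsFactors = FALSE)
result <- nc::capture_first_df(
  subject.df,
  chrom.pos = list(
    chrom = "chr.*?",             # named group, kept as character
    ":",                          # unnamed pattern, not captured
    pos = "[0-9]+", as.integer))  # named group, converted to integer

## The result is a data.table with the original column plus one
## column per named capture group.
data.table::is.data.table(result)
names(result)
nrow(result) == nrow(subject.df)
```

The conversion function (here as.integer) follows the pattern of the group it applies to, so pos comes back as an integer column while chrom stays character.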
Author(s)
Toby Hocking <toby.hocking@r-project.org> [aut, cre]
Examples
## The JobID column can be matched with a complicated regular
## expression, which we will build up from small sub-pattern list
## variables that are easy to understand independently.
(sacct.df <- data.frame(
  JobID = c(
    "13937810_25", "13937810_25.batch",
    "13937810_25.extern", "14022192_[1-3]", "14022204_[4]"),
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  stringsAsFactors=FALSE))
## Just match the end of the range.
int.pattern <- list("[0-9]+", as.integer)
end.pattern <- list(
  "-",
  task.end=int.pattern)
nc::capture_first_df(sacct.df, JobID=list(
  end.pattern, nomatch.error=FALSE))
## Match the whole range inside square brackets.
range.pattern <- list(
  "[[]",
  task.start=int.pattern,
  end.pattern, "?", # end is optional.
  "[]]")
nc::capture_first_df(sacct.df, JobID=list(
  range.pattern, nomatch.error=FALSE))
## Match either a single task ID or a range, after an underscore.
task.pattern <- list(
  "_",
  list(
    task.id=int.pattern,
    "|", # either one task (above) or range (below)
    range.pattern))
nc::capture_first_df(sacct.df, JobID=task.pattern)
## Match type suffix alone.
type.pattern <- list(
  "[.]",
  type=".*")
nc::capture_first_df(sacct.df, JobID=list(
  type.pattern, nomatch.error=FALSE))
## Match task and optional type suffix.
task.type.pattern <- list(
  task.pattern,
  type.pattern, "?")
nc::capture_first_df(sacct.df, JobID=task.type.pattern)
## Match full JobID and Elapsed columns.
nc::capture_first_df(
  sacct.df,
  JobID=list(
    job=int.pattern,
    task.type.pattern),
  Elapsed=list(
    hours=int.pattern,
    ":",
    minutes=int.pattern,
    ":",
    seconds=int.pattern))
## If the input is a data table then it is modified in place for
## memory efficiency, to avoid copying the entire table.
sacct.DT <- data.table::as.data.table(sacct.df)
nc::capture_first_df(sacct.df, JobID=task.pattern)
sacct.df # not modified.
nc::capture_first_df(sacct.DT, JobID=task.pattern)
sacct.DT # modified!
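When the in-place modification shown above is not wanted, one possible workaround is to pass an explicit copy of the data table. The sketch below assumes the nc and data.table packages are installed. The table jobs.DT is a hypothetical stand-in for sacct.DT, and using data.table::copy here is a suggestion, not something prescribed by the package.

```r
## Hypothetical small data table, similar in shape to sacct.DT.
jobs.DT <- data.table::data.table(
  JobID = c("13937810_25", "14022204_7"))

## Pass a copy so that jobs.DT itself is left untouched.
result.DT <- nc::capture_first_df(
  data.table::copy(jobs.DT),
  JobID = list("_", task.id = "[0-9]+", as.integer))

names(jobs.DT)   # original columns only
names(result.DT) # also includes task.id
```

The copy costs extra memory, which is exactly what the in-place behaviour is designed to avoid, so this is mainly useful for small tables or when the original must be preserved.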