R: Capture first match in columns of a data frame

capture_first_df {nc}

R Documentation

Capture first match in columns of a data frame

Description

Capture first matching text from one or more character columns of a data frame, using a different regular expression for each column.

Usage

capture_first_df(...,
    nomatch.error = getOption("nc.nomatch.error", TRUE),
    existing.error = getOption("nc.existing.error", TRUE),
    engine = getOption("nc.engine", "PCRE"))

Arguments

`...`	subject data frame, followed by one named argument per column to match: colName1=list(groupName1=pattern1, fun1, etc), colName2=list(etc), etc. The first argument must be a data frame with one or more character columns of subjects for matching. If the first argument is a data table, it is modified in place using `set` (for memory efficiency, to avoid copying the whole data table); otherwise the input data frame is copied to a new data table. Every other argument must be named after a column of the subject data frame, e.g. colName1, colName2, and its value must be a list that specifies the regex/conversion to use (in string/function/list format as documented in `capture_first_vec`, which is used on each named column), and possibly a column-specific `engine` or `nomatch.error` (the latter is used in the Examples).
`nomatch.error`	if TRUE (default), stop with an error if any subject does not match; otherwise subjects that do not match are reported as missing/NA rows of the result.
`existing.error`	if TRUE (default to avoid data loss), stop with an error if any capture groups have the same name as an existing column of subject.
`engine`	character string, one of "PCRE", "ICU", "RE2". This `engine` is used for every column, unless another `engine` is specified for that column in `...`.
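
The effect of `existing.error` can be illustrated with a minimal sketch. The data frame `jobs.df` and its `job` column are hypothetical, invented for this illustration. Because the capture group `job` has the same name as an existing column, and `existing.error` defaults to TRUE, the call is expected to stop with an error, which is caught here with `tryCatch`.

```r
## Hypothetical subject with a column that is already named "job".
jobs.df <- data.frame(
  JobID = c("13937810_25", "14022192_7"),
  job = c("old1", "old2"),
  stringsAsFactors = FALSE)

## The capture group "job" clashes with the existing column, so with
## the default existing.error=TRUE an error is expected; tryCatch
## returns the condition object instead of stopping the script.
result <- tryCatch(
  nc::capture_first_df(
    jobs.df,
    JobID = list(job = "[0-9]+", as.integer)),
  error = function(e) e)
inherits(result, "error")
```

Renaming the capture group (for example to `job.id`, also a hypothetical name) avoids the clash without changing `existing.error`.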

Value

A data.table with the same number of rows as the subject, plus one additional column for each named capture group specified in `...`.
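
As a rough sketch of the shape of the return value, the following uses a hypothetical two-row data frame, `tasks.df`, and a pattern that matches only one of the subjects. It assumes `nomatch.error=FALSE`, so the subject that does not match is kept as a row with a missing value rather than causing an error.

```r
## Hypothetical subject: only the second JobID contains a range.
tasks.df <- data.frame(
  JobID = c("13937810_25", "14022192_[1-3]"),
  stringsAsFactors = FALSE)

## Only the second subject contains "-" followed by digits, so
## nomatch.error=FALSE is needed to avoid an error for the first.
tasks.dt <- nc::capture_first_df(
  tasks.df,
  JobID = list("-", task.end = "[0-9]+", as.integer),
  nomatch.error = FALSE)

nrow(tasks.dt) == nrow(tasks.df) # one row per subject
names(tasks.dt)                  # JobID plus the new task.end column
tasks.dt$task.end                # missing for the subject that did not match
```

The input here is a plain data frame, so `tasks.df` itself is left unchanged and the new column appears only in `tasks.dt`.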

Author(s)

Toby Hocking <toby.hocking@r-project.org> [aut, cre]

Examples


## The JobID column can be matched with a complicated regular
## expression that we will build up from small sub-pattern list
## variables that are easy to understand independently.
(sacct.df <- data.frame(
  JobID = c(
    "13937810_25", "13937810_25.batch",
    "13937810_25.extern", "14022192_[1-3]", "14022204_[4]"),
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  stringsAsFactors=FALSE))

## Just match the end of the range.
int.pattern <- list("[0-9]+", as.integer)
end.pattern <- list(
  "-",
  task.end=int.pattern)
nc::capture_first_df(sacct.df, JobID=list(
  end.pattern, nomatch.error=FALSE))

## Match the whole range inside square brackets.
range.pattern <- list(
  "[[]",
  task.start=int.pattern,
  end.pattern, "?", #end is optional.
  "[]]")
nc::capture_first_df(sacct.df, JobID=list(
  range.pattern, nomatch.error=FALSE))

## Match either a single task ID or a range, after an underscore.
task.pattern <- list(
  "_",
  list(
    task.id=int.pattern,
    "|", # either one task (above) or a range (below)
    range.pattern))
nc::capture_first_df(sacct.df, JobID=task.pattern)

## Match type suffix alone.
type.pattern <- list(
  "[.]",
  type=".*")
nc::capture_first_df(sacct.df, JobID=list(
  type.pattern, nomatch.error=FALSE))

## Match task and optional type suffix.
task.type.pattern <- list(
  task.pattern,
  type.pattern, "?")
nc::capture_first_df(sacct.df, JobID=task.type.pattern)

## Match full JobID and Elapsed columns.
nc::capture_first_df(
  sacct.df,
  JobID=list(
    job=int.pattern,
    task.type.pattern),
  Elapsed=list(
    hours=int.pattern,
    ":",
    minutes=int.pattern,
    ":",
    seconds=int.pattern))

## If the input is a data table, it is modified in place for memory
## efficiency, to avoid copying the entire table.
sacct.DT <- data.table::as.data.table(sacct.df)
nc::capture_first_df(sacct.df, JobID=task.pattern)
sacct.df #not modified.
nc::capture_first_df(sacct.DT, JobID=task.pattern)
sacct.DT #modified!

[Package nc version 2024.2.21 Index]