strapply {gsubfn} | R Documentation |
Apply a function over a string or strings.
Description
Similar to gsubfn except that, instead of performing substitutions,
it returns the output of FUN.
Usage
strapply(X, pattern, FUN = function(x, ...) x, backref, ..., empty,
         ignore.case = FALSE, perl = FALSE, engine,
         simplify = FALSE, USE.NAMES, combine = c)
strapplyc(X, pattern, backref, ignore.case = FALSE, simplify = FALSE,
          USE.NAMES, engine)
Arguments
X |
list or (atomic) vector of character strings to be used. |
pattern |
character string containing a regular expression (or
character string for |
FUN |
a function, formula, character string, list or proto object
to be applied to each element of X. |
backref |
See gsubfn. |
empty |
If there is no match to a string return this value. |
ignore.case |
If |
perl |
If |
engine |
This argument defaults to |
... |
optional arguments to |
simplify |
logical or function. If logical, should the result be
simplified to a vector or matrix, as in |
USE.NAMES |
logical; if |
combine |
combine is a function applied to the components of
the result of FUN. |
Details
If FUN is a function then, for each character string in X, the pattern
is matched repeatedly. Each match, along with its back references, if any,
is passed to FUN, and the outputs of FUN are returned as a list.
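As a minimal sketch of the paragraph above, the following applies an ordinary function to every match. The input strings and the doubling function are made up for illustration; the pattern has no parentheses, so each match reaches the function as a single argument.

```r
library(gsubfn)

# Two illustrative input strings (not taken from any real data set)
x <- c("12;34:56", "7,8")

# Each run of digits is passed to the function as a character string;
# the function converts it to a number and doubles it.
doubled <- strapply(x, "[0-9]+", function(m) as.numeric(m) * 2)

# One list component per element of x
doubled[[1]]
doubled[[2]]
```

The result is a list with one component per input string, each holding the combined outputs of the function for that string.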
If FUN is a formula or proto object then it is interpreted
in the way discussed in gsubfn.
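A short sketch of the formula form, assuming the formula conventions described on the gsubfn help page (the free variable on the right-hand side, here x, becomes the argument that receives each match):

```r
library(gsubfn)

# The formula below is shorthand for function(x) as.numeric(x)
nums <- strapply("12;34:56", "[0-9]+", ~ as.numeric(x))

# The numeric values extracted from the single input string
nums[[1]]
```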
If FUN is a proto object or if perl=TRUE is specified
then engine="R" is used and the engine argument is ignored.
If backref is not specified and engine="R" is specified or implied
then a heuristic is used to calculate the number of backreferences.
The primary situation that can fool it is a pattern containing
parentheses that are not back references.
In those cases the user will have to specify backref.
If engine="tcl" then an exact algorithm is used and the problem
never occurs.
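The following sketch illustrates the situation described above. The pattern is made up for illustration: it contains two opening parentheses but only one capturing group, because (?: ) is a non-capturing group in perl regular expressions. Passing backref = -1 states explicitly that one backreference, and not the match itself, should be handed to the function.

```r
library(gsubfn)

# "(?:[a-z])" is non-capturing; only "([0-9])" is a backreference.
# perl = TRUE implies engine = "R", so the count is given explicitly
# rather than left to the heuristic.
digits <- strapply("a1 b2", "(?:[a-z])([0-9])", c,
                   backref = -1, perl = TRUE)

digits[[1]]
```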
strapplyc is like strapply but specialized to FUN=c for speed.
If the "tcl" engine is not available then it calls strapply
and there will be no speed advantage.
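A minimal sketch of strapplyc, using made-up input strings. No function is supplied; the matches are returned as character vectors, one per input string.

```r
library(gsubfn)

# Illustrative input strings
s <- c("the brown fox", "the eager beaver")

# Extract every word from each string
words <- strapplyc(s, "\\w+")

words[[1]]
words[[2]]
```

Whether this is faster than the equivalent strapply call depends on the "tcl" engine being available, as noted above.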
Value
A list of character strings.
See Also
See gsubfn.
For regular expression syntax used in tcl see
http://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm
and for regular expression syntax used in R see the help page for regex.
Examples
strapply("12;34:56,89,,12", "[0-9]+")
# separate leading digits from rest of string
# creating a 2 column matrix: digits, rest
s <- c("123abc", "12cd34", "1e23")
t(strapply(s, "^([[:digit:]]+)(.*)", c, simplify = TRUE))
# same but create matrix
strapply(s, "^([[:digit:]]+)(.*)", c, simplify = rbind)
# running window of 5 characters using a zero-width lookahead perl regexp
# Note that the three ( in the regexp will fool the heuristic into thinking
# there are three backreferences so specify backref explicitly.
x <- "abcdefghijkl"
strapply(x, "(.)(?=(....))", paste0, backref = -2, perl = TRUE)[[1]]
# Note difference. First gives character vector. Second is the same.
# Third has same elements but is a list.
# Fourth gives list of two character vectors. Fifth is the same.
strapply("a:b c:d", "(.):(.)", c)[[1]]
strapply("a:b c:d", "(.):(.)", list, simplify = unlist) # same
strapply("a:b c:d", "(.):(.)", list)[[1]]
strapply("a:b c:d", "(.):(.)", c, combine = list)[[1]]
strapply("a:b c:d", "(.):(.)", c, combine = list, simplify = c) # same
# find second CPU_SPEED value given lines of config file
Lines <- c("DEVICE = 'PC'", "CPU_SPEED = '1999', '233'")
parms <- strapply(Lines, "[^ ',=]+", c, USE.NAMES = TRUE,
                  simplify = ~ lapply(list(...), "[", -1))
parms$CPU_SPEED[2]
# return first two words in each string
p <- proto(fun = function(this, x) if (count <= 2) x)
strapply(c("the brown fox", "the eager beaver"), "\\w+", p)
## Not run:
# convert to chron
library(chron)
x <- c("01/15/2005 23:32:45", "02/27/2005 01:22:30")
x.chron <- strapply(x, "(../../....) (..:..:..)", chron, simplify = c)
# time parsing of all 275,546 words from James Joyce's Ulysses
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joycec <- paste(joyce, collapse = " ")
system.time(s <- strapplyc(joycec, "\\w+")[[1]])
length(s) # 275546
## End(Not run)