R: Split, Map, Filter, and Reduce a string vector

split_map_filter_reduce {libbib}

R Documentation

Split, Map, Filter, and Reduce a string vector

Description

This function takes a vector of strings, splits those strings on a particular character, string, or regex pattern, applies a user-specified function to each sub-element of the now-split element, filters those sub-elements using a user-specified function, and, finally, recombines each element's sub-elements using a user-specified reduction function.
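
The pipeline described above can be sketched in base R. This is a minimal illustration and not libbib's actual implementation: `smfr_sketch` and its defaults are hypothetical names chosen here, and taking the first element stands in for `car`.

```r
# Hypothetical base-R sketch of the split/map/filter/reduce pipeline.
# smfr_sketch is an illustrative name, not part of libbib.
smfr_sketch <- function(x, sep = ";", fixed = TRUE,
                        mapfun = identity, filterfun = identity,
                        reduxfun = function(v) v[1]) {
  # Split each element into its sub-elements
  # (use a Perl-compatible regex only when fixed = FALSE)
  pieces <- strsplit(x, sep, fixed = fixed, perl = !fixed)
  # Map, then filter, then reduce each element's sub-elements
  vapply(pieces, function(sub) {
    as.character(reduxfun(filterfun(mapfun(sub))))
  }, character(1), USE.NAMES = FALSE)
}

records <- c("b;a;b", "d;c")

# Defaults: keep only the first sub-element of each record
first_only <- smfr_sketch(records)

# Upper-case every sub-element, drop duplicates, re-join with ";"
upper_unique <- smfr_sketch(records,
                            mapfun = toupper,
                            filterfun = unique,
                            reduxfun = function(v) paste(v, collapse = ";"))
```

Each stage receives the whole vector of sub-elements for one record, which is why the map, filter, and reduce functions are all expected to accept a vector.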

Usage

split_map_filter_reduce(
  x,
  sep = ";",
  fixed = TRUE,
  mapfun = identity,
  filterfun = identity,
  reduxfun = car,
  cl = 0
)

Arguments

`x`	A vector of strings
`sep`	A character, string, or regular expression pattern to split each element by. If `fixed=TRUE`, the separator will be used exactly; if not, a Perl-compatible regular expression can be used (default is ";")
`fixed`	Should it be split by a fixed string/character or a regular expression (default is `TRUE`)
`mapfun`	A vectorized function that will be applied to the sub-elements (after splitting) of each element in x (default is `identity` which would leave the sub-elements unchanged)
`filterfun`	A vectorized function that, when given a vector, returns the same vector with unwanted elements removed (default is `identity` which would not remove any sub-elements)
`reduxfun`	A vectorized function that, when given a vector, will combine all of its elements into one value (default is `car`, which would return the first element only)
`cl`	An integer indicating the number of child processes that should be used to parallelize the workload. If 0, the workload will not be parallelized. Can also take a cluster object created by 'makeCluster' (default is 0)
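
As a rough illustration of the shapes `filterfun` and `reduxfun` are expected to have, the sketch below defines two hypothetical helpers in base R. `drop_short` and `paste_with` are names invented here and are not exported by libbib; the closure only mirrors the general pattern of building a reduction function from a separator.

```r
# Hypothetical helpers; names are illustrative and not exported by libbib.

# A filter function: vector in, the same vector minus unwanted entries out.
# Here "unwanted" means NA or shorter than 10 characters.
drop_short <- function(v) v[!is.na(v) & nchar(v) >= 10]

# A closure that builds a reduction function: vector in, one value out
paste_with <- function(sep = ";") {
  function(v) paste(v, collapse = sep)
}

# Sub-elements of a single record, as they would look after splitting
subelements <- c("9782711875177", "garbage", NA, "2711875172")

kept <- drop_short(subelements)
combined <- paste_with(" ")(kept)
```

Functions of this shape could then be supplied as the `filterfun` and `reduxfun` arguments.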

Details

Since this operation cannot be vectorized, if the user specifies a non-zero cl argument, the workload will be parallelized and cl many child processes will be spawned to do the work. The package pbapply will be used to do this.
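
The general idea of spreading per-record work over child processes can be sketched with the base `parallel` package. This is only an assumption-laden illustration: it does not call libbib or pbapply, and `first_piece` is a hypothetical helper. It shows how a cluster object is created with `makeCluster`, which is the kind of object the `cl` argument is documented to accept.

```r
library(parallel)

# Hypothetical per-record worker: split on ";" and keep the first piece
first_piece <- function(rec) strsplit(rec, ";", fixed = TRUE)[[1]][1]

records <- c("a;b", "c;d", "e;f")

# Start two child processes; this cluster object is what `cl` could be given
cl <- makeCluster(2)

# Each record is handled independently, so the work distributes cleanly
res <- unlist(parLapply(cl, records, first_piece))

# Always release the child processes when done
stopCluster(cl)
```

Spawning processes has a fixed cost, so parallelizing is likely to pay off only for large vectors or expensive map functions.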

See the examples for more information and ideas on why this might be useful for, as an example, batch-normalizing ISBNs that, for each bibliographic record, are separated by semicolons.

Value

Returns a vector

Examples


someisbns <- c("9782711875177;garbage-isbn;2711875172;2844268900",
               "1861897952; 978-1-86189-795-4")

# will return only the first ISBN for each record
split_map_filter_reduce(someisbns)
# "9782711875177" "1861897952"

# will return only the first ISBN for each record, after normalizing
# each ISBN
split_map_filter_reduce(someisbns, mapfun=function(x){normalize_isbn(x, convert.to.isbn.13=TRUE)})
# "9782711875177" "9781861897954"

# will return all ISBNs, for each record, separated by a semicolon
# after applying normalize_isbn to each ISBN
# note the duplicates introduced after normalization occurs
split_map_filter_reduce(someisbns, mapfun=function(x){normalize_isbn(x, convert.to.isbn.13=TRUE)},
                        reduxfun=recombine_with_sep_closure())
# "9782711875177;NA;9782711875177;9782844268907" "9781861897954;9781861897954"

# After splitting each item's ISBN list by semicolon, this runs
# normalize_isbn on each of them. Duplicates are produced when
# an ISBN 10 converts to an ISBN 13 that is already in the ISBN
# list for the item. NAs are produced when an ISBN fails to normalize.
# Then, all duplicates and NAs are removed. Finally, the remaining
# ISBNs, for each record, are pasted together using a space as a separator
split_map_filter_reduce(someisbns, mapfun=function(x){normalize_isbn(x, convert.to.isbn.13=TRUE)},
                        filterfun=remove_duplicates_and_nas,
                        reduxfun=recombine_with_sep_closure(" "))
# "9782711875177 9782844268907" "9781861897954"

[Package libbib version 1.6.4 Index]