strsplit.data.frame {udpipe}

R Documentation

Obtain a tokenised data frame by splitting text on a regular expression

Description

Obtain a tokenised data frame by splitting text on a regular expression. This is the inverse operation of `paste.data.frame`.
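
As a rough illustration of that round trip, the sketch below collapses a small data frame with `paste.data.frame` and splits it again. The data frame `tokens` and its columns `doc_id` and `token` are made up for this example. It assumes the udpipe package is installed and that no token contains a space, since splitting on a single space would otherwise not recover the original rows.

```r
library(udpipe)

# Hypothetical toy data: one row per token, grouped by document.
tokens <- data.frame(
  doc_id = c("a", "a", "a", "b", "b"),
  token  = c("very", "nice", "flat", "quiet", "street"),
  stringsAsFactors = FALSE
)

# Collapse the tokens of each document into one space-separated string.
pasted <- paste.data.frame(tokens, term = "token", group = "doc_id",
                           collapse = " ")

# Split on the same separator to get back one row per token.
roundtrip <- strsplit.data.frame(pasted, term = "token", group = "doc_id",
                                 split = " ", fixed = TRUE)
roundtrip
```

The round trip only holds when the separator used for collapsing is the one used for splitting. With the default `split`, punctuation and digits would also be treated as separators and dropped.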

Usage

strsplit.data.frame(
  data,
  term,
  group,
  split = "[[:space:][:punct:][:digit:]]+",
  ...
)

Arguments

`data`	a data.frame or data.table
`term`	a character string with the name of the column in `data` that you want to split into tokens
`group`	a character vector with one or more column names from `data` which identify the groups. The text in `term` is split into tokens within each group.
`split`	a regular expression indicating how to split the `term` column. Defaults to splitting on runs of whitespace, punctuation symbols or digits, so these characters do not appear in the resulting tokens. This is passed on to `strsplit`.
`...`	further arguments passed on to `strsplit`
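
To make the effect of `split` and `...` more concrete, the following sketch applies base R's `strsplit` directly to a single made-up sentence. It only illustrates the regular expression and the `fixed = TRUE` option, and does not call the udpipe function itself.

```r
# Made-up example sentence, not taken from the package data.
text <- "Great flat, 2 min from Grand-Place!"

# The default pattern: runs of whitespace, punctuation or digits are separators.
default_split <- "[[:space:][:punct:][:digit:]]+"
tokens_default <- strsplit(text, split = default_split)[[1]]
tokens_default
# "Great" "flat" "min" "from" "Grand" "Place"

# A literal single space as separator, as one would request via fixed = TRUE.
tokens_space <- strsplit(text, split = " ", fixed = TRUE)[[1]]
tokens_space
# "Great" "flat," "2" "min" "from" "Grand-Place!"
```

With the default pattern the digit `2` and all punctuation disappear, and the hyphenated name is broken in two. Splitting on a literal space keeps them attached to the neighbouring tokens.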

Value

A tokenised data frame containing one row per token.
This data frame has the columns given in `group` and `term`, where the text in the `term` column has been split into tokens by the regular expression provided in `split`.
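
The shape of the result can be sketched with base R alone. The code below is not the package's implementation, only a minimal approximation for a single grouping column, using a made-up data frame `reviews` with columns `id` and `feedback`.

```r
# Hypothetical input: one row per review.
reviews <- data.frame(
  id       = c(1L, 2L),
  feedback = c("Nice place", "Very clean, very quiet"),
  stringsAsFactors = FALSE
)

# Split each text into tokens; the result is a list with one vector per row.
tokens <- strsplit(reviews$feedback, split = "[[:space:][:punct:][:digit:]]+")

# Repeat each id once per token and flatten the tokens into one column.
tokenised <- data.frame(
  id       = rep(reviews$id, times = lengths(tokens)),
  feedback = unlist(tokens),
  stringsAsFactors = FALSE
)
tokenised
```

Here the two input rows become six output rows: two tokens for review 1 and four for review 2, each carrying the `id` of the review it came from.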

Examples

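data(brussels_reviews, package = "udpipe")

## Split the feedback text of each review using the default regular expression
x <- strsplit.data.frame(brussels_reviews, term = "feedback", group = "id")
head(x)

## Split by several grouping columns at once
x <- strsplit.data.frame(brussels_reviews, 
                         term = "feedback", 
                         group = c("listing_id", "language"))
head(x)

## Split on a literal space; fixed = TRUE is passed on to strsplit
x <- strsplit.data.frame(brussels_reviews, term = "feedback", group = "id", 
                         split = " ", fixed = TRUE)
head(x)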