weblmBreakIntoWords {mscsweblm4r} | R Documentation |
Breaks a string of concatenated words into individual words
Description
This function inserts spaces into a string of words lacking spaces, like a hashtag or part of a URL. Punctuation or exotic characters can prevent a string from being broken, so it's best to limit input strings to lowercase, alphanumeric characters. The input string must be in ASCII format.
Internally, this function invokes the Microsoft Cognitive Services Web Language Model REST API documented at https://www.microsoft.com/cognitive-services/en-us/web-language-model-api/documentation.
You MUST have a valid Microsoft Cognitive Services account and an API key for this function to work properly. See https://www.microsoft.com/cognitive-services/en-us/pricing for details.
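The input restrictions described above can be applied before calling the function. The following is a minimal sketch, assuming the text arrives as a single string; the helper name cleanForWordBreak is hypothetical and is not part of mscsweblm4r. It lowercases the input, drops everything except ASCII letters, digits and spaces, and trims leading and trailing spaces.

```r
# Hypothetical helper (not part of mscsweblm4r): normalise a string before
# passing it as textToBreak.
cleanForWordBreak <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  x <- tolower(x)
  # perl = TRUE keeps the character class ASCII-only, so accented or other
  # non-ASCII characters are removed along with punctuation
  x <- gsub("[^a-z0-9 ]", "", x, perl = TRUE)
  # Leading and trailing spaces would be trimmed by the API anyway
  trimws(x)
}

cleaned <- cleanForWordBreak("#TestForWordBreak!")
print(cleaned)
#> [1] "testforwordbreak"
```

Interior spaces are kept on purpose, since they are treated as hard breaks in textToBreak.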
Usage
weblmBreakIntoWords(textToBreak, modelToUse = "body", orderOfNgram = 5L,
maxNumOfCandidatesReturned = 5L)
Arguments
textToBreak |
(character) Line of text to break into words. If spaces are present, they will be interpreted as hard breaks and maintained, except for leading or trailing spaces, which will be trimmed. Must be in ASCII format. |
modelToUse |
(character) Which language model to use, supported values: "title", "anchor", "query", or "body" (optional, default: "body") |
orderOfNgram |
(integer) Which order of N-gram to use, supported values: 1L, 2L, 3L, 4L, or 5L (optional, default: 5L) |
maxNumOfCandidatesReturned |
(integer) Maximum number of candidates to return (optional, default: 5L) |
Value
An S3 object of the class weblm. The results are stored in the results dataframe inside this object. The dataframe contains the candidate breakdowns and their log(probability).
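As a rough illustration of working with the results dataframe, the sketch below ranks candidates and splits the most likely one into words. It does not call the API: the data frame is a hand-built stand-in, with column names and values copied from the example output on this page, and the variable names are illustrative only.

```r
# Stand-in for textWords$results, copied from the example output on this page
# (assumption: a real result has the same two columns)
results <- data.frame(
  words = c("test for word break", "test for wordbreak",
            "testfor word break", "test forword break",
            "testfor wordbreak"),
  probability = c(-13.83, -14.63, -15.94, -16.72, -17.41),
  stringsAsFactors = FALSE
)

# A higher (less negative) log-probability means a more likely breakdown
ranked <- results[order(results$probability, decreasing = TRUE), ]
best <- ranked$words[1]

# Split the best candidate into individual words
bestWords <- strsplit(best, " ", fixed = TRUE)[[1]]
print(bestWords)
```

Sorting explicitly avoids relying on the order in which candidates happen to be returned.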
Author(s)
Phil Ferriere pferriere@hotmail.com
Examples
## Not run:
tryCatch({
# Break a sentence into words
textWords <- weblmBreakIntoWords(
textToBreak = "testforwordbreak", # ASCII only
modelToUse = "body", # "title"|"anchor"|"query"|"body"(default)
orderOfNgram = 5L, # 1L|2L|3L|4L|5L(default)
maxNumOfCandidatesReturned = 5L # Default: 5L
)
# Class and structure of textWords
class(textWords)
#> [1] "weblm"
str(textWords, max.level = 1)
#> List of 3
#> $ results:'data.frame': 5 obs. of 2 variables:
#> $ json : chr "{"candidates":[{"words":"test for word break", __truncated__ }]}
#> $ request:List of 7
#> ..- attr(*, "class")= chr "request"
#> - attr(*, "class")= chr "weblm"
# Print results (pandoc.table() comes from the pander package)
pander::pandoc.table(textWords$results)
#> ---------------------------------
#> words probability
#> ------------------- -------------
#> test for word break -13.83
#>
#> test for wordbreak -14.63
#>
#> testfor word break -15.94
#>
#> test forword break -16.72
#>
#> testfor wordbreak -17.41
#> ---------------------------------
}, error = function(err) {
# Print the message of the error that was caught
message(conditionMessage(err))
})
## End(Not run)