weblmBreakIntoWords {mscsweblm4r}    R Documentation

Breaks a string of concatenated words into individual words

Description

This function inserts spaces into a string of words that lacks them, such as a hashtag or part of a URL. Punctuation or exotic characters can prevent a string from being broken, so it is best to limit input strings to lowercase alphanumeric characters. The input string must be in ASCII format.
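The input advice above can be sketched as a small pre-processing step. This is only an illustration: sanitizeForWordBreak is a hypothetical helper, not part of mscsweblm4r, and dropping every character outside a-z, 0-9 and space is an assumed policy that may be stricter than the service requires.

```r
# Hypothetical helper (not part of mscsweblm4r): normalise a string before
# passing it as textToBreak.
sanitizeForWordBreak <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  x <- tolower(x)                               # lowercase, as recommended
  # Assumed policy: drop everything except a-z, 0-9 and spaces.
  # perl = TRUE keeps the character ranges independent of the locale.
  x <- gsub("[^a-z0-9 ]", "", x, perl = TRUE)
  trimws(x)                                     # edge spaces are trimmed anyway
}

cleaned <- sanitizeForWordBreak("#TestForWordBreak!")
```

Interior spaces are kept on purpose, because the function treats them as hard breaks.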

Internally, this function invokes the Microsoft Cognitive Services Web Language Model REST API documented at https://www.microsoft.com/cognitive-services/en-us/web-language-model-api/documentation.

You MUST have a valid Microsoft Cognitive Services account and an API key for this function to work properly. See https://www.microsoft.com/cognitive-services/en-us/pricing for details.

Usage

weblmBreakIntoWords(textToBreak, modelToUse = "body", orderOfNgram = 5L,
  maxNumOfCandidatesReturned = 5L)

Arguments

textToBreak

(character) Line of text to break into words. If spaces are present, they will be interpreted as hard breaks and maintained, except for leading or trailing spaces, which will be trimmed. Must be in ASCII format.

modelToUse

(character) Which language model to use. Supported values: "title", "anchor", "query", or "body" (optional, default: "body")

orderOfNgram

(integer) Which order of N-gram to use. Supported values: 1L, 2L, 3L, 4L, or 5L (optional, default: 5L)

maxNumOfCandidatesReturned

(integer) Maximum number of candidates to return (optional, default: 5L)
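The argument constraints listed above can be checked locally before any request is sent. The sketch below is an assumption-laden illustration: checkBreakArgs is a hypothetical helper, the package may already validate its inputs differently, and the rule that maxNumOfCandidatesReturned must be a positive integer is assumed rather than documented here.

```r
# Hypothetical pre-flight check (not part of mscsweblm4r).
checkBreakArgs <- function(modelToUse = "body", orderOfNgram = 5L,
                           maxNumOfCandidatesReturned = 5L) {
  models <- c("title", "anchor", "query", "body")
  if (!is.character(modelToUse) || length(modelToUse) != 1L ||
      !(modelToUse %in% models)) {
    stop("modelToUse must be one of: ", paste(models, collapse = ", "))
  }
  if (!is.integer(orderOfNgram) || length(orderOfNgram) != 1L ||
      !(orderOfNgram %in% 1:5)) {
    stop("orderOfNgram must be an integer from 1L to 5L")
  }
  # Assumption: the candidate count must be a single positive integer.
  if (!is.integer(maxNumOfCandidatesReturned) ||
      length(maxNumOfCandidatesReturned) != 1L ||
      is.na(maxNumOfCandidatesReturned) ||
      maxNumOfCandidatesReturned < 1L) {
    stop("maxNumOfCandidatesReturned must be a positive integer")
  }
  invisible(TRUE)
}

ok <- checkBreakArgs(modelToUse = "query", orderOfNgram = 3L)
```

Note that a plain 5 is a double in R, so this check rejects it; the L suffix is what makes the value an integer.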

Value

An S3 object of class weblm. The results are stored in the results data frame inside this object. The data frame contains the candidate breakdowns and the log probability of each candidate.
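As a rough illustration of working with the returned data frame, the sketch below picks the most likely breakdown and splits it into words. It runs on a mock data frame copied from the sample output in the Examples section of this page, so no API call is made; the column names words and probability are taken from that output.

```r
# Mock of textWords$results, copied from the sample output on this page.
results <- data.frame(
  words = c("test for word break", "test for wordbreak",
            "testfor word break", "test forword break",
            "testfor wordbreak"),
  probability = c(-13.83, -14.63, -15.94, -16.72, -17.41),
  stringsAsFactors = FALSE
)

# The highest (least negative) log probability marks the most likely candidate.
best <- results[which.max(results$probability), ]

# Split the winning breakdown into individual words.
tokens <- strsplit(best$words, " ", fixed = TRUE)[[1]]
```

Using which.max rather than taking the first row avoids assuming that the candidates always arrive sorted.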

Author(s)

Phil Ferriere pferriere@hotmail.com

Examples

## Not run: 
 tryCatch({

   # Break a sentence into words
   textWords <- weblmBreakIntoWords(
     textToBreak = "testforwordbreak", # ASCII only
     modelToUse = "body",              # "title"|"anchor"|"query"|"body"(default)
     orderOfNgram = 5L,                # 1L|2L|3L|4L|5L(default)
     maxNumOfCandidatesReturned = 5L   # Default: 5L
   )

   # Class and structure of textWords
   class(textWords)
   #> [1] "weblm"

   str(textWords, max.level = 1)
   #> List of 3
   #>  $ results:'data.frame':  5 obs. of  2 variables:
   #>  $ json   : chr "{"candidates":[{"words":"test for word break", __truncated__ }]}
   #>  $ request:List of 7
   #>   ..- attr(*, "class")= chr "request"
   #>  - attr(*, "class")= chr "weblm"

   # Print results (requires the pander package)
   pander::pandoc.table(textWords$results)
   #> ---------------------------------
   #>       words          probability
   #> ------------------- -------------
   #> test for word break    -13.83
   #>
   #>  test for wordbreak    -14.63
   #>
   #>  testfor word break    -15.94
   #>
   #>  test forword break    -16.72
   #>
   #>   testfor wordbreak    -17.41
   #> ---------------------------------

 }, error = function(err) {

   # Print the message of the error that was caught
   message(conditionMessage(err))

 })

## End(Not run)

[Package mscsweblm4r version 0.1.2 Index]