grab.fragments {textreg} | R Documentation |
Grab all fragments in a corpus that contain a given phrase.
Description
Search a corpus for the passed phrase, which may use a simple wildcard notation. Return snippets of text containing this phrase, with a specified number of characters before and after, to give context for the phrase in each document.
Example usage: frags = grab.fragments( "israel", bigcorp )
Can take phrases such as 'appl+', which means any word starting with "appl". Can also take phrases such as "big * city", which matches any three-word phrase with "big" as the first word and "city" as the third word.
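As a rough illustration of the wildcard notation, the sketch below translates a phrase into a regular expression using base R only. The helper wildcard_to_regex is hypothetical and is not part of textreg; how grab.fragments handles wildcards internally is not shown in this page, so treat this as an assumption about the semantics described above.

```r
# Hypothetical helper, NOT part of textreg: turn the wildcard notation
# described above into a regular expression.
# Assumption: plain words contain no regex metacharacters.
wildcard_to_regex <- function(phrase) {
  words <- strsplit(phrase, " +")[[1]]
  parts <- vapply(words, function(w) {
    if (w == "*") {
      "\\w+"                                      # "*" stands for any single word
    } else if (endsWith(w, "+")) {
      paste0(substr(w, 1, nchar(w) - 1), "\\w*")  # "appl+": any word starting with "appl"
    } else {
      w                                           # plain word, used as is
    }
  }, character(1), USE.NAMES = FALSE)
  paste0("\\b", paste(parts, collapse = "\\s+"), "\\b")
}

text <- "she moved to a big new city to sell apples and appliances"

rx1 <- wildcard_to_regex("big * city")
m1 <- regmatches(text, gregexpr(rx1, text, perl = TRUE))[[1]]
m1   # "big new city"

rx2 <- wildcard_to_regex("appl+")
m2 <- regmatches(text, gregexpr(rx2, text, perl = TRUE))[[1]]
m2   # "apples" "appliances"
```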
If a pattern matches overlapping phrases, only the first match is returned; the overlapping second match is skipped.
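The overlap rule can be illustrated with ordinary non-overlapping regex matching in base R. The regular expression below is an assumed stand-in for the phrase "test *"; it is not taken from the textreg source.

```r
# In "test test word", both "test test" and "test word" fit the phrase "test *",
# but they share the middle word, so they overlap.
doc <- "987654321 test test word 123456789"
rx <- "\\btest\\s+\\w+\\b"   # assumed regex equivalent of the phrase "test *"

# gregexpr() scans left to right and resumes after each match,
# so the overlapping second candidate is never reported.
hits <- regmatches(doc, gregexpr(rx, doc, perl = TRUE))[[1]]
hits   # "test test"
```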
Usage
grab.fragments(phrase, corp, char.before = 80,
char.after = char.before, cap.phrase = TRUE, clean = FALSE)
Arguments
phrase | Phrase to find in the corpus. |
corp | A tm corpus. |
char.before | Number of characters of the document to pull before the phrase to give context. |
char.after | As above, but for trailing characters. Defaults to the value of char.before. |
cap.phrase | TRUE if the phrase should be put in ALL CAPS. FALSE if it should be left as is. |
clean | TRUE means drop all documents without the phrase from the list. FALSE means leave NULLs in the list. |
Value
Fragments in corp that contain the given phrase, returned as a list of lists. The outer list has length(corp) elements, one per document: NULL for documents without the phrase, and a list of fragments for documents that contain it.
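A minimal sketch of working with a result of this shape is given below. The frags object is built by hand as a stand-in, with illustrative values rather than real grab.fragments output, and the Filter() step only approximates what clean = TRUE is described as doing.

```r
# Hand-built stand-in for a result on a three-document corpus
# (illustrative values only, not real output).
frags <- list(
  list("... a TEST for ...", "... a TEST for me"),
  NULL,                      # document without the phrase
  list("TEST at start")
)

# Indices of the documents that contain the phrase
matched <- which(!vapply(frags, is.null, logical(1)))

# Number of fragments per document (0 for the NULL entries)
n_hits <- lengths(frags)

# Drop the NULL entries, roughly what clean = TRUE is described as doing
cleaned <- Filter(Negate(is.null), frags)

# Flatten everything into one character vector of fragments
all_frags <- unlist(cleaned)
```

Keeping the NULL entries (clean = FALSE) preserves the alignment between list positions and documents in corp, which is why matched can be used directly as document indices.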
Examples
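library( tm )
library( textreg )  # provides grab.fragments
# Seven short documents; the sixth does not contain the word "test"
docs = c( "987654321 test 123456789", "987654321 test test word 123456789",
"test at start", "a test b", "this is a test", "without the t-word",
"a test for you and a test for me" )
corpus <- VCorpus(VectorSource(docs))
# Find "test" followed by any word, with 4 characters of context on each side
grab.fragments( "test *", corpus, char.before=4, char.after=4 )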