Functions for Text Cleansing and Text Analysis

Documentation for package ‘textTools’ version 0.1.0

DESCRIPTION file.

Help Pages

as.text.table	Convert a data.table column of character vectors into a column with one row per word, grouped by a grouping column. Optionally splits a column of strings into vectors of constituents first.
flag_words	Flag rows in a text.table with specific words
label_parts_of_speech	Add a column with the parts of speech for each word in a text.table
l_pos	Parts of speech for English words from the Moby Project.
ngrams	Create n-grams
pos	Parts of speech for English words from the Moby Project.
regex_paragraph	Regular expression that might be used to split strings of text into component paragraphs.
regex_sentence	Regular expression that might be used to split strings of text into component sentences.
regex_word	Regular expression that might be used to split strings of text into component words.
rm_frequent_words	Delete rows in a text.table where the number of identical records within a group is more than a certain threshold
rm_infrequent_words	Delete rows in a text.table where the number of identical records within a group is less than a certain threshold
rm_long_words	Delete rows in a text.table where the word has more than a maximum number of characters
rm_no_overlap	Delete rows in a text.table where the records within a group are not also found in other groups (overlapping records)
rm_overlap	Delete rows in a text.table where the records within a group are also found in other groups (overlapping records)
rm_parts_of_speech	Delete rows in a text.table where the word has a certain part of speech
rm_regexp_match	Delete rows in a text.table where the record has a certain pattern indicated by a regular expression
rm_short_words	Delete rows in a text.table where the word has fewer than a minimum number of characters
rm_words	Remove rows from a text.table with specific words
sampleStr	Generate (pseudo)random strings of the specified character length
stopwords	Vector of lowercase English stop words.
str_any_match	Detect if there are any words in a vector also found in another vector.
str_counts	Create a list of a vector of unique words found in x and a vector of the counts of each word in x.
str_count_intersect	Count the intersecting words in a vector that are found in another vector (only counts unique words).
str_count_jaccard_similarity	Calculate the Jaccard similarity of two vectors of words: the size of their intersection divided by the size of their union.
str_count_match	Count the words in a vector that are found in another vector.
str_count_nomatch	Count the words in a vector that are not found in another vector.
str_count_positional_match	Count words from a vector that are found in the same position in another vector.
str_count_positional_nomatch	Count words from a vector that are not found in the same position in another vector.
str_count_setdiff	Count the words in a vector that don't intersect with another vector (only counts unique words).
str_dt_col_combine	Combine columns of a data.table into a list in a new column; wraps list(unlist(c(...)))
str_extract_match	Extract words from a vector that are found in another vector.
str_extract_nomatch	Extract words from a vector that are not found in another vector.
str_extract_positional_match	Extract words from a vector that are found in the same position in another vector.
str_extract_positional_nomatch	Extract words from a vector that are not found in the same position in another vector.
str_rm_blank_space	Remove and replace excess white space from strings.
str_rm_long_words	Remove words from a vector that have more than a maximum number of characters.
str_rm_non_alphanumeric	Remove and replace non-alphanumeric characters from strings.
str_rm_non_printable	Remove and replace non-printable characters from strings.
str_rm_numbers	Remove and replace numbers from strings.
str_rm_punctuation	Remove and replace punctuation from strings.
str_rm_regexp_match	Remove words from a vector that match a regular expression.
str_rm_short_words	Remove words from a vector that have fewer than a minimum number of characters.
str_rm_words	Remove words from a vector of words found in another vector of words.
str_rm_words_by_length	Remove words from a vector based on the number of characters in each word.
str_stopwords_by_part_of_speech	Create a vector of English words associated with particular parts of speech.
str_tolower	Calls base::tolower(), which converts letters to lowercase. Only included to point out that base::tolower exists and should be used directly.
str_weighted_count_match	Weighted count of the words in a vector that are found in another vector.
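
The word-comparison entries above (str_count_intersect, str_count_setdiff, str_count_jaccard_similarity, str_count_match, str_count_positional_match) describe a few distinct ways of comparing two vectors of words. The following is a minimal base-R sketch of those ideas, not the textTools implementations: every helper name here is hypothetical, the package's actual argument names and edge-case behavior are not shown in this index, and the choice to compare positions only over the shorter vector is an assumption.

```r
# Hypothetical base-R helpers illustrating the comparisons described in the
# index. These are NOT the textTools functions and may differ in edge cases.

x <- c("the", "quick", "brown", "fox", "the")
y <- c("the", "lazy", "brown", "dog")

# Number of unique words shared by both vectors (cf. str_count_intersect).
count_intersect <- function(x, y) length(intersect(x, y))

# Number of unique words in x that are absent from y (cf. str_count_setdiff).
count_setdiff <- function(x, y) length(setdiff(x, y))

# Intersection size divided by union size (cf. str_count_jaccard_similarity).
# Returning NA for two empty vectors is an assumption of this sketch.
jaccard <- function(x, y) {
  u <- union(x, y)
  if (length(u) == 0) return(NA_real_)
  length(intersect(x, y)) / length(u)
}

# Number of elements of x found anywhere in y, duplicates included
# (cf. str_count_match).
count_match <- function(x, y) sum(x %in% y)

# Number of elements that are equal at the same index. Assumption: only the
# first min(length(x), length(y)) positions are compared.
count_positional_match <- function(x, y) {
  n <- min(length(x), length(y))
  sum(x[seq_len(n)] == y[seq_len(n)])
}

count_intersect(x, y)         # 2: "the", "brown"
count_setdiff(x, y)           # 2: "quick", "fox"
jaccard(x, y)                 # 2 / 6
count_match(x, y)             # 3: "the" is counted twice
count_positional_match(x, y)  # 2: positions 1 and 3
```

The difference between count_intersect and count_match in this sketch mirrors the "only counts unique words" note in the index: the repeated "the" in x contributes once to the former and twice to the latter.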