Functions for Text Cleansing and Text Analysis



Documentation for package ‘textTools’ version 0.1.0

Help Pages

as.text.table Convert a data.table column of character vectors into a column with one row per word, grouped by a grouping column. Optionally splits a column of strings into vectors of constituents first.
flag_words Flag rows in a text.table with specific words
label_parts_of_speech Add a column with the parts of speech for each word in a text.table
l_pos Parts of speech for English words from the Moby Project.
ngrams Create n-grams
pos Parts of speech for English words from the Moby Project.
regex_paragraph Regular expression that might be used to split strings of text into component paragraphs.
regex_sentence Regular expression that might be used to split strings of text into component sentences.
regex_word Regular expression that might be used to split strings of text into component words.
rm_frequent_words Delete rows in a text.table where the number of identical records within a group is more than a certain threshold
rm_infrequent_words Delete rows in a text.table where the number of identical records within a group is less than a certain threshold
rm_long_words Delete rows in a text.table where the word has more than a maximum number of characters
rm_no_overlap Delete rows in a text.table where the records within a group are not also found in other groups (overlapping records)
rm_overlap Delete rows in a text.table where the records within a group are also found in other groups (overlapping records)
rm_parts_of_speech Delete rows in a text.table where the word has a certain part of speech
rm_regexp_match Delete rows in a text.table where the record has a certain pattern indicated by a regular expression
rm_short_words Delete rows in a text.table where the word has fewer than a minimum number of characters
rm_words Remove rows from a text.table with specific words
sampleStr Generate (pseudo)random strings of the specified character length
stopwords Vector of lowercase English stop words.
str_any_match Detect if there are any words in a vector also found in another vector.
str_counts Create a list containing a vector of the unique words found in x and a vector of the counts of each word in x.
str_count_intersect Count the intersecting words in a vector that are found in another vector (only counts unique words).
str_count_jaccard_similarity Calculate the Jaccard similarity of two vectors of words: the size of their intersection divided by the size of their union.
str_count_match Count the words in a vector that are found in another vector.
str_count_nomatch Count the words in a vector that are not found in another vector.
str_count_positional_match Count words from a vector that are found in the same position in another vector.
str_count_positional_nomatch Count words from a vector that are not found in the same position in another vector.
str_count_setdiff Count the words in a vector that don't intersect with another vector (only counts unique words).
str_dt_col_combine Combine columns of a data.table into a list in a new column, wraps list(unlist(c(...)))
str_extract_match Extract words from a vector that are found in another vector.
str_extract_nomatch Extract words from a vector that are not found in another vector.
str_extract_positional_match Extract words from a vector that are found in the same position in another vector.
str_extract_positional_nomatch Extract words from a vector that are not found in the same position in another vector.
str_rm_blank_space Remove and replace excess white space from strings.
str_rm_long_words Remove words from a vector that have more than a maximum number of characters.
str_rm_non_alphanumeric Remove and replace non-alphanumeric characters from strings.
str_rm_non_printable Remove and replace non-printable characters from strings.
str_rm_numbers Remove and replace numbers from strings.
str_rm_punctuation Remove and replace punctuation from strings.
str_rm_regexp_match Remove words from a vector that match a regular expression.
str_rm_short_words Remove words from a vector that don't have a minimum number of characters.
str_rm_words Remove words from a vector that are found in another vector of words.
str_rm_words_by_length Remove words from a vector based on the number of characters in each word.
str_stopwords_by_part_of_speech Create a vector of English words associated with particular parts of speech.
str_tolower Calls base::tolower(), which converts letters to lowercase. Only included to point out that base::tolower exists and should be used directly.
str_weighted_count_match Weighted count of the words in a vector that are found in another vector.
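
Several of the str_* entries above describe simple set and position comparisons between two vectors of words. The following is a minimal base-R sketch of those ideas only. The helper names (count_match, count_intersect, jaccard, positional_match) are hypothetical and written for illustration; they are not the textTools functions, and the package's actual arguments and edge-case behavior may differ.

```r
# Illustrative base-R helpers; hypothetical names, not the textTools API.
x <- c("the", "cat", "sat", "on", "the", "mat")
y <- c("the", "dog", "sat")

# Count every element of x that also appears in y (duplicates in x count).
count_match <- function(x, y) sum(x %in% y)

# Count the unique words shared by x and y.
count_intersect <- function(x, y) length(intersect(x, y))

# Jaccard similarity: size of the intersection over size of the union,
# both taken over unique words. Returns NA when both vectors are empty.
jaccard <- function(x, y) {
  u <- union(x, y)
  if (length(u) == 0) return(NA_real_)
  length(intersect(x, y)) / length(u)
}

# Keep the words of x that equal the word at the same position in y,
# comparing only up to the length of the shorter vector.
positional_match <- function(x, y) {
  n <- min(length(x), length(y))
  idx <- seq_len(n)
  x[idx][x[idx] == y[idx]]
}

count_match(x, y)       # 3: "the", "sat", "the"
count_intersect(x, y)   # 2: "the", "sat"
jaccard(x, y)           # 2 shared words out of 6 unique words
positional_match(x, y)  # "the" "sat"
```

The difference between the first two helpers mirrors the distinction the index draws between counting matches and counting intersecting words: the former counts repeated words each time they occur, while the latter counts each unique word once.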