sort_tf {chinese.misc}    R Documentation

Find High Frequency Terms

Description

Given a matrix, a document term matrix, or a term document matrix, this function sums the frequency of each term and outputs the top n terms. The result can be shown on the screen as a message, so that you can manually copy it elsewhere (e.g., Excel).

Usage

sort_tf(x, top = 10, type = "dtm", todf = FALSE, must_exact = FALSE)

Arguments

x

a matrix, or an object created by corp_or_dtm, tm::DocumentTermMatrix, or tm::TermDocumentMatrix. Data frames are not allowed. If x is a matrix, its column names (if type is "dtm") or row names (if type is "tdm") are taken to be the terms; see type below. If the names are NULL, terms are automatically set to "term1", "term2", "term3", and so on.

top

a length 1 integer. Terms are sorted in decreasing order of term frequency, and this argument decides how many of the top terms are returned. The default is 10. If the number of terms is smaller than top, all terms are returned. Sometimes more than top terms are returned; see Details.

type

a character string starting with "D" or "d" for a document term matrix, or with "T" or "t" for a term document matrix. It is only used when x is a matrix. The default is "dtm".

todf

should be TRUE or FALSE. If it is FALSE (default), terms and their frequencies are pasted together with "&" and shown on the screen as a message, and nothing is returned. Otherwise, terms and frequencies are returned as a data frame.

must_exact

should be TRUE or FALSE (default). It decides whether the number of returned terms must be exactly equal to top. See Details.
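
The way a plain matrix is interpreted can be illustrated with a minimal base R sketch. This is not the package's actual implementation. It assumes that term frequencies are plain column sums (for "dtm") or row sums (for "tdm"), and the matrix values are made up for illustration.

```r
# Hypothetical 2-document, 3-term matrix without dimnames
m <- matrix(c(1, 0, 2,
              0, 3, 1), nrow = 2, byrow = TRUE)
type <- "dtm"

# Assumption: a type starting with "D"/"d" means terms are in columns,
# anything else (here "T"/"t") means terms are in rows
freq <- if (grepl("^[Dd]", type)) colSums(m) else rowSums(m)

# When no term names exist, fall back to "term1", "term2", ...
if (is.null(names(freq))) {
  names(freq) <- paste0("term", seq_along(freq))
}
freq
```

Passing the transposed matrix together with a type starting with "t" would give the same sums under these assumptions.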

Details

Sometimes more terms are picked than top specifies. For example, suppose you ask for the top 5 terms and the frequency of the 5th term is 20, but two more terms also have a frequency of 20. In that case sort_tf may return 7 terms. If you want exactly 5 terms, set must_exact to TRUE.
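
The tie behaviour described here can be reproduced with a small base R sketch. It only mirrors the logic of the example in the text and is not the code sort_tf itself runs. The term names and frequencies are hypothetical, and which of the tied terms survive when must_exact is TRUE is an assumption of this sketch.

```r
# Hypothetical term frequencies: three terms share the 5th-place value 20
freq <- c(apple = 30, pear = 25, plum = 22, fig = 21,
          kiwi = 20, lime = 20, date = 20, nut = 5)
top <- 5

sorted <- sort(freq, decreasing = TRUE)

# Like must_exact = FALSE: keep every term tied with the 5th one
cutoff <- sorted[top]
with_ties <- sorted[sorted >= cutoff]

# Like must_exact = TRUE: cut the sorted list at exactly 5 terms
exact <- head(sorted, top)

length(with_ties)
length(exact)
```

Here with_ties holds 7 terms and exact holds 5, matching the numbers in the example.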

Value

If todf is FALSE, nothing is returned and the result is shown as a message. If todf is TRUE, a data frame is returned.
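
As a rough illustration of the two output forms, the base R sketch below builds both from a named frequency vector. It is an approximation only. The one-pair-per-line layout of the message and the column names term and freq are assumptions of this sketch, not the documented output of sort_tf.

```r
# Hypothetical sorted term frequencies
freq <- c(drink = 5, cup = 2, coffee = 2)

# Message form: each term pasted to its frequency with "&"
# (one pair per line is an assumption of this sketch)
pairs <- paste(names(freq), freq, sep = "&")
message(paste(pairs, collapse = "\n"))

# Data frame form: column names here are hypothetical
df <- data.frame(term = names(freq), freq = unname(freq),
                 stringsAsFactors = FALSE)
df
```

The message form is convenient for copying into a spreadsheet, while the data frame form is easier to process further in R.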

Examples

require(tm)
require(chinese.misc)
x <- c(
  "Hello, what do you want to drink?", 
  "drink a bottle of milk", 
  "drink a cup of coffee", 
  "drink some water", 
  "hello, drink a cup of coffee")
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Argument top is 5, but more than 5 terms are returned
sort_tf(dtm, top = 5)
# Set must_exact to TRUE, return exactly 5 terms
sort_tf(dtm, top = 5, must_exact = TRUE)
# Input is a plain matrix without term names
m <- as.matrix(dtm)
colnames(m) <- NULL
# Transpose so that rows are terms, as in a term document matrix
mt <- t(m)
sort_tf(mt, top = 5, type = "tdm")

[Package chinese.misc version 0.2.3 Index]