CDTM {CTM} R Documentation

Document Term Matrix

Description

Constructs a document-term matrix from Chinese text documents.

Usage

CDTM(doc, weighting, EngTermDeleted = TRUE, NumTermDeleted = TRUE,
  shortTermDeleted = TRUE)

Arguments

doc

The Chinese text documents, supplied as a vector of Chinese strings.

weighting

The weighting function applied to the matrix. Available options are "binary", "count", "tf" and "tfidf". See Details.

EngTermDeleted

Remove English terms from the text documents. Defaults to TRUE.

NumTermDeleted

Remove numbers from the text documents. Defaults to TRUE.

shortTermDeleted

Delete short terms, i.e. those with nchar < 2. Defaults to TRUE.

Details

This function performs Chinese word segmentation with jiebaR and builds a document-term matrix. Four weighting functions are available. "binary" sets a value to 1 if the term occurs in the document and 0 otherwise. "count" is the number of times the term occurs in the document. "tf" is the term frequency. "tfidf" is the term frequency-inverse document frequency.
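The four weightings can be illustrated on a small corpus using base R only. This is a minimal sketch, not the package's implementation: the documents are assumed to be already segmented, and the formulas (tf as count divided by document length, idf as the natural log of the number of documents over the document frequency) are common textbook definitions that CDTM may or may not follow exactly. The names docs, vocab, count, binary, tf, idf and tfidf are illustrative.

```r
# Toy corpus: each document is already split into terms (assumption:
# in CDTM this step would be done by the word segmenter).
docs <- list(
  d1 = c("hello", "taiwan"),
  d2 = c("world", "of", "tank"),
  d3 = c("taiwan", "weather"),
  d4 = c("local", "weather", "weather")
)

vocab <- sort(unique(unlist(docs)))

# "count": number of times each term occurs in each document.
count <- t(vapply(docs, function(terms) {
  as.numeric(table(factor(terms, levels = vocab)))
}, numeric(length(vocab))))
colnames(count) <- vocab

# "binary": 1 if the term occurs in the document, 0 otherwise.
binary <- (count > 0) * 1

# "tf": count divided by the document length (assumed definition).
tf <- count / rowSums(count)

# "tfidf": tf scaled by log(N / document frequency) (assumed definition).
idf <- log(nrow(count) / colSums(count > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))
```

Under these assumed definitions, a term that appears in many documents (such as "weather" here) receives a smaller idf than a term confined to one document, so its tfidf weight is reduced relative to its raw term frequency.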

Author(s)

Jim Liu, Quan Gu

Examples

library(CTM)
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"
text1 <- t(data.frame(a1,b1,c1,d1))
dtm1 <- CDTM(doc = text1, weighting = "tfidf", EngTermDeleted = FALSE, shortTermDeleted = FALSE)

[Package CTM version 0.2 Index]