R: Document Term Matrix

CDTM {CTM}

R Documentation

Document Term Matrix

Description

Constructs Document-Term Matrix from Chinese Text Documents.

Usage

CDTM(doc, weighting, EngTermDeleted = TRUE, NumTermDeleted = TRUE,
  shortTermDeleted = TRUE)

Arguments

`doc`	The Chinese text document. A vector of Chinese strings.
`weighting`	Available weighting function with matrix are binary, count, tf, tfidf. See details.
`EngTermDeleted`	remove English from text documents.
`NumTermDeleted`	remove Numbers from text documents.
`shortTermDeleted`	Deltected short word when nchar <2.

Details

This function run a Chinese word segmentation by jiebeR and build document-term matrix, and there is four weighting function with matrix, and "binary" means value can only be 1 if the term occurs, "count" means how many times the term occurs in a doc, "tf" means term frequency and "tfidf" means term frequency inverse document frequency.

Author(s)

Jim Liu, Quan Gu

Examples

library(CTM)
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"
text1 <- t(data.frame(a1,b1,c1,d1))
dtm1 <- CTDM(doc = text1, weighting = "tfidf",EngTermDeleted = FALSE, shortTermDeleted = FALSE)

[Package CTM version 0.2 Index]