R: Term Document Matrix

CTDM {CTM}

R Documentation

Term Document Matrix

Description

Constructs a term-document matrix from Chinese text documents.

Usage

CTDM(doc, weighting, EngTermDeleted = TRUE, NumTermDeleted = TRUE,
  shortTermDeleted = TRUE)

Arguments

`doc`	The Chinese text documents, given as a character vector of Chinese strings.
`weighting`	The weighting scheme applied to the matrix. One of "binary", "count", "tf", or "tfidf". See Details.
`EngTermDeleted`	If TRUE, remove English terms from the text documents.
`NumTermDeleted`	If TRUE, remove numbers from the text documents.
`shortTermDeleted`	If TRUE, delete short terms, i.e. those with nchar < 2.
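The three logical arguments act as term filters. How CTDM implements them internally is not documented here, so the following is only a minimal base-R sketch of the idea, assuming the filters are applied to a vector of already segmented terms. The `tokens` vector and the regular expressions are illustrative assumptions, not CTM code.

```r
# Hypothetical, already segmented terms (not produced by CTM or jiebaR).
tokens <- c("台灣", "weather", "2024", "的", "天氣", "A1")

# Assumed equivalents of the three filters:
not_english <- !grepl("[A-Za-z]", tokens)  # EngTermDeleted = TRUE
not_numeric <- !grepl("[0-9]", tokens)     # NumTermDeleted = TRUE
not_short   <- nchar(tokens) >= 2          # shortTermDeleted = TRUE (drops nchar < 2)

kept <- tokens[not_english & not_numeric & not_short]
print(kept)
```

Setting an argument to FALSE would correspond to skipping the matching filter in this sketch.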

Details

This function runs Chinese word segmentation with jiebaR and builds a term-document matrix. Four weighting schemes are available: "binary" sets a value to 1 if the term occurs in the document (and 0 otherwise), "count" is the number of times the term occurs in a document, "tf" is the term frequency, and "tfidf" is the term frequency-inverse document frequency.
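To make the four weightings concrete, here is a minimal base-R sketch on a toy corpus. It does not call CTM or jiebaR: whitespace splitting stands in for segmentation, and the formulas used (tf as count divided by document length, idf as the natural log of the number of documents over the document frequency) are common textbook definitions assumed for illustration. CTDM may normalize or scale differently.

```r
# Toy corpus; whitespace splitting stands in for jiebaR segmentation.
docs <- c(d1 = "hello taiwan", d2 = "world of tank",
          d3 = "taiwan weather", d4 = "local weather weather")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# "count": rows are terms, columns are documents.
count <- sapply(tokens, function(tk) table(factor(tk, levels = terms)))

# "binary": 1 if the term occurs in the document, 0 otherwise.
binary <- (count > 0) * 1

# "tf" (assumed definition): count divided by the document's total term count.
tf <- sweep(count, 2, colSums(count), "/")

# "tfidf" (assumed definition): tf * log(N / document frequency).
idf   <- log(ncol(count) / rowSums(count > 0))
tfidf <- tf * idf  # idf is recycled down each column, i.e. per term

print(round(tfidf, 3))
```

Terms that occur in every document get an idf of 0 under this definition, which is why tfidf tends to highlight terms that are distinctive for a document.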

Author(s)

Jim Liu, Quan Gu

Examples

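library(CTM)

# Four short sample documents
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"

# Collect the documents, one per row
text1 <- t(data.frame(a1, b1, c1, d1))

# Build a tf-idf weighted term-document matrix.
# The sample text is English, so English and short terms are kept.
tdm1 <- CTDM(doc = text1, weighting = "tfidf", EngTermDeleted = FALSE, shortTermDeleted = FALSE)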

[Package CTM version 0.2 Index]