sample.textmatrix {lsa} | R Documentation |
Create a random sample of files
Description
Creates a subset of the documents of a corpus to help reduce a corpus in size through random sampling.
Usage
sample.textmatrix(textmatrix, samplesize, index.return=FALSE)
Arguments
textmatrix |
A document-term matrix. |
samplesize |
Desired number of files (documents) in the sample. |
index.return |
If set to TRUE, the positions of the sampled documents among the columns of the original matrix are returned as well. |
Details
Often a corpus is too large to be processed in memory. One way to reduce its size is to select a random subset of the documents, on the assumption that random selection leaves the term sets and term distributions essentially unchanged.
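The idea can be sketched in base R without the package. This is a minimal illustration, not the implementation used by sample.textmatrix: it assumes that documents are stored as the columns of the matrix, and the toy matrix `dtm` and the names `ix`, `subset_dtm` and `filelist` are hypothetical, chosen only for this example.

```r
# Toy document-term matrix (hypothetical data, built by hand):
# rows are terms, columns are documents.
dtm <- matrix(
  c(1, 0, 1, 2,
    1, 0, 0, 0,
    1, 1, 0, 1,
    0, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("dog", "cat", "mouse", "hamster"),
                  c("D1", "D2", "D3", "D4"))
)

samplesize <- 3

# Draw column positions at random, without replacement.
ix <- sample(seq_len(ncol(dtm)), samplesize)

# Keep only the sampled documents; drop = FALSE preserves the matrix shape
# even when samplesize is 1.
subset_dtm <- dtm[, ix, drop = FALSE]

# Names of the sampled documents, in the order they were drawn.
filelist <- colnames(subset_dtm)
```

Here `ix` plays the role of the positions that index.return refers to: it records where each sampled document sat in the original matrix, so the sample can be traced back later.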
Value
filelist |
A list of filenames of the documents in the corpus. |
ix |
If index.return is set to TRUE, a list is returned that also contains the positions of the sampled documents in the original matrix. |
Author(s)
Fridolin Wild f.wild@open.ac.uk
See Also
Examples
library(lsa)

# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))
# create matrices
myMatrix = textmatrix(td, minWordLength=1)
# draw a random sample of 3 of the 4 documents
sample.textmatrix(myMatrix, 3)
# clean up
unlink(td, recursive=TRUE)
[Package lsa version 0.73.3 Index]