R: Create a random sample of files

sample.textmatrix {lsa}

R Documentation

Create a random sample of files

Description

Creates a random subset of the documents in a corpus, reducing the size of the corpus through random sampling.

Usage

   sample.textmatrix(textmatrix, samplesize, index.return=FALSE)

Arguments

`textmatrix`	A document-term matrix.
`samplesize`	Desired number of files.
`index.return`	If set to `TRUE`, the positions of the sampled documents among the original column vectors are returned as well.

Details

A corpus is often too large to be processed in memory. One way to reduce its size is to select a random subset of the documents, on the assumption that random selection does not change the nature of the term sets and their distributions.
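As a rough illustration of this assumption, the sketch below uses base R only (it does not call the lsa package) on a small made-up matrix with terms as rows and documents as columns. The names `tdm`, `keep`, and `reduced` are hypothetical. It draws a random subset of the document columns and compares the relative term frequencies before and after sampling.

```r
# Toy matrix: rows are terms, columns are documents (made-up counts).
set.seed(42)  # fixed seed so the sample is reproducible
tdm <- matrix(
  c(1, 0, 1, 2,
    1, 0, 0, 0,
    1, 1, 0, 1,
    0, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("dog", "cat", "mouse", "hamster"),
                  c("D1", "D2", "D3", "D4"))
)

# Draw 3 of the 4 document columns without replacement.
keep <- sort(sample(ncol(tdm), 3))
reduced <- tdm[, keep, drop = FALSE]

# Relative term frequencies in the full and the reduced corpus.
full_share    <- rowSums(tdm) / sum(tdm)
reduced_share <- rowSums(reduced) / sum(reduced)
print(round(cbind(full_share, reduced_share), 2))
```

With only four documents the two columns can differ noticeably; the assumption is more plausible for large corpora, where a random sample is more likely to preserve the overall term distribution.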

Value

`filelist`	A list of filenames of the documents in the corpus.
`ix`	If `index.return` is set to `TRUE`, a list is returned: `x` contains the filenames and `ix` contains the positions of the sampled files in the original filelist.
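The return shape described above can be sketched in base R. This is not the lsa implementation: `sample_names` is a hypothetical helper written only to show how the `x` and `ix` components relate to each other.

```r
# Hypothetical helper mimicking the documented return shape; not the lsa code.
sample_names <- function(filelist, samplesize, index.return = FALSE) {
  ix <- sample(length(filelist), samplesize)  # random positions, no replacement
  if (index.return) {
    list(x = filelist[ix], ix = ix)  # names plus their original positions
  } else {
    filelist[ix]                     # names only
  }
}

files <- c("D1", "D2", "D3", "D4")
res <- sample_names(files, 2, index.return = TRUE)

# The positions in ix map the sample back to the original list.
identical(files[res$ix], res$x)
```

Keeping the positions is useful when the sampled documents must later be matched against other data that is ordered like the original corpus.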

Author(s)

Fridolin Wild f.wild@open.ac.uk

Examples


# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/"))
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/"))
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/"))
write( c("dog", "mouse", "dog"), file=paste(td, "D4", sep="/"))

# create matrices
myMatrix = textmatrix(td, minWordLength=1)

# draw a random sample of 3 documents (explicit call, matching the Usage section)
sample.textmatrix(myMatrix, 3)

# clean up
unlink(td, recursive=TRUE)

[Package lsa version 0.73.3 Index]