fold_in {lsa} | R Documentation |
Ex-post folding-in of textmatrices into an existing latent semantic space
Description
Additional documents can be mapped into a pre-existing latent semantic space without influencing the factor distribution of the space. Use this when new documents must not change the latent semantic factor structure that has already been calculated.
Usage
fold_in( docvecs, LSAspace )
Arguments
LSAspace |
a latent semantic space generated by lsa(). |
docvecs |
a textmatrix. |
Details
To keep additional documents from influencing the factor distribution
calculated previously from a particular text basis, they can be folded in
after the singular value decomposition performed in lsa().
Background Information:
For folding-in, a pseudo document vector \hat{m} of the new documents
is calculated as shown in equations (1) and (2) (cf. Berry et al., 1995):
(1) \hat{d} = v^T T_k S_k^{-1}
(2) \hat{m} = T_k S_k \hat{d}
The document vector v^T in equation (1) is identical to an additional
column of an input textmatrix M, holding the term frequencies of the
document to be folded in. T_k and S_k are the truncated matrices
from the SVD applied through lsa() on a given text collection to
construct the latent semantic space. The resulting vector \hat{m}
from equation (2) is identical to an additional column in the
textmatrix representation of the latent semantic space (as produced by
as.textmatrix()). Be careful when using weighting schemes: you
may want to apply the global weights of the training textmatrix to the
new data you fold in as well.
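Equations (1) and (2) can be traced by hand with base R. The following is a minimal sketch, not the package's implementation: it uses svd() directly instead of lsa(), a made-up toy matrix, and a hypothetical helper named fold_in_sketch(). It also assumes that \hat{d} is a row vector and therefore transposes it before the product in equation (2) so that the dimensions conform.

```r
# Toy training matrix: rows are terms, columns are documents (made-up data).
M <- matrix(c(1, 1, 1, 0, 0,    # D1: dog, cat, mouse
              0, 0, 1, 1, 1,    # D2: mouse, hamster, monster
              1, 0, 0, 0, 2),   # D3: dog, monster, monster
            nrow = 5,
            dimnames = list(c("dog", "cat", "mouse", "hamster", "monster"),
                            c("D1", "D2", "D3")))

# Truncated SVD with k factors (k chosen arbitrarily for illustration).
k   <- 2
dec <- svd(M)
Tk  <- dec$u[, 1:k, drop = FALSE]   # term vectors, terms x k
Sk  <- diag(dec$d[1:k], nrow = k)   # singular values, k x k
Dk  <- dec$v[, 1:k, drop = FALSE]   # document vectors, documents x k

# Hypothetical helper: applies equation (1), then equation (2).
fold_in_sketch <- function(v, Tk, Sk) {
  d_hat <- t(v) %*% Tk %*% solve(Sk)   # (1): 1 x k pseudo document vector
  Tk %*% Sk %*% t(d_hat)               # (2): terms x 1 column in the space
}

# A new document, expressed over the training vocabulary.
v <- matrix(c(0, 1, 2, 0, 0), ncol = 1,
            dimnames = list(rownames(M), "A1"))
m_hat <- fold_in_sketch(v, Tk, Sk)
rownames(m_hat) <- rownames(M)

# Sanity check: folding in a training document reproduces its column
# in the rank-k reconstruction of M.
M_k <- Tk %*% Sk %*% t(Dk)
dimnames(M_k) <- dimnames(M)
m_D1 <- fold_in_sketch(M[, "D1", drop = FALSE], Tk, Sk)
```

Because S_k and its inverse cancel, the sketch amounts to projecting v onto the subspace spanned by the columns of T_k, which is why the training space itself is left unchanged by the new document.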
Value
textmatrix |
a textmatrix representation of the additional documents in the latent semantic space. |
Author(s)
Fridolin Wild f.wild@open.ac.uk
See Also
textmatrix, lsa, as.textmatrix
Examples
# create a first textmatrix with some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )
matrix1 = textmatrix(td, minWordLength=1)
unlink(td, recursive=TRUE)
# create a second textmatrix with some more files
td = tempfile()
dir.create(td)
write( c("cat", "mouse", "mouse"), file=paste(td, "A1", sep="/") )
write( c("nothing", "mouse", "monster"), file=paste(td, "A2", sep="/") )
write( c("cat", "monster", "monster"), file=paste(td, "A3", sep="/") )
matrix2 = textmatrix(td, vocabulary=rownames(matrix1), minWordLength=1)
unlink(td, recursive=TRUE)
# create an LSA space from matrix1
space1 = lsa(matrix1, dims=dimcalc_share())
as.textmatrix(space1)
# fold matrix2 into the space generated by matrix1
fold_in( matrix2, space1)
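The weighting caveat under Details can be illustrated in the same spirit. This is a sketch in base R under stated assumptions: the matrices are made-up toy data, and idf_weights() is a hypothetical helper that computes an inverse-document-frequency style global weight; it stands in for whichever global weighting scheme was applied to the training textmatrix. The point is only that the weights are computed once from the training data and then reused for the documents to be folded in.

```r
# Toy training and new-document matrices over the same vocabulary (made-up data).
terms <- c("dog", "cat", "mouse", "hamster", "monster")
M <- matrix(c(1, 1, 1, 0, 0,
              0, 0, 1, 1, 1,
              1, 0, 0, 0, 2),
            nrow = 5, dimnames = list(terms, c("D1", "D2", "D3")))
N <- matrix(c(0, 1, 2, 0, 0,
              0, 0, 1, 0, 2),
            nrow = 5, dimnames = list(terms, c("A1", "A2")))

# Hypothetical helper: one global weight per term (row), derived from
# the number of documents in which the term occurs.
idf_weights <- function(m) log2(ncol(m) / rowSums(m > 0)) + 1

# Global weights come from the TRAINING matrix only.
gw_train <- idf_weights(M)

# Multiplying a matrix by a vector of length nrow() scales each row,
# so every term gets the same weight in both matrices.
M_weighted <- M * gw_train
N_weighted <- N * gw_train
```

Computing the weights from N itself would give different values and, for terms that do not occur in the new documents (here "dog" and "hamster"), a division by zero; reusing gw_train keeps the folded-in documents on the same scale as the training data.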