textmodel_lsa {quanteda.textmodels} | R Documentation |
Latent Semantic Analysis
Description
Fit the Latent Semantic Analysis scaling model to a dfm, which may be
weighted (for instance using quanteda::dfm_tfidf()).
Usage
textmodel_lsa(x, nd = 10, margin = c("both", "documents", "features"))
Arguments
x | the dfm on which the model will be fit
nd | the number of dimensions to be included in the output
margin | the margin to be smoothed by the SVD
Details
svds() from the RSpectra package is used to compute the SVD quickly.
Value
a textmodel_lsa class object, a list containing:
- sk: a numeric vector containing the d values from the SVD
- docs: document coordinates from the SVD (u)
- features: feature coordinates from the SVD (v)
- matrix_low_rank: the multiplication of udv'
- data: the input data as a CSparseMatrix from the Matrix package
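The relationship between these components can be sketched on a toy matrix. This is a minimal illustration, not the package's implementation: it swaps in base R's svd() for RSpectra::svds() so that it runs without extra packages, and the count matrix x and the choice nd = 2 are invented for the example.

```r
# Toy document-feature counts: 4 documents (rows) by 6 features (columns).
# The values are invented for illustration only.
x <- matrix(c(2, 0, 1, 0, 0, 3,
              0, 1, 0, 2, 1, 0,
              1, 0, 2, 0, 0, 1,
              0, 2, 0, 1, 4, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), paste0("feat", 1:6)))

nd <- 2  # number of dimensions to keep (assumed value)

# Base R computes the full SVD; keep only the first nd components.
dec <- svd(x)
sk <- dec$d[seq_len(nd)]                        # singular values (d)
docs <- dec$u[, seq_len(nd), drop = FALSE]      # document coordinates (u)
features <- dec$v[, seq_len(nd), drop = FALSE]  # feature coordinates (v)
rownames(docs) <- rownames(x)
rownames(features) <- colnames(x)

# Rank-nd reconstruction u d v', with the same shape as the input.
matrix_low_rank <- docs %*% diag(sk, nrow = nd) %*% t(features)

dim(docs)             # one row per document, nd columns
dim(features)         # one row per feature, nd columns
dim(matrix_low_rank)  # same dimensions as x
```

Here docs has one row per document and features has one row per feature, each with nd columns, while matrix_low_rank has the dimensions of the input.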
Note
The number of dimensions nd retained in LSA is an empirical
issue. While reducing the number of dimensions can remove much of the
noise, keeping too few dimensions or factors may lose important information.
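The trade-off can be made concrete by measuring how much of a matrix is lost at each rank. The following is a rough sketch under stated assumptions: it uses base R's svd() on simulated Poisson counts rather than textmodel_lsa() on a real dfm, and the matrix size and seed are arbitrary. It relies on the standard result that the Frobenius error of a rank-k truncation equals the square root of the sum of the squared discarded singular values.

```r
set.seed(42)

# Simulated counts standing in for a small dfm: 8 documents by 12 features.
x <- matrix(rpois(8 * 12, lambda = 2), nrow = 8)
dec <- svd(x)

# Frobenius norm of the reconstruction error for each rank k = 1, ..., 7.
ranks <- 1:7
err <- sapply(ranks, function(k) {
  approx_k <- dec$u[, 1:k, drop = FALSE] %*%
    diag(dec$d[1:k], nrow = k) %*%
    t(dec$v[, 1:k, drop = FALSE])
  norm(x - approx_k, type = "F")
})

# The same quantity computed directly from the discarded singular values.
err_from_d <- sapply(ranks, function(k) sqrt(sum(dec$d[(k + 1):8]^2)))

data.frame(k = ranks, error = err, error_from_d = err_from_d)
```

The error never increases as k grows, so a plot of the error (or of the singular values in sk) against the number of dimensions is one informal way to look for a point beyond which extra dimensions add little.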
Author(s)
Haiyan Wang and Kohei Watanabe
References
Rosario, B. (2000). Latent Semantic Indexing: An Overview. Technical report INFOSYS 240 Spring Paper, University of California, Berkeley.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): 391-407.
See Also
predict.textmodel_lsa(), coef.textmodel_lsa()
Examples
library("quanteda")
library("quanteda.textmodels")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
# create an LSA space and return its truncated representation in the low-rank space
tmod <- textmodel_lsa(dfmat[1:10, ])
head(tmod$docs)
# matrix in low-rank LSA space
tmod$matrix_low_rank[, 1:5]
# fold queries into the space generated by dfmat[1:10, ]
# and return their truncated representation in the new low-rank space
pred <- predict(tmod, newdata = dfmat[11:14, ])
pred$docs_newspace