textmodel_lsa {quanteda.textmodels} | R Documentation |
Latent Semantic Analysis
Description
Fit the Latent Semantic Analysis scaling model to a dfm, which may be
weighted (for instance using quanteda::dfm_tfidf()).
Usage
textmodel_lsa(x, nd = 10, margin = c("both", "documents", "features"))
Arguments
x | the dfm on which the model will be fit |
nd | the number of dimensions to be included in output |
margin | margin to be smoothed by the SVD |
Details
RSpectra::svds() is used to compute the truncated SVD quickly.
Value
a textmodel_lsa class object, a list containing:
- sk: a numeric vector containing the d values from the SVD
- docs: document coordinates from the SVD (u)
- features: feature coordinates from the SVD (v)
- matrix_low_rank: the multiplication of udv'
- data: the input data as a CSparseMatrix from the Matrix package
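As a sketch of how these components relate, assuming the standard rank-nd truncated SVD of the input matrix: the symbols X, k, U_k, Σ_k, V_k and σ_i below are notation introduced here for illustration, not names used by the package.

```latex
\begin{aligned}
X &\approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k = \mathtt{nd},\\
\mathtt{sk} &= (\sigma_1, \dots, \sigma_k) = \operatorname{diag}(\Sigma_k),\\
\mathtt{docs} &= U_k, \qquad
\mathtt{features} = V_k, \qquad
\mathtt{matrix\_low\_rank} = X_k .
\end{aligned}
```

Here X is the documents-by-features dfm, U_k and V_k hold the first k left and right singular vectors, and Σ_k is the diagonal matrix of the k largest singular values.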
Note
The number of dimensions nd retained in LSA is an empirical
issue. While reducing nd can remove much of the noise, keeping
too few dimensions or factors may lose important information.
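One way to make this trade-off concrete is a general property of the truncated SVD (the Eckart-Young result), which is not specific to this function. The notation is introduced here for illustration: X is the input matrix of rank r, X_k is its best rank-k approximation with k = nd, and σ_1 ≥ σ_2 ≥ ... ≥ σ_r are the singular values of X.

```latex
\lVert X - X_k \rVert_F^{2} \;=\; \sum_{i = k + 1}^{r} \sigma_i^{2}
```

The information discarded by choosing a smaller nd is therefore the sum of the squared singular values that are dropped. Since sk holds only the d values for the retained dimensions, this tail sum is not directly available from the fitted object.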
Author(s)
Haiyan Wang and Kohei Watanabe
References
Rosario, B. (2000). Latent Semantic Indexing: An Overview. Technical report INFOSYS 240 Spring Paper, University of California, Berkeley.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): 391.
See Also
predict.textmodel_lsa(), coef.textmodel_lsa()
Examples
library("quanteda")
library("quanteda.textmodels")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
# create an LSA space and return its truncated representation in the low-rank space
tmod <- textmodel_lsa(dfmat[1:10, ])
head(tmod$docs)
# matrix in low_rank LSA space
tmod$matrix_low_rank[,1:5]
# fold queries into the space generated by dfmat[1:10, ]
# and return their truncated representation in the new low-rank space
pred <- predict(tmod, newdata = dfmat[11:14, ])
pred$docs_newspace