alignCorpus {stm}                                                R Documentation
Align the vocabulary of a new corpus to an old corpus
Description
Function that takes a list of documents, a vocabulary, and (optionally) metadata
for a corpus of previously unseen documents and aligns them to an old vocabulary.
Helps preprocess documents for fitNewDocuments.
Usage
alignCorpus(new, old.vocab, verbose = TRUE)
Arguments
new
    a list (such as those produced by textProcessor) containing the new,
    previously unseen documents along with their vocab and (optionally) meta.
old.vocab
    a character vector containing the vocabulary that you want to align to.
    In general this will be the vocab used in your original stm model fit,
    which for an stm object called mod.out can be accessed as mod.out$vocab.
verbose
    a logical indicating whether information about the new corpus should be
    printed to the screen. Defaults to TRUE.
Details
When estimating topic proportions for previously unseen documents using
fitNewDocuments, the new documents must have the same vocabulary, ordered in
the same way as in the original model. This function helps with that process.
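To make the alignment concrete, here is a minimal sketch (not the package's
internal code) of the re-indexing step for a single document in the stm format,
where each document is a two-row matrix of vocabulary indices and counts;
align_one is a hypothetical helper name used only for illustration:

# Hypothetical helper illustrating the re-indexing step for one document.
# doc is a 2 x N matrix: row 1 holds indices into new.vocab, row 2 holds counts.
align_one <- function(doc, new.vocab, old.vocab) {
  words <- new.vocab[doc[1, ]]            # words used in this document
  keep  <- words %in% old.vocab           # keep only words present in the old vocab
  idx   <- match(words[keep], old.vocab)  # re-index against the old vocabulary
  rbind(as.integer(idx), doc[2, keep])    # same 2-row format, now indexed to old.vocab
}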
Note: the code is not built for speed or memory efficiency; if you are trying to do this with a very large corpus of new texts you might consider building the object yourself using quanteda or some other option.
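As a sketch of the quanteda route mentioned in the Note, assuming quanteda is
installed and the new texts live in a character vector called new.texts (a
hypothetical name):

library(quanteda)
new.dfm <- dfm(tokens(new.texts))                        # document-feature matrix for the new texts
new.dfm <- dfm_match(new.dfm, features = mod.out$vocab)  # restrict/reorder features to the old vocab
new.stm <- convert(new.dfm, to = "stm")                  # documents, vocab, and meta in stm format
# Documents left with no matching features may need to be dropped before fitNewDocuments.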
Value
documents
    A list containing the documents in the stm format.
vocab
    Character vector of vocabulary.
meta
    Data frame or matrix containing the user-supplied metadata for the retained
    documents.
docs.removed
    Document indices (corresponding to the original data passed) of documents
    removed because they contain no words in the old vocabulary.
words.removed
    Words dropped from the new documents because they do not appear in old.vocab.
tokens.removed
    The total number of tokens dropped from the new documents.
wordcounts
    Counts of the number of times each word in the old vocabulary appears in the
    new documents.
prop.overlap
    A length-two vector used to populate the message printed when verbose = TRUE.
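For example, after a call such as newdocs <- alignCorpus(new=temp,
old.vocab=mod.out$vocab) (see the Examples below), these components can be
inspected directly:

length(newdocs$docs.removed)   # new documents dropped because they share no words with old.vocab
head(newdocs$words.removed)    # a few of the words absent from the old vocabulary
newdocs$tokens.removed         # total number of tokens dropped
newdocs$prop.overlap           # the proportions reported when verbose = TRUE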
See Also
fitNewDocuments
Examples
# We process an original set that is just the first 100 documents
temp <- textProcessor(documents=gadarian$open.ended.response[1:100],
                      metadata=gadarian[1:100,])
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
set.seed(02138)
# The maximum number of EM iterations is set low to make this run fast;
# run models to convergence!
mod.out <- stm(out$documents, out$vocab, 3, prevalence=~treatment + s(pid_rep),
               data=out$meta, max.em.its=5)
# Now we process the remaining documents
temp <- textProcessor(documents=gadarian$open.ended.response[101:nrow(gadarian)],
                      metadata=gadarian[101:nrow(gadarian),])
# Note we don't run prepDocuments here because we don't want to drop any words -
# we want every word that showed up in the old documents.
newdocs <- alignCorpus(new=temp, old.vocab=mod.out$vocab)
# We get some helpful feedback on what has been retained and lost in the printout.
# Now we can fit our new held-out documents
fitNewDocuments(model=mod.out, documents=newdocs$documents, newData=newdocs$meta,
                origData=out$meta, prevalence=~treatment + s(pid_rep),
                prevalencePrior="Covariate")