topics {maptpx} | R Documentation |
Estimation for Topic Models
Description
MAP estimation of Topic models
Usage
topics(counts, K, shape=NULL, initopics=NULL,
tol=0.1, bf=FALSE, kill=2, ord=TRUE, verb=1, ...)
Arguments
counts |
A matrix of multinomial response counts in |
K |
The number of latent topics. If |
shape |
Optional argument to specify the Dirichlet prior concentration parameter as |
initopics |
Optional start-location for |
tol |
Convergence tolerance: optimization stops, conditional on some extra checks, when the absolute posterior increase over a full parameter set update is less than |
bf |
An indicator for whether or not to calculate the Bayes factor for univariate |
kill |
For choosing from multiple |
ord |
If |
verb |
A switch for controlling printed output. |
... |
Additional arguments to the undocumented internal |
Details
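A latent topic model represents each i'th document's term-count vector $X_i$ (with $\sum_{j=1}^{p} x_{ij} = m_i$ total phrase count) as having been drawn from a mixture of $K$ multinomials, each parameterized by topic-phrase probabilities $\theta_k$, such that
$$X_i \sim \mathrm{MN}(m_i,\ \omega_{i1}\theta_1 + \cdots + \omega_{iK}\theta_K).$$
We assign a K-dimensional Dirichlet(1/K) prior to each document's topic weights $[\omega_{i1}, \ldots, \omega_{iK}]$, and the prior on each $\theta_k$ is Dirichlet with concentration $\alpha$.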
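The generative model described above can be sketched in base R. This is only an illustration under stated assumptions: `rdirichlet_rows` is a hypothetical helper (not part of maptpx) that draws Dirichlet vectors by normalizing gamma variates, and the sizes and the 1/p concentration for the topic-phrase probabilities are arbitrary choices for the sketch.

```r
set.seed(42)
n <- 5   # documents
p <- 8   # phrases
K <- 3   # topics

# Hypothetical helper (an assumption, not a maptpx function):
# one Dirichlet(alpha) draw per row, via normalized gamma variates.
rdirichlet_rows <- function(m, alpha) {
  g <- matrix(rgamma(m * length(alpha), shape = alpha), nrow = m, byrow = TRUE)
  g / rowSums(g)
}

omega <- rdirichlet_rows(n, rep(1 / K, K))     # n x K document topic weights
theta <- t(rdirichlet_rows(K, rep(1 / p, p)))  # p x K topic-phrase probabilities

m <- rpois(n, 50)         # total phrase count per document
Q <- omega %*% t(theta)   # n x p mixture probabilities, one row per document

# Each document's counts are a single multinomial draw with probabilities Q[i, ]
X <- t(sapply(seq_len(n), function(i) rmultinom(1, size = m[i], prob = Q[i, ])))
```

Each row of `Q` is a convex combination of the K topic columns of `theta`, so it is itself a probability vector, and each row of `X` sums to the corresponding document total in `m`.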
The topics function uses quasi-Newton accelerated EM, augmented with sequential quadratic programming for conditional updates, to obtain MAP estimates for the topic model parameters. We also provide Bayes factor estimation, from marginal likelihood calculations based on a Laplace approximation around the converged MAP parameter estimates. If input length(K)>1, these Bayes factors are used for model selection. Full details are in Taddy (2012).
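To give a feel for the EM component of the fitting procedure, here is a minimal sketch of plain EM updates for the mixture-of-multinomials likelihood. It is a deliberate simplification and not the package's algorithm: it omits the Dirichlet priors (so it climbs the likelihood rather than the posterior), the quasi-Newton acceleration, and the sequential quadratic programming step. The toy counts and all variable names are assumptions made for illustration.

```r
set.seed(1)
n <- 20; p <- 15; K <- 3
X <- matrix(rpois(n * p, 2), n, p)   # toy count matrix (assumption)

# Random positive starting values, normalized onto the simplex
omega <- matrix(rgamma(n * K, 1), n, K)
omega <- omega / rowSums(omega)                   # rows sum to 1
theta <- matrix(rgamma(p * K, 1), p, K)
theta <- sweep(theta, 2, colSums(theta), "/")     # columns sum to 1

# Multinomial log likelihood, up to an additive constant
loglik <- function(X, omega, theta) sum(X * log(omega %*% t(theta)))

ll <- numeric(25)
for (it in seq_along(ll)) {
  Q <- omega %*% t(theta)   # n x p fitted phrase probabilities
  R <- X / Q                # observed counts relative to fitted probabilities
  # Expected counts of phrase j from topic k in document i are
  # x_ij * omega_ik * theta_jk / q_ij; sum over j (or i) and renormalize.
  omega_new <- omega * (R %*% theta)
  theta_new <- theta * (t(R) %*% omega)
  omega <- omega_new / rowSums(omega_new)
  theta <- sweep(theta_new, 2, colSums(theta_new), "/")
  ll[it] <- loglik(X, omega, theta)
}
```

The recorded values in `ll` should never decrease from one iteration to the next, which is the usual monotonicity property of EM and a convenient sanity check on the update formulas.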
Value
A topics object list with entries
K |
The number of latent topics estimated. If input |
theta |
The |
omega |
The |
BF |
The log Bayes factor for each number of topics in the input |
D |
Residual dispersion: for each element of |
X |
The input count matrix, in |
Note
Estimates are actually functions of the MAP (K-1 or p-1)-dimensional logit transformed natural exponential family parameters.
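As an illustration of what a (K-1)-dimensional logit parameterization of a K-dimensional probability vector looks like, the sketch below maps weights to log-odds against a reference category and back. The choice of the first category as the reference, and the function names `to_logit` and `from_logit`, are assumptions for this example; the exact transformation used internally is described in Taddy (2012).

```r
# Map a probability vector of length K to K-1 log-odds (first entry as reference)
to_logit <- function(w) log(w[-1] / w[1])

# Inverse map: fix the reference parameter at 0 and renormalize
from_logit <- function(phi) {
  e <- exp(c(0, phi))
  e / sum(e)
}

w <- c(0.5, 0.3, 0.2)      # example topic weights, K = 3
phi <- to_logit(w)         # unconstrained parameters, length K - 1
w_back <- from_logit(phi)  # recovers the original weights
```

Working on this scale removes the sum-to-one and positivity constraints, which is one reason such transformations are convenient for optimization and for Laplace approximations.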
Author(s)
Matt Taddy mataddy@gmail.com
References
Taddy (2012), On Estimation and Selection for Topic Models. http://arxiv.org/abs/1109.4518
See Also
plot.topics, summary.topics, predict.topics, wsjibm, congress109, we8there
Examples
## Simulation Parameters
K <- 10
n <- 100
p <- 100
omega <- t(rdir(n, rep(1/K,K)))
theta <- rdir(K, rep(1/p,p))
## Simulated counts
Q <- omega%*%t(theta)
counts <- matrix(ncol=p, nrow=n)
totals <- rpois(n, 100)
for(i in 1:n){ counts[i,] <- rmultinom(1, size=totals[i], prob=Q[i,]) }
## Bayes Factor model selection (should choose K or nearby)
summary(simselect <- topics(counts, K=K+c(-5:5)), nwrd=0)
## MAP fit for given K
summary( simfit <- topics(counts, K=K, verb=2), nwrd=0 )
## Adjust for label switching and plot the fit (color by topic)
toplab <- rep(0,K)
for(k in 1:K){ toplab[k] <- which.min(colSums(abs(simfit$theta-theta[,k]))) }
par(mfrow=c(1,2))
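## One color per topic (column); the same matrix is reused for omega because n == p here
tpxcols <- matrix(rainbow(K), nrow=nrow(theta), ncol=ncol(theta), byrow=TRUE)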
plot(theta,simfit$theta[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
plot(omega,simfit$omega[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
title("True vs Fitted Values (color by topic)", outer=TRUE, line=-2)
## The S3 method plot functions
par(mfrow=c(1,2))
plot(simfit, lgd.K=2)
plot(simfit, type="resid")