match.gaussmix {matchFeat}    R Documentation
Gaussian Mixture Approach to One-To-One Feature Matching
Description
This function performs maximum likelihood estimation (MLE) for the one-to-one feature matching problem represented as a multivariate Gaussian mixture model. MLE is carried out via the EM algorithm.
Usage
match.gaussmix(x, unit = NULL, mu = NULL, V = NULL, equal.variance = FALSE,
method = c("exact", "approx"), fixed = FALSE, control = list())
Arguments
x
data matrix of dimensions (mn, p): m feature vectors of p variables for each of n units, with rows sorted by unit
unit
integer (= number of units) or vector mapping the rows of x to sample units
mu
matrix of initial estimates of the mean vectors (dimension (p,m)); optional
V
array of initial estimates of the covariance matrices (dimension (p,p,m)); optional
equal.variance
logical: if TRUE, the m covariance matrices are assumed equal (default FALSE)
method
method for calculating the class probabilities of feature vectors: "exact" (default) or "approx"
fixed
logical; if TRUE, mu and V are treated as fixed parameter values and are not updated by the EM algorithm
control
list of tuning parameters; see Details (fields maxit, eps, and verbose)
Details
Given a sample of n statistical units, each having m possibly mislabeled feature vectors, the one-to-one matching problem is to find a set of n label permutations that produce the best match of feature vectors across units. This problem is sometimes referred to as "data association ambiguity".
The feature vectors of all units are represented as independent realizations of m multivariate normal distributions with unknown parameters. For each sample unit, exactly one vector from each distribution is observed and the m corresponding labels are randomly permuted. The goal is to estimate the true class of each feature vector, as well as the mean vector and covariance matrix of each distribution. These quantities are estimated by maximum likelihood (ML) via the Expectation-Maximization (EM) algorithm.
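As an illustration, data from this model can be simulated as follows (an illustrative sketch, not part of the package; the values of p, m, and n are arbitrary):

## Simulate n units, each with m randomly permuted feature vectors in R^p
p <- 2; m <- 3; n <- 50
mu.true <- matrix(rnorm(p * m, sd = 3), p, m)  # one mean vector per class
x <- matrix(0, n * m, p)
for (i in 1:n) {
    sigma <- sample.int(m)  # random label permutation for unit i
    for (j in 1:m) {
        ## feature vector in position j of unit i comes from class sigma[j]
        x[(i - 1) * m + j, ] <- mu.true[, sigma[j]] + rnorm(p)
    }
}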
If x is a matrix, the rows should be sorted by increasing unit label and unit should be a nondecreasing sequence of integers, for example (1,...,1,2,...,2,...,n,...,n) with each integer 1,...,n replicated m times.
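Continuing the sketch above, such a unit argument can be built as follows:

unit <- rep(1:n, each = m)  # (1,...,1, 2,...,2, ..., n,...,n)
## Since the rows are already sorted by unit, passing unit = n is equivalent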
The arguments mu and V should be specified only if a good guess is available for these parameters. Otherwise, bad starting values may cause the EM algorithm to converge to a local maximum of the likelihood function quite far from the global maximum.
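Continuing the simulation sketch above, initial estimates could be supplied as follows (mu0 and V0 are hypothetical starting values of dimensions (p,m) and (p,p,m)):

mu0 <- mu.true + rnorm(p * m, sd = 0.1)  # e.g. a good prior guess at the means
V0 <- array(diag(p), dim = c(p, p, m))   # e.g. identity covariance matrices
fit <- match.gaussmix(x, unit = n, mu = mu0, V = V0)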
If method is set to "exact" (the default), the class probabilities of the feature vectors (given the data) are calculated exactly at each iteration of the EM algorithm. This operation can be slow as it involves calculating the permanents of matrices. The argument method can be set to "approx" to speed up calculations, but this option is not recommended in general as the approximations used are very crude and may produce "bad" EM solutions.
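For instance (a sketch, continuing the example above):

fit.exact  <- match.gaussmix(x, unit = n, method = "exact")   # slow, exact permanents
fit.approx <- match.gaussmix(x, unit = n, method = "approx")  # faster but cruder
c(fit.exact$loglik, fit.approx$loglik)  # compare attained log-likelihoods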
The optional argument control can be specified with these fields:
maxit: maximum number of EM iterations (default 1e4);
eps: relative tolerance for EM convergence (default 1e-8); the EM algorithm stops if the relative increase in log-likelihood between two iterations is less than this tolerance;
verbose: set to TRUE to display algorithm progress (default FALSE).
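For example, to allow more iterations, loosen the convergence tolerance, and print progress (sketch):

fit <- match.gaussmix(x, unit = n,
    control = list(maxit = 2e4, eps = 1e-6, verbose = TRUE))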
Value
A list of class matchFeat with fields
sigma
permutations that best match feature vectors across units ((m,n) matrix)
cluster
associated clusters (= inverse permutations)
P
conditional probability that a feature vector is assigned to its 'true' label ((m,n) matrix)
mu
MLE of true mean vectors ((p,m) matrix)
V
MLE of true covariance matrices ((p,p,m) array, or (p,p) matrix if equal.variance=TRUE)
loglik
maximum value of the log-likelihood
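Continuing the sketch above, the components of the fitted object can be inspected as follows:

dim(fit$sigma)  # (m, n): one permutation per unit
dim(fit$P)      # (m, n): posterior probabilities of the 'true' labels
dim(fit$mu)     # (p, m): estimated mean vectors
dim(fit$V)      # (p, p, m) array, or (p, p) if equal.variance = TRUE
fit$loglik      # attained maximum of the log-likelihood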
References
Degras (2022). "Scalable feature matching across large data collections." doi:10.1080/10618600.2022.2074429
McLachlan and Krishnan (2008). The EM Algorithm and Extensions, 2nd edition. Wiley.
See Also
match.2x, match.bca, match.bca.gen, match.kmeans, match.rec, match.template
Examples
data(optdigits)
x <- optdigits$x
label <- optdigits$label
m <- length(unique(label))
n <- length(unique(optdigits$unit))
## Randomly permute labels within each unit to make the problem harder
for (i in 1:n) {
    idx <- seq.int((i - 1) * m + 1, i * m)
    sigma <- sample.int(m)
    x[idx, ] <- x[idx[sigma], ]
    label[idx] <- label[idx[sigma]]
}
## Fit Gaussian mixture model
fit <- match.gaussmix(x, unit = n)
## Calculate Rand index between estimated clusters and true labels
Rand.index(fit$cluster, label)