match.gaussmix {matchFeat}    R Documentation
Gaussian Mixture Approach to One-To-One Feature Matching
Description
This function performs maximum likelihood estimation (MLE) for the one-to-one feature matching problem represented as a multivariate Gaussian mixture model. MLE is carried out via the EM algorithm.
Usage
match.gaussmix(x, unit = NULL, mu = NULL, V = NULL, equal.variance = FALSE,
method = c("exact", "approx"), fixed = FALSE, control = list())
Arguments
x
data matrix of dimensions (mn, p): m feature vectors of p variables for each of n units, with rows sorted by unit
unit
integer (= number of units) or vector mapping the rows of x to sample units
mu
matrix of initial estimates of the mean vectors (dimension (p,m)); optional
V
array of initial estimates of the covariance matrices (dimension (p,p,m)); optional
equal.variance
logical: if TRUE, the m covariance matrices are assumed equal (default FALSE)
method
method for calculating the class probabilities of feature vectors: "exact" (default) or "approx"
fixed
logical; if TRUE, mu and V are treated as fixed parameter values and are not updated by the EM algorithm
control
list of tuning parameters; see Details (fields maxit, eps, and verbose)
Details
Given a sample of n statistical units, each having m possibly mislabeled feature vectors, the one-to-one matching problem is to find a set of n label permutations that produce the best match of feature vectors across units. This problem is sometimes referred to as "data association ambiguity".
The feature vectors of all units are represented as independent realizations of m multivariate normal distributions with unknown parameters. For each sample unit, exactly one vector from each distribution is observed and the m corresponding labels are randomly permuted. The goal is to estimate the true class of each feature vector, as well as the mean vector and covariance matrix of each distribution. These quantities are estimated by maximum likelihood (ML) via the Expectation-Maximization (EM) algorithm.
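As an illustration, data from this model can be simulated as follows (an illustrative sketch, not part of the package; the values of p, m, and n are arbitrary):

## Simulate n units, each with m randomly permuted feature vectors in R^p
p <- 2; m <- 3; n <- 50
mu.true <- matrix(rnorm(p * m, sd = 3), p, m)  # one mean vector per class
x <- matrix(0, n * m, p)
for (i in 1:n) {
    sigma <- sample.int(m)  # random label permutation for unit i
    for (j in 1:m) {
        ## feature vector in position j of unit i comes from class sigma[j]
        x[(i - 1) * m + j, ] <- mu.true[, sigma[j]] + rnorm(p)
    }
}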
If x is a matrix, the rows should be sorted by increasing unit label and unit should be a nondecreasing sequence of integers, for example (1,...,1,2,...,2,...,n,...,n) with each integer 1,...,n replicated m times.
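Continuing the sketch above, such a unit argument can be built as follows:

unit <- rep(1:n, each = m)  # (1,...,1, 2,...,2, ..., n,...,n)
## Since the rows are already sorted by unit, passing unit = n is equivalent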
The arguments mu and V should be specified only if a good guess is available for these parameters. Otherwise, bad starting values may cause the EM algorithm to converge to a local maximum of the likelihood function quite far from the global maximum.
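Continuing the simulation sketch above, initial estimates could be supplied as follows (mu0 and V0 are hypothetical starting values of dimensions (p,m) and (p,p,m)):

mu0 <- mu.true + rnorm(p * m, sd = 0.1)  # e.g. a good prior guess at the means
V0 <- array(diag(p), dim = c(p, p, m))   # e.g. identity covariance matrices
fit <- match.gaussmix(x, unit = n, mu = mu0, V = V0)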
If method is set to "exact" (the default), the class probabilities of the feature vectors (given the data) are calculated exactly at each iteration of the EM algorithm. This operation can be slow as it involves calculating the permanents of matrices. The argument method can be set to "approx" to speed up calculations, but this option is not recommended in general as the approximations used are very crude and may produce "bad" EM solutions.
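For instance (a sketch, continuing the example above):

fit.exact  <- match.gaussmix(x, unit = n, method = "exact")   # slow, exact permanents
fit.approx <- match.gaussmix(x, unit = n, method = "approx")  # faster but cruder
c(fit.exact$loglik, fit.approx$loglik)  # compare attained log-likelihoods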
The optional argument control can be specified with these fields:
maxit: maximum number of EM iterations (default 1e4);
eps: relative tolerance for EM convergence (default 1e-8); the EM algorithm stops if the relative increase in log-likelihood between two iterations is less than this tolerance;
verbose: set to TRUE to display algorithm progress (default FALSE).
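For example, to allow more iterations, loosen the convergence tolerance, and print progress (sketch):

fit <- match.gaussmix(x, unit = n,
    control = list(maxit = 2e4, eps = 1e-6, verbose = TRUE))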
Value
A list of class matchFeat with fields
sigma
permutations that best match feature vectors across units ((m,n) matrix)
cluster
associated clusters (= inverse permutations)
P
conditional probability that a feature vector is assigned to its 'true' label ((m,n) matrix)
mu
MLE of true mean vectors ((p,m) matrix)
V
MLE of true covariance matrices ((p,p,m) array, or (p,p) matrix if equal.variance=TRUE)
loglik
maximum value of the log-likelihood
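Continuing the sketch above, the components of the fitted object can be inspected as follows:

dim(fit$sigma)  # (m, n): one permutation per unit
dim(fit$P)      # (m, n): posterior probabilities of the 'true' labels
dim(fit$mu)     # (p, m): estimated mean vectors
dim(fit$V)      # (p, p, m) array, or (p, p) if equal.variance = TRUE
fit$loglik      # attained maximum of the log-likelihood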
References
Degras (2022). "Scalable feature matching across large data collections." doi:10.1080/10618600.2022.2074429
McLachlan and Krishnan (2008). The EM Algorithm and Extensions, 2nd edition. Wiley.
See Also
match.2x, match.bca, match.bca.gen, match.kmeans, match.rec, match.template
Examples
data(optdigits)
x <- optdigits$x
label <- optdigits$label
m <- length(unique(label))
n <- length(unique(optdigits$unit))
## Randomly permute labels within each unit to make the problem harder
for (i in 1:n) {
    idx <- seq.int((i - 1) * m + 1, i * m)
    sigma <- sample.int(m)
    x[idx, ] <- x[idx[sigma], ]
    label[idx] <- label[idx[sigma]]
}
## Fit Gaussian mixture model
fit <- match.gaussmix(x, unit = n)
## Calculate Rand index between estimated clusters and true labels
Rand.index(fit$cluster, label)