match.template {matchFeat}R Documentation

Template Matching

Description

This function solves the multidimensional assignment problem with decomposable costs (MDADC) by matching the data to a pre-specified set of vectors (the template). The dissimilarity function is the squared Euclidean distance.

Usage

match.template(x, template = 1L, unit = NULL, w = NULL, 
	method = c("hungarian","bruteforce"), equal.variance = FALSE)

Arguments

x

data: matrix of dimensions mn \times p or 3D array of dimensions (p,m,n) with m = number of labels/classes, n = number of sample units, and p = number of variables)

template

integer (= which sample unit to take as template) or (p,m) matrix

unit

integer (=number of units/datasets) or vector mapping rows of x to sample units (length mn). Must be specified only if x is a matrix.

w

weights for the loss function. Can be specified as a p-vector of diagonal weights or as a full p \times p (positive definite) matrix

method

method for the linear assignment problem: hungarian algorithm or bruteforce

equal.variance

logical; if TRUE, resp. FALSE, return common, resp. label-specific, covariance of matched features

Details

Given n datasets or statistical units, each containing m feature vectors, the one-to-one matching problem is to find a set of n label permutations that produce the best match of feature vectors across units. The objective function to minimize is the sum of squared (Euclidean) distances between all feature vectors having the same (new) label. This amounts to minimizing the sum of the within-label variances.

The template-based method consists in relabeling successively each sample unit to best match a template matrix of feature vectors. This method is very fast but its optimization performance is only as good as the template. For best results, the template should be representative of the collected data.

If x is a matrix, the rows should be sorted by increasing unit label and unit should be a nondecreasing sequence of integers, for example (1,...,1,2,...,2,...,n,...,n) with each integer 1,...,n replicated m times.

The argument w can be specified as a vector of positive numbers (will be recycled to length p if needed) or as a positive definite matrix of size (p,p).

Value

A list of class matchFeat with fields

sigma

best assignement as set of permutations (m\times n matrix)

cluster

best assignement as cluster indicators (m \times n matrix)

objective

minimum objective value

mu

mean vector for each class/label (p \times m matrix)

V

covariance matrix for each class/label (p \times p \times m array if equal.variance is FALSE, p \times p matrix otherwise

call

function call

References

Degras (2022) "Scalable feature matching across large data collections." doi:10.1080/10618600.2022.2074429
https://en.wikipedia.org/wiki/Assignment_problem
https://en.wikipedia.org/wiki/Hungarian_algorithm

See Also

match.2x, match.bca, match.bca.gen, match.gaussmix, match.kmeans, match.rec

Examples

## Generate data
n <- 10
k <- 3
d <- 5
mu <- matrix(1:k, nrow=d, ncol=k, byrow=TRUE)
sigma <- 0.3
x <- array(mu, c(d,k,n)) + rnorm(d*k*n,sigma)
## Match all feature vectors with first case as template
result <- match.template(x,1)
## Display results 
result$cost # cost function
xmatched <- array(dim=dim(x)) 
# re-arranged (matched) feature vectors
for (i in 1:n)
	xmatched[,,i] <- x[,result$sigma[,i],i]

[Package matchFeat version 1.0 Index]