seriate {seriation}R Documentation

Seriate Dissimilarity Matrices, Matrices or Arrays

Description

Tries to find a linear order for objects using data in the form of a dissimilarity matrix (two-way one-mode data), a data matrix (two-way two-mode data), or a data array (k-way k-mode data). The order can then be used to reorder the dissimilarity matrix/data matrix using permute().

Usage

seriate(x, ...)

## S3 method for class 'dist'
seriate(x, method = "Spectral", control = NULL, rep = 1L, ...)

## S3 method for class 'matrix'
seriate(x, method = "PCA", control = NULL, margin = c(1L, 2L), rep = 1L, ...)

## S3 method for class 'array'
seriate(
  x,
  method = "PCA",
  control = NULL,
  margin = seq(length(dim(x))),
  rep = 1L,
  ...
)

## S3 method for class 'data.frame'
seriate(
  x,
  method = "Heatmap",
  control = NULL,
  margin = c(1L, 2L),
  rep = 1L,
  ...
)

## S3 method for class 'table'
seriate(x, method = "CA", control = NULL, margin = c(1L, 2L), ...)

Arguments

x

the data.

...

further arguments are added to the control list.

method

a character string with the name of the seriation method (default: varies by data type).

control

a list of control options passed on to the seriation algorithm.

rep

number of random restarts for randomized methods. Uses seriate_rep().

margin

an integer vector giving the margin indices (dimensions) to be seriated. For example, for a matrix, 1 indicates rows, 2 indicates columns, c(1 ,2) means rows and columns. Unseriated margins return the identity seriation order for that margin.

Details

Seriation methods are managed via a registry. See list_seriation_methods() for help. In the following, we focus on discussing the built-in methods that are registered automatically by the package seriation.

The available control options, default settings, and a description for each algorithm can be retrieved using get_seriation_method(name = "<seriation method>"). Some control parameters are also described in more detail below.

Some methods are very slow, and progress can be printed using the control parameter verbose = TRUE.

Many seriation methods (heuristically) optimize (minimize or maximize) an objective function often called seriation criterion. The value of the seriation criterion for a given order can be calculated using criterion(). In this manual page, we include the criterion, which is optimized by each method using bold font. If no criterion is mentioned, then the method does not directly optimize a criterion. A definition of the different seriation criteria can be found on the criterion() manual page.

Seriation methods for distance matrices (dist)

One-mode two-way data must be provided as a dist object (not a symmetric matrix). Similarities have to be transformed into dissimilarities. Seriation algorithms fall into different groups based on the approach. In the following, we describe the currently implemented methods. A list with all methods and the available parameters is available here. Hahsler (2017) for a more detailed description and an experimental comparison of the most popular methods.

Dendrogram leaf order

These methods create a dendrogram using hierarchical clustering and then derive the seriation order from the leaf order in the dendrogram. Leaf reordering may be applied.

Dimensionality reduction

Find a seriation order by reducing the dimensionality to 1 dimension. This is typically done by minimizing a stress measure or the reconstruction error. Note that dimensionality reduction to a single dimension is a very difficult discrete optimization problem. For example, MDS algorithms used for a single dimension tend to end up in local optima (see Maier and De Leeuw, 2015). However, generally, ordering along a single component of MDS provides good results sufficient for applications like visualization.

Optimization

These methods try to optimize a seriation criterion directly, typically using a heuristic approach.

Other Methods

Seriation methods for matrices (matrix)

Two-mode two-way data are general matrices. Some methods also require that the matrix is positive. Data frames and contingency tables (base::table) are converted into a matrix. However, the default methods are different.

Some methods find the row and column order simultaneously, while others calculate them independently. Currently, the following methods are implemented for matrix:

Seriating rows and columns simultaneously

Row and column order influence each other.

Seriating rows and columns separately using dissimilarities

Seriate rows using the data matrix

These methods need access to the data matrix instead of dissimilarities to reorder objects (rows). Columns can also be reorderd by applying the same technique to the transposed data matrix.

Performs 1D the non-linear dimensionality reduction method locally linear embedding (see lle()).

Other methods

For general arrays no built-in methods are currently available.

Value

Returns an object of class ser_permutation.

Author(s)

Michael Hahsler

References

Arabie, P. and L.J. Hubert (1990): The bond energy algorithm revisited, IEEE Transactions on Systems, Man, and Cybernetics, 20(1), 268–274. doi:10.1109/21.47829

Bar-Joseph, Z., E. D. Demaine, D. K. Gifford, and T. Jaakkola. (2001): Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics, 17(1), 22–29. doi:10.1093/bioinformatics/17.suppl_1.S22

Barnard, S. T., A. Pothen, and H. D. Simon (1993): A Spectral Algorithm for Envelope Reduction of Sparse Matrices. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, 493–502. Supercomputing '93. New York, NY, USA: ACM. https://ieeexplore.ieee.org/document/1263497

Bezdek, J.C. and Hathaway, R.J. (2002): VAT: a tool for visual assessment of (cluster) tendency. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), Volume: 3, 2225–2230. doi:10.1109/IJCNN.2002.1007487

Brusco, M., Koehn, H.F., and Stahl, S. (2008): Heuristic Implementation of Dynamic Programming for Matrix Permutation Problems in Combinatorial Data Analysis. Psychometrika, 73(3), 503–522. doi:10.1007/s11336-007-9049-5

Brusco, M., and Stahl, S. (2005): Branch-and-Bound Applications in Combinatorial Data Analysis. New York: Springer. doi:10.1007/0-387-28810-4

Chen, C. H. (2002): Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices. Statistica Sinica, 12(1), 7–29.

Ding, C. and Xiaofeng He (2004): Linearized cluster assignment via spectral ordering. Proceedings of the Twenty-first International Conference on Machine learning (ICML '04). doi:10.1145/1015330.1015407

Climer, S. and Xiongnu Zhang (2006): Rearrangement Clustering: Pitfalls, Remedies, and Applications, Journal of Machine Learning Research, 7(Jun), 919–943.

D. Earle, C. B. Hurley (2015): Advances in dendrogram seriation for application to visualization. Journal of Computational and Graphical Statistics, 24(1), 1–25.

Friendly, M. (2002): Corrgrams: Exploratory Displays for Correlation Matrices. The American Statistician, 56(4), 316–324. doi:10.1198/000313002533

Gruvaeus, G. and Wainer, H. (1972): Two Additions to Hierarchical Cluster Analysis, British Journal of Mathematical and Statistical Psychology, 25, 200–206. doi:10.1111/j.2044-8317.1972.tb00491.x

Hahsler, M. (2017): An experimental comparison of seriation methods for one-mode two-way data. European Journal of Operational Research, 257, 133–143. doi:10.1016/j.ejor.2016.08.066

Hubert, Lawrence, and James Schultz (1976): Quadratic Assignment as a General Data Analysis Strategy. British Journal of Mathematical and Statistical Psychology, 29(2). Blackwell Publishing Ltd. 190–241. doi:10.1111/j.2044-8317.1976.tb00714.x

Hurley, Catherine B. (2004): Clustering Visualizations of Multidimensional Data. Journal of Computational and Graphical Statistics, 13(4), 788–806. doi:10.1198/106186004X12425

Kruskal, J.B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29, 115–129.

Lenstra, J.K (1974): Clustering a Data Array and the Traveling-Salesman Problem, Operations Research, 22(2) 413–414. doi:10.1287/opre.22.2.413

Mair P., De Leeuw J. (2015). Unidimensional scaling. In Wiley StatsRef: Statistics Reference Online, Wiley, New York. doi:10.1002/9781118445112.stat06462.pub2

McCormick, W.T., P.J. Schweitzer and T.W. White (1972): Problem decomposition and data reorganization by a clustering technique, Operations Research, 20(5), 993–1009. doi:10.1287/opre.20.5.993

Tenenbaum, J.B., de Silva, V. & Langford, J.C. (2000) A global network framework for nonlinear dimensionality reduction. Science 290, 2319-2323.

Tsafrir, D., Tsafrir, I., Ein-Dor, L., Zuk, O., Notterman, D.A. and Domany, E. (2005): Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices, Bioinformatics, 21(10) 2301–8. doi:10.1093/bioinformatics/bti329

Sammon, J. W. (1969) A non-linear mapping for data structure analysis. IEEE Trans. Comput., C-18 401–409.

See Also

Other seriation: register_DendSer(), register_GA(), register_optics(), register_smacof(), register_tsne(), register_umap(), registry_for_seriaiton_methods, seriate_best()

Examples

# Show available seriation methods (for dist and matrix)
list_seriation_methods()

# show the description for ARSA
get_seriation_method("dist", name = "ARSA")

### Seriate as distance matrix (for 50 flowers from the iris dataset)
data("iris")
x <- as.matrix(iris[-5])
x <- x[sample(nrow(x), size = 50), ]
d <- dist(x)

order <- seriate(d)
order

pimage(d, main = "Distances (Random Order)")
pimage(d, order, main = "Distances (Reordered)")

# Compare seriation quality
rbind(
        random = criterion(d),
        reordered = criterion(d, order)
     )

# Reorder the distance matrix
d_reordered <-  permute(d, order)
pimage(d_reordered, main = "Distances (Reordered)")


### Seriate a matrix (50 flowers from iris)

# To make the variables comparable, we scale the data
x <- scale(x, center = FALSE)

# The iris flowers are ordered by species in the data set
pimage(x, main = "original data", prop = FALSE)
criterion(x)

# Apply some methods
order <- seriate(x, method = "BEA_TSP")
pimage(x, order, main = "TSP to optimize ME", prop = FALSE)
criterion(x, order)

order <- seriate(x, method = "PCA")
pimage(x, order, main = "First principal component", prop = FALSE)
criterion(x, order)

order <- seriate(x, method = "heatmap")
pimage(x, order, main = "Heatmap seriation", prop = FALSE)
criterion(x, order)

# reorder the matrix
x_reordered <- permute(x, order)

# create a heatmap seriation manually by calculating
# distances between rows and between columns
order <- c(
    seriate(dist(x), method = "OLO"),
    seriate(dist(t(x)), method = "OLO")
)
pimage(x, order, main = "Heatmap seriation", prop = FALSE)
criterion(x, order)

### Seriate a correlation matrix
corr <- cor(x)

# plot in original order
pimage(corr, main = "Correlation matrix")

# reorder the correlation matrix using the angle of eigenvectors
pimage(corr, order = "AOE", main = "Correlation matrix (AOE)")

# we can also define a distance (we used d = sqrt(1 - r)) and
# then reorder the matrix (rows and columns) using any seriation method.
d <- as.dist(sqrt(1 - corr))
o <- seriate(d, method = "R2E")
corr_reordered <- permute(corr, order = c(o, o))
pimage(corr_reordered, main = "Correlation matrix (R2E)")

[Package seriation version 1.5.5 Index]