R: Feature extraction via multidimensional scaling

seq2feature_mds {ProcData}

R Documentation

Feature extraction via multidimensional scaling

Description

seq2feature_mds extracts K features from response processes by multidimensional scaling.

Usage

seq2feature_mds(seqs = NULL, K = 2, method = "auto",
  dist_type = "oss_action", pca = TRUE, subset_size = 100,
  subset_method = "random", n_cand = 10, return_dist = FALSE,
  L_set = 1:3)

Arguments

`seqs`	a `"proc"` object or a square matrix. If a square matrix is provided, it is treated as the dissimilarity matrix of a group of response processes.
`K`	the number of features to be extracted.
`method`	a character string specifying the algorithm used for performing MDS. See 'Details'.
`dist_type`	a character string specifying the dissimilarity measure for two response processes. See 'Details'.
`pca`	logical. If `TRUE` (default), the principal components of the extracted features are returned.
`subset_size`, `n_cand`	two parameters used in the large data algorithm. See 'Details' and `seq2feature_mds_large`.
`subset_method`	a character string specifying the method for choosing the subset in the large data algorithm. See 'Details' and `seq2feature_mds_large`.
`return_dist`	logical. If `TRUE`, the dissimilarity matrix will be returned. Default is `FALSE`.
`L_set`	the lengths of the n-grams considered.

Details

Classical MDS has a computational complexity of order n^3, where n is the number of response processes, so it is computationally expensive when n is large. In addition, storing an n x n dissimilarity matrix requires a large amount of memory when n is large. To handle large datasets, seq2feature_mds implements the algorithm proposed in Paradis (2018). The argument method specifies the algorithm used to obtain the MDS features. If method = "small", classical MDS is performed by calling cmdscale. If method = "large", the algorithm for large datasets is used. If method = "auto" (default), seq2feature_mds selects the algorithm automatically based on the sample size.
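
As a rough illustration of the classical MDS step, the following base R sketch calls cmdscale on a toy dissimilarity matrix. It is not the package's internal code: the random points `x` are a hypothetical stand-in for response processes, and Euclidean distance replaces the oss dissimilarity so that the example runs without ProcData.

```r
# Classical MDS on a small toy dissimilarity matrix (base R only).
set.seed(1)
n <- 20  # number of toy "response processes"
K <- 2   # number of features to extract

# Hypothetical stand-in data: n random points in K dimensions.
x <- matrix(rnorm(n * K), nrow = n, ncol = K)

# Full n x n dissimilarity matrix; its storage grows as n^2.
dist_mat <- as.matrix(dist(x))

# cmdscale() performs classical MDS and returns an n x K coordinate matrix.
theta <- cmdscale(dist_mat, k = K)

# For Euclidean input of dimension K, the distances among the extracted
# features reproduce the original dissimilarities up to numerical error.
max_err <- max(abs(as.matrix(dist(theta)) - dist_mat))
```

Because the full matrix `dist_mat` has to be held in memory and decomposed, this approach becomes impractical as n grows, which is the motivation for the large data algorithm.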

dist_type specifies the dissimilarity to be used for measuring the discrepancy between two response processes. If dist_type = "oss_action", the order-based sequence similarity (oss) proposed in Gomez-Alonso and Valls (2008) is used for action sequences. If dist_type = "oss_both", both action sequences and timestamp sequences are used to compute a time-weighted oss.

The number of features to be extracted, K, can be selected by cross-validation using chooseK_mds.

Value

seq2feature_mds returns a list containing

`theta`	a numeric matrix giving the `K` extracted features or principal features. Each column is a feature.
`dist_mat`	the dissimilarity matrix. This element exists only if `return_dist = TRUE`.
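
To make "principal features" concrete, the base R sketch below rotates a set of MDS coordinates onto their principal components with prcomp. This is an assumption about what `pca = TRUE` amounts to, not a copy of the package's code; the toy data and the names `theta_raw`, `feature_var`, `offdiag` and `rot_err` are hypothetical.

```r
# Rotate MDS features onto their principal components (base R only).
set.seed(2)
n <- 30
K <- 3

# Hypothetical toy data with a different spread in each direction.
x <- matrix(rnorm(n * K), nrow = n) %*% diag(c(3, 2, 1))
dist_mat <- as.matrix(dist(x))

# Raw MDS features, then their principal component scores.
theta_raw <- cmdscale(dist_mat, k = K)
theta <- prcomp(theta_raw, center = TRUE, scale. = FALSE)$x

# Principal features are ordered by decreasing variance ...
feature_var <- apply(theta, 2, var)

# ... are mutually uncorrelated ...
offdiag <- max(abs(cov(theta)[upper.tri(diag(K))]))

# ... and keep all pairwise distances, since PCA only centers and rotates.
rot_err <- max(abs(as.matrix(dist(theta)) - dist_mat))
```

Under this reading, the rotation changes only the coordinate system of the features, so any analysis that depends on distances between response processes is unaffected.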

References

Gomez-Alonso, C. and Valls, A. (2008). A similarity measure for sequences of categorical data based on the ordering of common elements. In V. Torra & Y. Narukawa (Eds.), Modeling Decisions for Artificial Intelligence (pp. 134-145). Springer Berlin Heidelberg.

Paradis, E. (2018). Multidimensional scaling with very large datasets. Journal of Computational and Graphical Statistics, 27(4), 935-939.

Tang, X., Wang, Z., He, Q., Liu, J., and Ying, Z. (2020). Latent feature extraction for process data via multidimensional scaling. Psychometrika, 85, 378-397.

Examples

library(ProcData)

n <- 50
set.seed(12345)
# simulate n response processes
seqs <- seq_gen(n)
# extract K = 5 features by MDS
theta <- seq2feature_mds(seqs, 5)$theta

[Package ProcData version 0.3.2 Index]