seq2feature_mds {ProcData}    R Documentation

Feature extraction via multidimensional scaling

Description

seq2feature_mds extracts K features from response processes by multidimensional scaling.

Usage

seq2feature_mds(seqs = NULL, K = 2, method = "auto",
  dist_type = "oss_action", pca = TRUE, subset_size = 100,
  subset_method = "random", n_cand = 10, return_dist = FALSE,
  L_set = 1:3)

Arguments

seqs

a "proc" object or a square matrix. If a square matrix is provided, it is treated as the dissimilarity matrix of a group of response processes.

K

the number of features to be extracted.

method

a character string specifying the algorithm used for performing MDS. See 'Details'.

dist_type

a character string specifying the dissimilarity measure for two response processes. See 'Details'.

pca

logical. If TRUE (default), the principal components of the extracted features are returned.

subset_size, n_cand

two parameters used in the large data algorithm. See 'Details' and seq2feature_mds_large.

subset_method

a character string specifying the method for choosing the subset in the large data algorithm. See 'Details' and seq2feature_mds_large.

return_dist

logical. If TRUE, the dissimilarity matrix will be returned. Default is FALSE.

L_set

the lengths of the n-grams considered. Default is 1:3.

Details

Classical MDS has a computational complexity of order n^3, where n is the number of response processes, so it is computationally expensive when a large number of response processes is considered. In addition, storing an n x n dissimilarity matrix requires a large amount of memory when n is large. To obtain MDS for large datasets, seq2feature_mds implements the algorithm proposed in Paradis (2018).

method specifies the algorithm to be used for obtaining MDS features. If method = "small", classical MDS is used by calling cmdscale. If method = "large", the algorithm for large datasets is used. If method = "auto" (default), seq2feature_mds selects the algorithm automatically based on the sample size.
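The classical MDS step behind method = "small" can be sketched in base R. This is a minimal illustration, not the package's implementation: the toy points, the Euclidean distances standing in for process dissimilarities, and the use of prcomp to mimic pca = TRUE are all assumptions made for this example.

```r
# Toy data: 6 points in the plane. Their Euclidean distances stand in for
# the dissimilarities between response processes (assumption for illustration).
set.seed(1)
pts <- matrix(rnorm(12), nrow = 6)
dist_mat <- as.matrix(dist(pts))

# Classical MDS with K = 2 features via stats::cmdscale.
K <- 2
theta <- cmdscale(dist_mat, k = K)

# Optional rotation to principal components, in the spirit of pca = TRUE
# (assumed here to be a centered, unscaled PCA).
theta_pc <- prcomp(theta, center = TRUE, scale. = FALSE)$x

dim(theta_pc)  # one row per process, one column per feature

# Memory needed just to store an n x n double-precision matrix, in GiB.
n <- 1e5
n^2 * 8 / 2^30  # roughly 74.5
```

The last line shows why a full dissimilarity matrix becomes impractical for large n, which is the motivation for the large data algorithm.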

dist_type specifies the dissimilarity to be used for measuring the discrepancy between two response processes. If dist_type = "oss_action", the order-based sequence similarity (oss) proposed in Gomez-Alonso and Valls (2008) is used for action sequences. If dist_type = "oss_both", both action sequences and timestamp sequences are used to compute a time-weighted oss.

The number of features to be extracted K can be selected by cross-validation using chooseK_mds.

Value

seq2feature_mds returns a list containing

theta

a numeric matrix giving the K extracted features or principal features. Each column is a feature.

dist_mat

the dissimilarity matrix. This element exists only if return_dist = TRUE.

References

Gomez-Alonso, C. and Valls, A. (2008). A similarity measure for sequences of categorical data based on the ordering of common elements. In V. Torra & Y. Narukawa (Eds.), Modeling Decisions for Artificial Intelligence (pp. 134-145). Springer Berlin Heidelberg.

Paradis, E. (2018). Multidimensional scaling with very large datasets. Journal of Computational and Graphical Statistics, 27(4), 935-939.

Tang, X., Wang, Z., He, Q., Liu, J., and Ying, Z. (2020). Latent Feature Extraction for Process Data via Multidimensional Scaling. Psychometrika, 85, 378-397.

See Also

chooseK_mds for choosing K.

Other feature extraction methods: aseq2feature_seq2seq, atseq2feature_seq2seq, seq2feature_mds_large, seq2feature_ngram, seq2feature_seq2seq, tseq2feature_seq2seq

Examples

n <- 50
set.seed(12345)
seqs <- seq_gen(n)
theta <- seq2feature_mds(seqs, 5)$theta
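
A further sketch, using only the arguments listed under 'Usage', shows how return_dist = TRUE can be combined with passing a square matrix as seqs, so that the dissimilarities are computed once and reused. The object names res and theta3 are chosen for this illustration only.

```r
library(ProcData)

n <- 50
set.seed(12345)
seqs <- seq_gen(n)

# Extract 5 features and keep the dissimilarity matrix.
res <- seq2feature_mds(seqs, K = 5, return_dist = TRUE)
dim(res$theta)
dim(res$dist_mat)

# Reuse the stored matrix to extract a different number of features
# without recomputing the dissimilarities.
theta3 <- seq2feature_mds(res$dist_mat, K = 3)$theta
dim(theta3)
```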

[Package ProcData version 0.3.2 Index]