R: Create a Synthetic Data Stream

synthetic_stream {rEMM}

R Documentation

Create a Synthetic Data Stream

Description

This function creates a synthetic data stream with data points in roughly [0, 1]^d by choosing points from k clusters following a sequence through these clusters. Each cluster has a density function following a d-dimensional normal distribution. Outliers are introduced in the test set.

Usage

synthetic_stream(k = 10, d = 2, n_subseq = 100, p_transition = 0.5, p_swap = 0,
n_train = 5000, n_test = 1000, p_outlier = 0.01, rangeVar = c(0, 0.005))

Arguments

`k`	number of clusters.
`d`	dimensionality of data set.
`n_subseq`	length of the subsequence which will be repeated to create the data set.
`p_transition`	probability that the next position in the subsequence will belong to a different cluster.
`p_swap`	probability that two data points are swapped. This represents measurement errors (e.g., data points arrive out of order) or that the data stream does not exactly follow the subsequence.
`n_train`	size of training set (without outliers).
`n_test`	size of test set (with outliers).
`p_outlier`	probability that a data point is replaced by an outlier (a randomly chosen point in `[0,1]^d`).
`rangeVar`	Used to create the random covariance matrices for the clusters. See `genPositiveDefMat()` in clusterGeneration for details.

Details

The data generation process creates a data set consisting of k clusters in roughly [0,1]^d. The data points for each cluster are drawn from a multivariate normal distribution with a random mean and a random variance/covariance matrix for each cluster. The temporal aspect is modeled by a fixed subsequence (of length n_subseq) through the k clusters. At each step in the subsequence, the next data point moves to a randomly chosen other cluster with probability p_transition and otherwise stays in the same cluster, so we can create slowly or quickly changing data. For the complete sequence, the subsequence is repeated to create n_train and n_test data points. The data set is generated by drawing a data point from the cluster corresponding to each position in the sequence. Outliers are introduced by replacing data points in the test set with probability p_outlier by randomly chosen data points in [0,1]^d. Finally, to introduce imperfection in the temporal sequence (e.g., because the data does not follow exactly a repeating sequence or because observations do not arrive in the correct order), we swap two consecutive observations with probability p_swap.
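
The process described above can be sketched in base R. This is only an illustration under stated assumptions, not the rEMM source: the function name `sketch_stream` and its return element names are hypothetical, a single argument `n` stands in for `n_train`/`n_test`, and the random covariance matrices are built from a random rotation with variances drawn uniformly from `rangeVar` instead of calling `genPositiveDefMat()`.

```r
# Illustrative sketch only; NOT the rEMM implementation.
# sketch_stream() and its return element names are hypothetical.
sketch_stream <- function(k = 10, d = 2, n_subseq = 100, p_transition = 0.5,
                          p_swap = 0, n = 1000, p_outlier = 0.01,
                          rangeVar = c(0, 0.005)) {
  stopifnot(k >= 2, d >= 1, n_subseq >= 1, n >= 0)

  # 1. Fixed subsequence through the k clusters.
  subseq <- integer(n_subseq)
  subseq[1] <- sample.int(k, 1)
  for (i in seq_len(n_subseq)[-1]) {
    if (runif(1) < p_transition) {
      # move to a randomly chosen other cluster
      others <- setdiff(seq_len(k), subseq[i - 1])
      subseq[i] <- others[sample.int(length(others), 1)]
    } else {
      # stay in the same cluster
      subseq[i] <- subseq[i - 1]
    }
  }

  # 2. Repeat the subsequence until n positions are available.
  sequence <- rep_len(subseq, n)

  # 3. Random cluster model: mean in [0, 1]^d and covariance Q diag(v) Q^T
  #    (assumption: stand-in for clusterGeneration::genPositiveDefMat()).
  model <- lapply(seq_len(k), function(j) {
    list(mu = runif(d),
         Q  = qr.Q(qr(matrix(rnorm(d * d), d, d))),
         sd = sqrt(runif(d, rangeVar[1], rangeVar[2])))
  })

  # 4. Draw one point from the cluster at each position in the sequence.
  x <- matrix(0, nrow = n, ncol = d)
  for (i in seq_len(n)) {
    m <- model[[sequence[i]]]
    x[i, ] <- m$mu + as.vector(m$Q %*% (m$sd * rnorm(d)))
  }

  # 5. Replace points by uniform outliers with probability p_outlier.
  outlier <- runif(n) < p_outlier
  if (any(outlier)) x[outlier, ] <- runif(sum(outlier) * d)

  # 6. Swap consecutive observations with probability p_swap.
  #    (Choice made in this sketch: the labels travel with the points.)
  swap <- which(runif(max(n - 1, 0)) < p_swap)
  for (i in swap) {
    x[c(i, i + 1), ] <- x[c(i + 1, i), ]
    sequence[c(i, i + 1)] <- sequence[c(i + 1, i)]
    outlier[c(i, i + 1)] <- outlier[c(i + 1, i)]
  }

  list(data = x, sequence = sequence, outlier = outlier, swap = swap)
}

set.seed(1)
s <- sketch_stream(n = 500, p_outlier = 0.05, p_swap = 0.1)
dim(s$data)
```

With `p_swap = 0`, the returned sequence repeats exactly every `n_subseq` positions, which is the temporal structure the function is meant to produce.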

Value

A list with the following elements:

`test`	test data.
`train`	training data.
`sequence_test`	sequence of the test data points through the clusters.
`sequence_train`	sequence of the training data points through the clusters.
`swap_test`	indices where points are swapped in the test data.
`swap_train`	indices where points are swapped in the training data.
`outlier_position`	logical vector for outliers in test data.
`model`	centers and covariance matrices for the clusters.
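
As a hedged illustration of how the returned list might be inspected, the following snippet uses only the element names documented above. It assumes the rEMM package is installed and does nothing otherwise; the seed value is arbitrary.

```r
# Runs only if rEMM is available; otherwise the block is skipped.
if (requireNamespace("rEMM", quietly = TRUE)) {
  set.seed(42)
  ds <- rEMM::synthetic_stream(n_train = 0)

  # top-level structure of the returned list
  str(ds, max.level = 1)

  # number of test points assigned to each cluster
  print(table(ds$sequence_test))

  # observed share of outliers; compare with p_outlier
  print(mean(ds$outlier_position))
}
```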

Examples

## create only test data (with outliers)
ds <- synthetic_stream(n_train = 0)

## plot test data
plot(ds$test, pch = ds$sequence_test, col = "gray")
text(ds$model$mu[, 1], ds$model$mu[, 2], 1:10)

## mark outliers
points(ds$test[ds$outlier_position, ],
  pch = 3, lwd = 2, col = "red")

[Package rEMM version 1.2.1 Index]