generatedata_mpin {PINstimation}    R Documentation

Simulation of MPIN model data

Description

Generates a dataset object or a data.series object (a list of dataset objects) storing the simulation parameters as well as the aggregate daily buys and sells simulated following the assumptions of the MPIN model of Ersan (2016).

Usage

generatedata_mpin(series = 1, days = 60, layers = NULL,
                         parameters = NULL, ranges = list(), ...,
                         verbose = TRUE)

Arguments

series

The number of datasets to generate. Default value is 1.

days

The number of trading days for which aggregated buys and sells are generated. Default value is 60.

layers

The number of information layers to be included in the simulated data. Default value is NULL. If layers is omitted or set to NULL, the number of layers is uniformly selected from the set {1, ..., maxlayers}.

parameters

A vector of model parameters of size 3J+2, where J is the number of information layers. It has the following form: {\alpha_1, ..., \alpha_J, \delta_1, ..., \delta_J, \mu_1, ..., \mu_J, \epsilon_b, \epsilon_s}.
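The layout of the parameter vector can be illustrated with a short base R sketch. The helper split_mpin_params() is a hypothetical name introduced here for illustration only; it is not part of PINstimation. It deduces J from the vector length and splits the vector into its components.

```r
# Hypothetical helper (not part of PINstimation): split a parameter vector
# of length 3J + 2 into its named components.
split_mpin_params <- function(parameters) {
  n <- length(parameters)
  # The length must be 3J + 2 for some integer J >= 1.
  if (n < 5 || (n - 2) %% 3 != 0) {
    stop("'parameters' must have length 3J + 2 with J >= 1")
  }
  J <- (n - 2) / 3
  list(
    layers = J,
    alpha  = parameters[1:J],
    delta  = parameters[(J + 1):(2 * J)],
    mu     = parameters[(2 * J + 1):(3 * J)],
    eps.b  = parameters[3 * J + 1],
    eps.s  = parameters[3 * J + 2]
  )
}

# A vector of length 8 corresponds to J = 2 layers.
p <- split_mpin_params(c(0.4, 0.2, 0.9, 0.1, 400, 700, 300, 200))
```

Here p$layers is 2, p$alpha holds the first two entries, and p$eps.b and p$eps.s hold the last two.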

ranges

A list of ranges for the different simulation parameters, with named elements alpha, delta, eps.b, eps.s, and mu. The value of each element is a vector of two numbers: the first one is the minimal value min_v and the second one is the maximal value max_v. For each parameter with a specified range, the simulation parameter is uniformly drawn from the interval (min_v, max_v). If the element corresponding to a given parameter is missing, the default range for that parameter is used. If the argument ranges is an empty list and parameters is NULL, the default ranges are used for all parameters. The default value is list().
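As a rough illustration of how such ranges can be turned into parameter values, the base R sketch below draws one value per named element. The helper draw_from_range() is hypothetical and does not reproduce the package's internal code; it also treats a single number as a fixed value, as in the Examples section.

```r
# Hypothetical helper (not the package's internal code): draw one value
# from a range given either as c(min_v, max_v) or as a single fixed value.
draw_from_range <- function(range) {
  if (length(range) == 1) return(range)   # fixed value, nothing to draw
  stopifnot(length(range) == 2, range[1] <= range[2])
  runif(1, min = range[1], max = range[2])
}

set.seed(123)
ranges <- list(alpha = 0.4, delta = c(0.2, 0.7), mu = c(8000, 12000))
drawn <- lapply(ranges, draw_from_range)
```

The element drawn$alpha stays at 0.4, while drawn$delta and drawn$mu fall inside their respective intervals.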

...

Additional arguments passed on to the function generatedata_mpin(). The recognized arguments are confidence, maxlayers, eps_ratio, mu_ratio.

  • confidence (numeric) denotes the range of the confidence interval associated with each layer such that all observations within the layer j lie in the theoretical confidence interval of the Skellam distribution centered on the mean order imbalance, at the level 'confidence'. The default value is 0.99.

  • maxlayers (integer) denotes the upper limit of the number of layers for the generated datasets. If the argument layers is missing, the number of layers of each simulated dataset is uniformly drawn from {1, ..., maxlayers}. When missing, maxlayers takes the default value of 5.

  • eps_ratio (numeric) specifies the admissible range for the value of the ratio \epsilon_s/\epsilon_b. It can be a two-value vector or a single value. If eps_ratio is a vector of two values, the first one is the minimal value and the second one is the maximal value, and the function tries to generate \epsilon_s and \epsilon_b such that the ratio \epsilon_s/\epsilon_b lies within the interval eps_ratio. If eps_ratio is a single number, the function tries to generate \epsilon_s and \epsilon_b satisfying \epsilon_s = \epsilon_b x eps_ratio. If this range conflicts with other arguments such as ranges, a warning is displayed. The default value is c(0.75, 1.25).

  • mu_ratio (numeric) is the minimal value of the ratio between two consecutive values of the vector mu. For example, if mu_ratio = 1.25, then \mu_{j+1} should be larger than 1.25 * \mu_j for all j = 1, ..., J-1. If mu_ratio conflicts with other arguments such as ranges or confidence, a warning is displayed. The default value is NULL.
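The two ratio conditions above can be checked on a candidate parameter set with a few lines of base R. The functions check_eps_ratio() and check_mu_ratio() are hypothetical names used only for this sketch, and the sketch assumes that the values of mu are positive and listed in increasing order.

```r
# Hypothetical checks (not package code) for the two ratio conditions.
check_eps_ratio <- function(eps.b, eps.s, eps_ratio = c(0.75, 1.25)) {
  ratio <- eps.s / eps.b
  if (length(eps_ratio) == 1) {
    # Single value: the ratio must match it (up to numerical tolerance).
    isTRUE(all.equal(ratio, eps_ratio))
  } else {
    # Two values: the ratio must lie within [min, max].
    ratio >= eps_ratio[1] && ratio <= eps_ratio[2]
  }
}

check_mu_ratio <- function(mu, mu_ratio) {
  if (length(mu) < 2) return(TRUE)   # one layer has no consecutive pair
  # Each mu_{j+1} must exceed mu_ratio * mu_j.
  all(mu[-1] / mu[-length(mu)] > mu_ratio)
}

eps_ok <- check_eps_ratio(eps.b = 300, eps.s = 330)
mu_ok  <- check_mu_ratio(mu = c(400, 700), mu_ratio = 1.25)
```

With these inputs, both checks pass: 330/300 = 1.1 lies within (0.75, 1.25), and 700/400 = 1.75 exceeds 1.25.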

verbose

(logical) a binary variable that determines whether detailed information about the progress of the data generation is displayed. No output is produced when verbose is set to FALSE. The default value is TRUE.

Details

An information layer refers to a given type of information event existing in the data. The PIN model assumes a single type of information event, characterized by three parameters: \alpha, \delta, and \mu. The MPIN model relaxes this assumption by lifting the restriction on the number of information event types. When layers = 1, the generated data fit the assumptions of the PIN model.

If the argument parameters is missing, then the simulation parameters are generated using the ranges specified in the argument ranges. If the argument ranges is list(), default ranges are used. Using the default ranges, the simulation parameters are obtained using the following procedure:

Based on the simulation parameters, daily buys and sells are generated under the assumption that buys and sells follow Poisson distributions with mean parameters (\epsilon_b, \epsilon_s) on days with no information, (\epsilon_b + \mu_j, \epsilon_s) on days with good information of layer j, and (\epsilon_b, \epsilon_s + \mu_j) on days with bad information of layer j.
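The data-generating process described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: simulate_mpin_days() is a hypothetical name, and the sketch assumes that \alpha_j is the probability of an information event of layer j on a given day and that \delta_j is the probability that such an event is bad news.

```r
# Minimal base-R sketch of the data-generating process described above.
# simulate_mpin_days() is a hypothetical name, not a PINstimation function.
simulate_mpin_days <- function(days, alpha, delta, mu, eps.b, eps.s) {
  J <- length(alpha)
  stopifnot(length(delta) == J, length(mu) == J, sum(alpha) <= 1)
  # Layer of each day: 0 = no information, j = information event of layer j.
  layer <- sample(0:J, days, replace = TRUE, prob = c(1 - sum(alpha), alpha))
  informed <- layer > 0
  # Assumption: delta_j is the probability that a layer-j event is bad news.
  bad <- rep(FALSE, days)
  bad[informed] <- runif(sum(informed)) < delta[layer[informed]]
  # Excess arrival rate mu_j on information days, 0 otherwise.
  excess <- numeric(days)
  excess[informed] <- mu[layer[informed]]
  # Informed trading adds mu_j to buys on good days and to sells on bad days.
  data.frame(
    layer = layer,
    b = rpois(days, eps.b + ifelse(informed & !bad, excess, 0)),
    s = rpois(days, eps.s + ifelse(informed & bad, excess, 0))
  )
}

set.seed(1)
sim <- simulate_mpin_days(days = 60, alpha = c(0.4, 0.2), delta = c(0.9, 0.1),
                          mu = c(400, 700), eps.b = 300, eps.s = 200)
```

The result has one row per trading day, with the simulated layer and the daily buys (b) and sells (s).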

Considerations for the ranges of simulation parameters: While the function generatedata_mpin() enables the user to simulate data series with any set of theoretical parameters, we strongly recommend the use of parameter sets satisfying the conditions below, which are in line with the nature of empirical data and the theoretical models used within this package. When parameter values are not assigned by the user, the function, by default, simulates data series that are in line with these criteria.

Value

Returns an object of class dataset if series=1, and an object of class data.series if series>1.

References

Cheng T, Lai H (2021). “Improvements in estimating the probability of informed trading models.” Quantitative Finance, 21(5), 771-796.

Ersan O (2016). “Multilayer Probability of Informed Trading.” Available at SSRN 2874420.

Examples

# ------------------------------------------------------------------------ #
# There are different scenarios of using the function generatedata_mpin()  #
# ------------------------------------------------------------------------ #

# With no arguments, the function generates one dataset object spanning
# 60 days, containing a number of information layers uniformly selected
# from `{1, 2, 3, 4, 5}`, and where the parameters are chosen as
# described in the details.

sdata <- generatedata_mpin()

# The number of layers can be deduced from the simulation parameters, if
# fed directly to the function generatedata_mpin() through the argument
# 'parameters'. In this case, the output is a dataset object with one
# information layer.

givenpoint <- c(0.4, 0.1, 800, 300, 200)
sdata <- generatedata_mpin(parameters = givenpoint)

# The number of layers can alternatively be set directly through the
# argument 'layers'.

sdata <- generatedata_mpin(layers = 2)

# The simulation parameters can be randomly drawn from their corresponding
# ranges fed through the argument 'ranges'.

sdata <- generatedata_mpin(ranges = list(alpha = c(0.1, 0.7),
                                        delta = c(0.2, 0.7),
                                        mu = c(3000, 5000)))

# A given simulation parameter can be fixed at a specific value by giving
# its range as a single value, instead of a pair of values.

sdata <- generatedata_mpin(ranges = list(alpha = 0.4, delta = c(0.2, 0.7),
                                        eps.b = c(100, 7000),
                                        mu = c(8000, 12000)))

# If both arguments 'parameters' and 'layers' are provided, and the number
# of layers detected from the length of the argument 'parameters' differs
# from the argument 'layers', the number detected from 'parameters' is used
# and a warning is displayed.

sim.params <- c(0.4, 0.2, 0.9, 0.1, 400, 700, 300, 200)
sdata <- generatedata_mpin(days = 120, layers = 3, parameters = sim.params)

# Display the details of the generated data

show(sdata)

# ------------------------------------------------------------------------ #
# Use generatedata_mpin() to compare the accuracy of estimation methods    #
# ------------------------------------------------------------------------ #

# The example below illustrates the use of the function 'generatedata_mpin()'
# to compare the accuracy of the functions 'mpin_ml()', and 'mpin_ecm()'.

# The example depends on three variables:
# n: the number of datasets used
# l: the number of layers in each simulated dataset
# xc: the number of extra clusters used in initials_mpin

# For consideration of speed, we will set n = 2, l = 2, and xc = 2
# These numbers can change to fit the user's preferences
n <- l <- xc <- 2

# We start by generating n datasets simulated according to the
# assumptions of the MPIN model.

dataseries <- generatedata_mpin(series = n, layers = l, verbose = FALSE)

# Store the estimates in two different lists: 'mllist', and 'ecmlist'

mllist <- lapply(dataseries@datasets, function(x)
  mpin_ml(x@data, xtraclusters = xc, layers = l, verbose = FALSE))

ecmlist <- lapply(dataseries@datasets, function(x)
  mpin_ecm(x@data, xtraclusters = xc, layers = l, verbose = FALSE))

# For each estimate, we calculate the absolute difference between the
# estimated mpin and the empirical mpin computed using dataset parameters.
# The absolute differences are stored in 'mldpin' ('ecmdpin') for the
# ML (ECM) method.

mldpin <- sapply(1:n,
 function(x) abs(mllist[[x]]@mpin - dataseries@datasets[[x]]@emp.pin))

ecmdpin <- sapply(1:n,
 function(x) abs(ecmlist[[x]]@mpin - dataseries@datasets[[x]]@emp.pin))

# Similarly, we obtain vectors of running times for both estimation methods.
# They are stored in 'mltime' ('ecmtime') for the ML (ECM) method.

mltime <- sapply(mllist, function(x) x@runningtime)
ecmtime <- sapply(ecmlist, function(x) x@runningtime)

# Finally, we calculate the average absolute deviation from the empirical
# MPIN as well as the average running time for both methods. This allows us
# to compare them in terms of accuracy and speed.

accuracy <- c(mean(mldpin), mean(ecmdpin))
timing <- c(mean(mltime), mean(ecmtime))
comparison <- as.data.frame(rbind(accuracy, timing))
colnames(comparison) <- c("ML", "ECM")
rownames(comparison) <- c("Accuracy", "Timing")

show(round(comparison, 6))


[Package PINstimation version 0.1.2 Index]