R: Simulated Data Generator and Inferential Comparison

dfba_sim_data {DFBA}

R Documentation

Simulated Data Generator and Inferential Comparison

Description

This function is designed to be called by other DFBA programs that compare frequentist and Bayesian power. The function generates simulated data for two conditions that can be from nine different probability models. The program also computes the frequentist p-value from a t-test on the generated data, and it computes the Bayesian posterior probability from a distribution-free analysis of the difference between the two conditions.

Usage

dfba_sim_data(
  n = 20,
  a0 = 1,
  b0 = 1,
  model,
  design,
  delta,
  shape1 = 1,
  shape2 = 1,
  block_max = 0
)

Arguments

`n`	Number of values per condition
`a0`	The first shape parameter for the prior beta distribution (default is 1). Must be positive and finite.
`b0`	The second shape parameter for the prior beta distribution (default is 1). Must be positive and finite.
`model`	Theoretical probability model for the data. One of `"normal"`, `"weibull"`, `"cauchy"`, `"lognormal"`, `"chisquare"`, `"logistic"`, `"exponential"`, `"gumbel"`, or `"pareto"`
`design`	Indicates the data structure. One of `"independent"` or `"paired"`.
`delta`	Theoretical mean difference between conditions; the second condition minus the first condition
`shape1`	The shape parameter for condition 1 for the distribution indicated by `model` input (default is 1)
`shape2`	The shape parameter for condition 2 for the distribution indicated by `model` input (default is 1)
`block_max`	The maximum size for a block effect (default is 0)

Details

Researchers need to make experimental-design decisions such as the choice of sample size per condition and whether to use a within-block design or an independent-group design. These planning issues arise regardless of whether one uses a frequentist or a Bayesian approach to statistical inference. The DFBA package includes a number of functions to help users with these decisions.

The dfba_sim_data() program is used along with other functions to assess the relative power for detecting a condition difference of an amount delta between two conditions. Delta is an input for the dfba_sim_data() program, and it must be a nonnegative value. Specifically, the dfba_sim_data() program generates two sets of data that are randomly drawn from one of nine different theoretical models. The input ‘model’ stipulates the data generating probability function. The input ‘model’ is one of the following strings:

"normal"
"weibull"
"cauchy"
"lognormal"
"chisquare"
"logistic"
"exponential"
"gumbel"
"pareto"

For each model there are n continuous scores randomly sampled for each condition, where n is a user-specified input value. The design argument is either "independent" or "paired", and it stipulates whether the two sets of scores are independent or come from common blocks, as in the case of two scores for the same person (i.e., one in each condition).

The shape1 and shape2 arguments are values for the shape parameter for the respective first and second condition, and their meaning depends on the probability model. For model = "normal", these parameters are the standard deviations of the two distributions. For model = "weibull", the parameters are the Weibull shape parameters. For model = "cauchy", the parameters are the scale factors for the Cauchy distributions. For model = "lognormal", the shape parameters are the standard deviations for log(X). For model = "chisquare", the parameters are the degrees of freedom (df) for the two distributions. For model = "logistic", the parameters are the scale factors for the distributions. For model = "exponential", the parameters are the rate parameters for the distributions.
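As a rough illustration of how the shape arguments line up with familiar generators, the sketch below uses base R sampling functions for three of the models. It is not the package's source code: the helper name draw_pair is hypothetical, and treating delta as an additive shift on the second condition is an assumption made for illustration.

```r
# Hypothetical helper (not part of DFBA): draw n scores per condition for
# three of the models. ASSUMPTION: delta acts as an additive shift on
# condition 2 (E); condition 1 (C) is unshifted.
draw_pair <- function(n, model, delta, shape1 = 1, shape2 = 1) {
  switch(model,
    # shape1/shape2 are standard deviations
    normal = list(C = rnorm(n, mean = 0, sd = shape1),
                  E = rnorm(n, mean = delta, sd = shape2)),
    # shape1/shape2 are Weibull shape parameters
    weibull = list(C = rweibull(n, shape = shape1, scale = 1),
                   E = delta + rweibull(n, shape = shape2, scale = 1)),
    # shape1/shape2 are Cauchy scale factors
    cauchy = list(C = rcauchy(n, location = 0, scale = shape1),
                  E = rcauchy(n, location = delta, scale = shape2)),
    stop("model not covered by this sketch")
  )
}

set.seed(1)
sim <- draw_pair(n = 50, model = "normal", delta = 0.4, shape1 = 1, shape2 = 4)
str(sim)
```

The remaining models could be added in the same way with rlnorm(), rchisq(), rlogis(), and rexp(), whose sdlog, df, scale, and rate arguments correspond to the shape parameters described above.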

For the Gumbel distribution, the variate for the second condition (E) is equal to delta - shape2*log(log(1/U)) where U is a random value sampled from the uniform distribution on the interval [.00001, .99999], and the variate for the first condition (C) is equal to -shape1*log(log(1/U)) where U is another value sampled from the same uniform distribution. The shape1 and shape2 arguments for model = "gumbel" are the scale parameters for the distributions.

The Pareto model is a distribution designed to account for income distributions as studied by economists (Pareto, 1897). For the Pareto distribution, the cumulative function is equal to 1-(x_m/x)^alpha where x is greater than x_m (Arnold, 1983). In the E condition, x_m = 1 + delta and in the C condition x_m = 1. In each condition, the alpha parameter is 1.16 times the corresponding shape parameter (shape1 for C and shape2 for E). Since the default value for each shape parameter is 1, the default alpha value is 1.16. When alpha = 1.16, the Pareto distribution approximates an income distribution that represents the 80-20 law where 20% of the population receives 80% of the income (Hardy, 2010).
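The Gumbel and Pareto formulas above can be sketched with inverse-transform sampling in base R. The function names below are hypothetical and this is not the package's own code. The Gumbel expression is taken from the text; the Pareto draw inverts the stated cumulative function, and reusing the same uniform interval for the Pareto case is an assumption.

```r
# Hypothetical sketch of inverse-transform sampling (not DFBA source code).
r_gumbel_sketch <- function(n, location, scale) {
  u <- runif(n, min = 0.00001, max = 0.99999)
  # Gumbel variate as given in the text: location - scale * log(log(1/U))
  location - scale * log(log(1 / u))
}

r_pareto_sketch <- function(n, x_m, alpha) {
  # ASSUMPTION: same uniform interval as in the Gumbel case
  u <- runif(n, min = 0.00001, max = 0.99999)
  # Solve u = 1 - (x_m / x)^alpha for x
  x_m / (1 - u)^(1 / alpha)
}

set.seed(2)
n <- 20
delta <- 0.4
shape1 <- 1
shape2 <- 1

C_gumbel <- r_gumbel_sketch(n, location = 0, scale = shape1)
E_gumbel <- r_gumbel_sketch(n, location = delta, scale = shape2)

C_pareto <- r_pareto_sketch(n, x_m = 1, alpha = 1.16 * shape1)
E_pareto <- r_pareto_sketch(n, x_m = 1 + delta, alpha = 1.16 * shape2)
```

Every Pareto draw is at least x_m, so the E scores in this sketch never fall below 1 + delta.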

The block_max argument provides for incorporating block effects in the random sampling. The block effect for each score is a separate effect for the block. The block effect B for a score is a random number drawn from a uniform distribution on the interval [0, block_max]. When design = "paired", the same random block effect is added to the score in the first condition, which is the random C value, and it is also added to the corresponding paired value for the E variate. Thus, the pairing research design eliminates the effect of block variation for the assessment of condition differences. When design = "independent", there are different block-effect contributions to the E and C variates, which reduces the discrimination of condition differences because it increases the variability of the difference in the two variates. The user can study the effect of the relative discriminability of detecting an effect of delta by adjusting the value of the block_max argument. The default for block_max is 0, but it can be altered to any non-negative real number.
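The effect of block variation on the two designs can be seen in a small base R sketch. This is an illustration under assumed settings (normal scores with unit standard deviation, block_max = 5), not the package's implementation.

```r
set.seed(3)
n <- 1000
delta <- 0.4
block_max <- 5

# Scores before any block effect is added
C0 <- rnorm(n, mean = 0, sd = 1)
E0 <- rnorm(n, mean = delta, sd = 1)

# Paired design: one block effect per pair, added to both scores
B <- runif(n, min = 0, max = block_max)
C_paired <- C0 + B
E_paired <- E0 + B

# Independent design: separate block effects for the C and E scores
C_indep <- C0 + runif(n, min = 0, max = block_max)
E_indep <- E0 + runif(n, min = 0, max = block_max)

sd(E0 - C0)              # spread of differences with no block effect
sd(E_paired - C_paired)  # block effect cancels within each pair
sd(E_indep - C_indep)    # extra block variability remains
```

In the paired case the common block effect cancels in each difference, whereas the independent case carries the additional variability from two unrelated block effects.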

Calling the dfba_sim_data() function produces two statistics that are based on the n scores generated in each of the two conditions. The first statistic is the frequentist p-value from a parametric t-test for rejecting the null hypothesis that delta <= 0. The p-value is the upper-tail probability of the sample t-statistic, from the paired t-test when design = "paired" or from the two-group t-test when design = "independent". The second statistic is the Bayesian posterior probability from one of two possible nonparametric tests. If design = "paired", the dfba_sim_data() function calls the dfba_wilcoxon() function to ascertain the posterior probability that phi_w > .5. If design = "independent", the dfba_sim_data() function calls the dfba_mann_whitney() function to estimate the posterior probability that omega_E > .5. The arguments a0 and b0 for the dfba_sim_data() function are passed along to either the dfba_wilcoxon() function or the dfba_mann_whitney() function. The default values are a0 = b0 = 1.
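For readers who want to see what an upper-tail p-value looks like in practice, the following base R sketch computes one with t.test() and checks the paired case by hand. It is an illustration only: the simulated scores are assumed normal, and the documentation above does not say whether the two-group test pools variances, so the independent-groups call simply uses R's default (Welch) setting.

```r
set.seed(4)
n <- 50
delta <- 0.4
C <- rnorm(n, mean = 0, sd = 1)
E <- rnorm(n, mean = delta, sd = 1)

# Upper-tail p-values for the null hypothesis that delta <= 0
p_paired <- t.test(E, C, paired = TRUE, alternative = "greater")$p.value
# ASSUMPTION: Welch test (R's default); the package may differ
p_indep <- t.test(E, C, alternative = "greater")$p.value

# Manual check of the paired case: upper tail of the t distribution
d <- E - C
t_stat <- mean(d) / (sd(d) / sqrt(n))
p_manual <- pt(t_stat, df = n - 1, lower.tail = FALSE)

all.equal(p_paired, p_manual)
```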

Value

A list containing the following components:

`pvalue`	The upper-tail probability of the sample t value for the test that delta <= 0
`prH1`	Bayesian posterior probability either for the hypothesis that phi_w > .5 from the nonparametric Wilcoxon test when `design = "paired"` or for the hypothesis that omega_E > .5 from the Mann-Whitney test when `design = "independent"`
`C`	Vector of length n of simulated values for condition 1
`E`	Vector of length n of simulated values for condition 2
`design`	The data structure indicated by the `design` argument. One of `"independent"` or `"paired"`.

Note

Random samples for both the Gumbel and the Pareto distributions are generated by the dfba_sim_data() function using the inverse transform method (Fishman, 1996).

References

Arnold, B. C. (1983). Pareto Distribution. Fairland, MD: International Cooperative Publishing House.

Chechile, R. A. (2017). A Bayesian analysis for the Wilcoxon signed-rank statistic. Communications in Statistics - Theory and Methods, https://doi.org/10.1080/03610926.2017.1388402.

Chechile, R. A. (2020). A Bayesian analysis for the Mann-Whitney statistic. Communications in Statistics - Theory and Methods, https://doi.org/10.1080/03610926.2018.1549247.

Fishman, G. S. (1996). Monte Carlo: Concepts, Algorithms and Applications. New York: Springer.

Hardy, M. (2010). Pareto's Law. Mathematical Intelligencer, 32, 38-43.

Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions, Vol. 1. New York: Wiley.

Pareto, V. (1897). Cours d'Economie Politique. Vol. 2, Lausanne: F. Rouge.

Examples


# Example of two paired normal distributions where the s.d. of the two
# conditions are 1 and 4.

dfba_sim_data(n = 50,
              model = "normal",
              design = "paired",
              delta = .4,
              shape1 = 1,
              shape2 = 4)

# Example of two independent Weibull variates with their shape parameters =.8
# and with a .25 offset

dfba_sim_data(n = 80,
              model = "weibull",
              design = "independent",
              delta = .25,
              shape1 = .8,
              shape2 = .8)

# Example of two independent Weibull variates with their shape
# parameters = .8 and with a .25 offset along with some block differences
# with the max block effect being 1.5

dfba_sim_data(n = 80,
              model = "weibull",
              design = "independent",
              delta = .25,
              shape1 = .8,
              shape2 = .8,
              block_max = 1.5)

# Example of two paired Cauchy variates with a .4 offset

dfba_sim_data(n = 50,
              model = "cauchy",
              design = "paired",
              delta = .4)

# Example of two paired Cauchy variates with a .4 offset where the Bayesian
# analysis uses the Jeffreys prior

dfba_sim_data(n = 50,
              a0 = .5,
              b0 = .5,
              model = "cauchy",
              design = "paired",
              delta = .4)

[Package DFBA version 0.1.0 Index]