R: Generation of Correlated Ordinal, Continuous (mixture and...

corrvar {SimCorrMix}

R Documentation

Generation of Correlated Ordinal, Continuous (mixture and non-mixture), and/or Count (Poisson and Negative Binomial, regular and zero-inflated) Variables: Correlation Method 1

Description

This function simulates k_cat ordinal (r \ge 2 categories), k_cont continuous non-mixture, k_mix continuous mixture, k_pois Poisson (regular and zero-inflated), and/or k_nb Negative Binomial (regular and zero-inflated) variables with a specified correlation matrix rho. The variables are generated from multivariate normal variables with intermediate correlation matrix Sigma, calculated by intercorr, and then transformed. The intermediate correlations involving count variables are determined using correlation method 1. The ordering of the variables in rho must be 1st ordinal, 2nd continuous non-mixture, 3rd components of the continuous mixture, 4th regular Poisson, 5th zero-inflated Poisson, 6th regular NB, and 7th zero-inflated NB. Note that it is possible for k_cat, k_cont, k_mix, k_pois, and/or k_nb to be 0. Simulation occurs at the component-level for continuous mixture distributions. The target correlation matrix is specified in terms of correlations with components of continuous mixture variables. There are no parameter input checks in order to decrease simulation time. All inputs should be checked prior to simulation with validpar and validcorr. Summaries for the simulation results can be obtained with summary_var.

All continuous variables are simulated using either Fleishman's third-order (method = "Fleishman", doi: 10.1007/BF02293811) or Headrick's fifth-order (method = "Polynomial", doi: 10.1016/S0167-9473(02)00072-5) power method transformation. It works by matching standardized cumulants – the first four (mean, variance, skew, and standardized kurtosis) for Fleishman's method, or the first six (mean, variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick's method. The transformation is expressed as follows:

Y = c_0 + c_1 * Z + c_2 * Z^2 + c_3 * Z^3 + c_4 * Z^4 + c_5 * Z^5, Z \sim N(0,1),

where c_4 and c_5 both equal 0 for Fleishman's method. The real constants are calculated by
find_constants. Continuous mixture variables are generated componentwise and then transformed to the desired mixture variables based on random multinomial variables generated from the mixing probabilities. Ordinal variables (r \ge 2 categories) are generated by discretizing the standard normal variables at quantiles. These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative probabilities defined by each variable's marginal distribution. Count variables are generated using the inverse CDF method. The CDF of a standard normal variable has a uniform distribution. The appropriate quantile function (F_Y)^(-1) is applied to this uniform variable with the designated parameters to generate the count variable: Y = (F_Y)^(-1)(Phi(Z)). The Negative Binomial variable represents the number of failures which occur in a sequence of Bernoulli trials before the target number of successes is achieved. Zero-inflated Poisson or NB variables are obtained by setting the probability of a structural zero to be greater than 0. The optional error loop attempts to correct the final pairwise correlations to be within a user-specified precision value (epsilon) of the target correlations.

The vignette Variable Types discusses how each of the different variables are generated and describes the required parameters.

The vignette Overall Workflow for Generation of Correlated Data provides a detailed example discussing the step-by-step simulation process and comparing correlation methods 1 and 2.

Usage

corrvar(n = 10000, k_cat = 0, k_cont = 0, k_mix = 0, k_pois = 0,
  k_nb = 0, method = c("Fleishman", "Polynomial"), means = NULL,
  vars = NULL, skews = NULL, skurts = NULL, fifths = NULL,
  sixths = NULL, Six = list(), mix_pis = list(), mix_mus = list(),
  mix_sigmas = list(), mix_skews = list(), mix_skurts = list(),
  mix_fifths = list(), mix_sixths = list(), mix_Six = list(),
  marginal = list(), support = list(), lam = NULL, p_zip = 0,
  size = NULL, prob = NULL, mu = NULL, p_zinb = 0, rho = NULL,
  seed = 1234, errorloop = FALSE, epsilon = 0.001, maxit = 1000,
  use.nearPD = TRUE, nrand = 100000, Sigma = NULL, cstart = list(),
  quiet = FALSE)

Arguments

`n`	the sample size (i.e. the length of each simulated variable; default = 10000)
`k_cat`	the number of ordinal (r >= 2 categories) variables (default = 0)
`k_cont`	the number of continuous non-mixture variables (default = 0)
`k_mix`	the number of continuous mixture variables (default = 0)
`k_pois`	the number of regular Poisson and zero-inflated Poisson variables (default = 0)
`k_nb`	the number of regular Negative Binomial and zero-inflated Negative Binomial variables (default = 0)
`method`	the method used to generate the `k_cont` non-mixture and `k_mix` mixture continuous variables. "Fleishman" uses Fleishman's third-order polynomial transformation and "Polynomial" uses Headrick's fifth-order transformation.
`means`	a vector of means for the `k_cont` non-mixture and `k_mix` mixture continuous variables (i.e. `rep(0, (k_cont + k_mix))`)
`vars`	a vector of variances for the `k_cont` non-mixture and `k_mix` mixture continuous variables (i.e. `rep(1, (k_cont + k_mix))`)
`skews`	a vector of skewness values for the `k_cont` non-mixture continuous variables
`skurts`	a vector of standardized kurtoses (kurtosis - 3, so that normal variables have a value of 0) for the `k_cont` non-mixture continuous variables
`fifths`	a vector of standardized fifth cumulants for the `k_cont` non-mixture continuous variables (not necessary for `method` = "Fleishman")
`sixths`	a vector of standardized sixth cumulants for the `k_cont` non-mixture continuous variables (not necessary for `method` = "Fleishman")
`Six`	a list of vectors of sixth cumulant correction values for the `k_cont` non-mixture continuous variables if no valid PDF constants are found, ex: `Six = list(seq(0.01, 2, 0.01), seq(1, 10, 0.5))`; if no correction is desired for `Y_{cont_i}`, set the i-th list component equal to `NULL`; if no correction is desired for any of the `Y_{cont}` keep as `Six = list()` (not necessary for `method` = "Fleishman")
`mix_pis`	a list of length `k_mix` with i-th component a vector of mixing probabilities that sum to 1 for component distributions of `Y_{mix_i}`
`mix_mus`	a list of length `k_mix` with i-th component a vector of means for component distributions of `Y_{mix_i}`
`mix_sigmas`	a list of length `k_mix` with i-th component a vector of standard deviations for component distributions of `Y_{mix_i}`
`mix_skews`	a list of length `k_mix` with i-th component a vector of skew values for component distributions of `Y_{mix_i}`
`mix_skurts`	a list of length `k_mix` with i-th component a vector of standardized kurtoses for component distributions of `Y_{mix_i}`
`mix_fifths`	a list of length `k_mix` with i-th component a vector of standardized fifth cumulants for component distributions of `Y_{mix_i}` (not necessary for `method` = "Fleishman")
`mix_sixths`	a list of length `k_mix` with i-th component a vector of standardized sixth cumulants for component distributions of `Y_{mix_i}` (not necessary for `method` = "Fleishman")
`mix_Six`	a list of length `k_mix` with i-th component a list of vectors of sixth cumulant correction values for component distributions of `Y_{mix_i}`; use `NULL` if no correction is desired for a given component or mixture variable; if no correction is desired for any of the `Y_{mix}` keep as `mix_Six = list()` (not necessary for `method` = "Fleishman")
`marginal`	a list of length equal to `k_cat`; the i-th element is a vector of the cumulative probabilities defining the marginal distribution of the i-th variable; if the variable can take r values, the vector will contain r - 1 probabilities (the r-th is assumed to be 1); for binary variables, these should be input the same as for ordinal variables with more than 2 categories (i.e. the user-specified probability is the probability of the 1st category, which has the smaller support value)
`support`	a list of length equal to `k_cat`; the i-th element is a vector containing the r ordered support values; if not provided (i.e. `support = list()`), the default is for the i-th element to be the vector 1, ..., r
`lam`	a vector of lambda (mean > 0) constants for the Poisson variables (see `stats::dpois`); the order should be 1st regular Poisson variables, 2nd zero-inflated Poisson variables
`p_zip`	a vector of probabilities of structural zeros (not including zeros from the Poisson distribution) for the zero-inflated Poisson variables (see `VGAM::dzipois`); if `p_zip` = 0, `Y_{pois}` has a regular Poisson distribution; if `p_zip` is in (0, 1), `Y_{pois}` has a zero-inflated Poisson distribution; if `p_zip` is in `(-(exp(lam) - 1)^(-1), 0)`, `Y_{pois}` has a zero-deflated Poisson distribution and `p_zip` is not a probability; if `p_zip = -(exp(lam) - 1)^(-1)`, `Y_{pois}` has a positive-Poisson distribution (see `VGAM::dpospois`); if `length(p_zip) < length(lam)`, the missing values are set to 0 (and ordered 1st)
`size`	a vector of size parameters for the Negative Binomial variables (see `stats::dnbinom`); the order should be 1st regular NB variables, 2nd zero-inflated NB variables
`prob`	a vector of success probability parameters for the NB variables; order the same as in `size`
`mu`	a vector of mean parameters for the NB variables (*Note: either `prob` or `mu` should be supplied for all Negative Binomial variables, not a mixture; default = NULL); order the same as in `size`; for zero-inflated NB this refers to the mean of the NB distribution (see `VGAM::dzinegbin`)
`p_zinb`	a vector of probabilities of structural zeros (not including zeros from the NB distribution) for the zero-inflated NB variables (see `VGAM::dzinegbin`); if `p_zinb` = 0, `Y_{nb}` has a regular NB distribution; if `p_zinb` is in `(-prob^size/(1 - prob^size),` `0)`, `Y_{nb}` has a zero-deflated NB distribution and `p_zinb` is not a probability; if `p_zinb = -prob^size/(1 - prob^size)`, `Y_{nb}` has a positive-NB distribution (see `VGAM::dposnegbin`); if `length(p_zinb) < length(size)`, the missing values are set to 0 (and ordered 1st)
`rho`	the target correlation matrix which must be ordered 1st ordinal, 2nd continuous non-mixture, 3rd components of continuous mixtures, 4th regular Poisson, 5th zero-inflated Poisson, 6th regular NB, 7th zero-inflated NB; note that `rho` is specified in terms of the components of `Y_{mix}`
`seed`	the seed value for random number generation (default = 1234)
`errorloop`	if TRUE, uses `corr_error` to attempt to correct final pairwise correlations to be within `epsilon` of target pairwise correlations (default = FALSE)
`epsilon`	the maximum acceptable error between the final and target pairwise correlations (default = 0.001) in the calculation of ordinal intermediate correlations with `ord_norm` or in the error loop
`maxit`	the maximum number of iterations to use (default = 1000) in the calculation of ordinal intermediate correlations with `ord_norm` or in the error loop
`use.nearPD`	TRUE to convert the overall intermediate correlation matrix to the nearest positive definite matrix with `Matrix::nearPD` if necessary; if FALSE the negative eigenvalues are replaced with 0 if necessary
`nrand`	the number of random numbers to generate in calculating intermediate correlations with `intercorr` (default = 10000)
`Sigma`	an intermediate correlation matrix to use if the user wants to provide one, else it is calculated within by `intercorr`
`cstart`	a list of length equal to `k_cont` + the total number of mixture components containing initial values for root-solving algorithm used in `find_constants`. If user specified, each list element must be input as a matrix. For `method` = "Fleishman", each should have 3 columns for `c_1, c_2, c_3`; for `method` = "Polynomial", each should have 5 columns for `c_1, c_2, c_3, c_4, c_5`. If no starting values are specified for a given component, that list element should be `NULL`.
`quiet`	if FALSE prints simulation messages, if TRUE suppresses message printing

Value

A list whose components vary based on the type of simulated variables.

If ordinal variables are produced: Y_cat the ordinal variables,

If continuous variables are produced:

constants a data.frame of the constants,

Y_cont the continuous non-mixture variables,

Y_comp the components of the continuous mixture variables,

Y_mix the continuous mixture variables,

sixth_correction a list of sixth cumulant correction values,

valid.pdf a vector where the i-th element is "TRUE" if the constants for the i-th continuous variable generate a valid PDF, else "FALSE"

If Poisson variables are produced: Y_pois the regular and zero-inflated Poisson variables,

If Negative Binomial variables are produced: Y_nb the regular and zero-inflated Negative Binomial variables,

Additionally, the following elements:

Sigma the intermediate correlation matrix (after the error loop),

Error_Time the time in minutes required to use the error loop,

Time the total simulation time in minutes,

niter a matrix of the number of iterations used for each variable in the error loop,

Overview of Correlation Method 1

The intermediate correlations used in method 1 are more simulation based than those in method 2, which means that accuracy increases with sample size and the number of repetitions. In addition, specifying the seed allows for reproducibility. In addition, method 1 differs from method 2 in the following ways:

1) The intermediate correlation for count variables is based on the method of Yahav & Shmueli (2012, doi: 10.1002/asmb.901), which uses a simulation based, logarithmic transformation of the target correlation. This method becomes less accurate as the variable mean gets closer to zero.

2) The ordinal - count variable correlations are based on an extension of the method of Amatya & Demirtas (2015, doi: 10.1080/00949655.2014.953534), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and a simulated upper bound on the correlation between an ordinal variable and the normal variable used to generate it (see Demirtas & Hedeker, 2011, doi: 10.1198/tast.2011.10090).

3) The continuous - count variable correlations are based on an extension of the methods of Amatya & Demirtas (2015) and Demirtas et al. (2012, doi: 10.1002/sim.5362), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and the power method correlation between the continuous variable and the normal variable used to generate it (see Headrick & Kowalchuk, 2007, doi: 10.1080/10629360600605065). The intermediate correlations are the ratio of the target correlations to the correction factor.

Please see the Comparison of Correlation Methods 1 and 2 vignette for more information and a step-by-step overview of the simulation process.

Choice of Fleishman's third-order or Headrick's fifth-order method

Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving accuracy. In addition, the range of feasible standardized kurtosis (\gamma_2) values, given skew (\gamma_1) and standardized fifth (\gamma_3) and sixth (\gamma_4) cumulants, is larger than with Fleishman's method (see calc_lower_skurt). For example, the Fleishman method can not be used to generate a non-normal distribution with a ratio of \gamma_1^2/\gamma_2 > 9/14 (see Headrick & Kowalchuk, 2007). This eliminates the Chi-squared family of distributions, which has a constant ratio of \gamma_1^2/\gamma_2 = 2/3. The fifth-order method also generates more distributions with valid PDF's. However, if the fifth and sixth cumulants are unknown or do not exist, the Fleishman approximation should be used.

Reasons for Function Errors

1) The most likely cause for function errors is that no solutions to fleish or poly converged when using find_constants. If this happens, the simulation will stop. It may help to first use find_constants for each continuous variable to determine if a sixth cumulant correction value is needed. The solutions can be used as starting values (see cstart below). If the standardized cumulants are obtained from calc_theory, the user may need to use rounded values as inputs (i.e. skews = round(skews, 8)). For example, in order to ensure that skew is exactly 0 for symmetric distributions.

2) The kurtosis may be outside the region of possible values. There is an associated lower boundary for kurtosis associated with a given skew (for Fleishman's method) or skew and fifth and sixth cumulants (for Headrick's method). Use calc_lower_skurt to determine the boundary for a given set of cumulants.

3) The feasibility of the final correlation matrix rho, given the distribution parameters, should be checked first using validcorr. This function either checks if a given rho is plausible or returns the lower and upper final correlation limits. It should be noted that even if a target correlation matrix is within the "plausible range," it still may not be possible to achieve the desired matrix. This happens most frequently when generating ordinal variables or using negative correlations. The error loop frequently fixes these problems.

References

Please see references for SimCorrMix.

Examples

Sim1 <- corrvar(n = 1000, k_cat = 1, k_cont = 1, method = "Polynomial",
  means = 0, vars = 1, skews = 0, skurts = 0, fifths = 0, sixths = 0,
  marginal = list(c(1/3, 2/3)), support = list(0:2),
  rho = matrix(c(1, 0.4, 0.4, 1), 2, 2), quiet = TRUE)

## Not run: 

# 2 continuous mixture, 1 binary, 1 zero-inflated Poisson, and
# 1 zero-inflated NB variable
n <- 10000
seed <- 1234

# Mixture variables: Normal mixture with 2 components;
# mixture of Logistic(0, 1), Chisq(4), Beta(4, 1.5)
# Find cumulants of components of 2nd mixture variable
L <- calc_theory("Logistic", c(0, 1))
C <- calc_theory("Chisq", 4)
B <- calc_theory("Beta", c(4, 1.5))

skews <- skurts <- fifths <- sixths <- NULL
Six <- list()
mix_pis <- list(c(0.4, 0.6), c(0.3, 0.2, 0.5))
mix_mus <- list(c(-2, 2), c(L[1], C[1], B[1]))
mix_sigmas <- list(c(1, 1), c(L[2], C[2], B[2]))
mix_skews <- list(rep(0, 2), c(L[3], C[3], B[3]))
mix_skurts <- list(rep(0, 2), c(L[4], C[4], B[4]))
mix_fifths <- list(rep(0, 2), c(L[5], C[5], B[5]))
mix_sixths <- list(rep(0, 2), c(L[6], C[6], B[6]))
mix_Six <- list(list(NULL, NULL), list(1.75, NULL, 0.03))
Nstcum <- calc_mixmoments(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]],
  mix_skews[[1]], mix_skurts[[1]], mix_fifths[[1]], mix_sixths[[1]])
Mstcum <- calc_mixmoments(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]],
  mix_skews[[2]], mix_skurts[[2]], mix_fifths[[2]], mix_sixths[[2]])
means <- c(Nstcum[1], Mstcum[1])
vars <- c(Nstcum[2]^2, Mstcum[2]^2)

marginal <- list(0.3)
support <- list(c(0, 1))
lam <- 0.5
p_zip <- 0.1
size <- 2
prob <- 0.75
p_zinb <- 0.2

k_cat <- k_pois <- k_nb <- 1
k_cont <- 0
k_mix <- 2
Rey <- matrix(0.39, 8, 8)
diag(Rey) <- 1
rownames(Rey) <- colnames(Rey) <- c("O1", "M1_1", "M1_2", "M2_1", "M2_2",
  "M2_3", "P1", "NB1")

# set correlation between components of the same mixture variable to 0
Rey["M1_1", "M1_2"] <- Rey["M1_2", "M1_1"] <- 0
Rey["M2_1", "M2_2"] <- Rey["M2_2", "M2_1"] <- Rey["M2_1", "M2_3"] <- 0
Rey["M2_3", "M2_1"] <- Rey["M2_2", "M2_3"] <- Rey["M2_3", "M2_2"] <- 0

# check parameter inputs
validpar(k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
  vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
  mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
  lam, p_zip, size, prob, mu = NULL, p_zinb, rho = Rey)

# check to make sure Rey is within the feasible correlation boundaries
validcorr(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
  vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
  mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal,
  lam, p_zip, size, prob, mu = NULL, p_zinb, Rey, seed)

# simulate without the error loop
Sim2 <- corrvar(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial", means,
  vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus, mix_sigmas,
  mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six, marginal, support,
  lam, p_zip, size, prob, mu = NULL, p_zinb, Rey, seed, epsilon = 0.01)

names(Sim2)

# simulate with the error loop
Sim2_EL <- corrvar(n, k_cat, k_cont, k_mix, k_pois, k_nb, "Polynomial",
  means, vars, skews, skurts, fifths, sixths, Six, mix_pis, mix_mus,
  mix_sigmas, mix_skews, mix_skurts, mix_fifths, mix_sixths, mix_Six,
  marginal, support, lam, p_zip, size, prob, mu = NULL, p_zinb, Rey,
  seed, errorloop = TRUE, epsilon = 0.01)

names(Sim2_EL)

## End(Not run)

[Package SimCorrMix version 0.1.1 Index]