sim_Y_MVN_X {sim2Dpredictr}R Documentation

Simulate Scalar Outcomes from Simulated Spatially Correlated Predictors

Description

N spatially correlated design vectors are simulated from an MVN. These design vectors are used to then simulate scalar outcomes that have one of Gaussian, Binomial, Multinomial or Poisson distributions.

Usage

sim_Y_MVN_X(
  N,
  B,
  L = NULL,
  R = NULL,
  S = NULL,
  Q = NULL,
  use.spam = TRUE,
  mu = 0,
  rand.err = 1,
  dist = "gaussian",
  V = NULL,
  incl.subjectID = TRUE,
  threshold.method = "none",
  Y.thresh = NULL,
  X.categorical = FALSE,
  X.num.categories = 2,
  X.category.type = "percentile",
  X.manual.thresh = NULL,
  X.cat.names = NULL,
  print.out = FALSE
)

Arguments

N

The number of draws to take from MVN; i.e., the number of subjects.

B

A vector parameter values; i.e. "betas". Note that length(B) must equal p + 1 = n.row * n.col + 1; e.g. for normal outcomes Y = XB + e with Y a scalar outcome and e the random error. Note that when dist = "multinomial" then B should be a list with length equal to V - 1, i.e., should contain parameter values associated with all categories except the reference category. Alternatively, when dist = "multinomial" B may be a list of length V if one desires to specify parameters for every category, i.e., the over-parameterized model used in Friedman (2010).

L, R

L and R are lower and upper triangular matrices, respectively, and are the Cholesky factor(s) of the desired covariance matrix for the MVN. Obtain L or R via chol_s2Dp() with settings triangle = "lower" or triangle = "upper", respectively. Specify either L or R, but NOT both.

S, Q

A covariance or precision matrix respectively. These are for use with spam, and can be extracted from output of chol_s2Dp after choosing return.cov = TRUE or return.prec = TRUE, respectively.

use.spam

Logical. If use.spam = TRUE then use tools from the R package spam; otherwise, base R functions are employed. For large dimension MVN with sparse correlation structure, spam is recommended; otherwise, base R may be faster. Defaults to FALSE. Requires either the covariance matrix S or precision matrix, Q, that corresponds to the Cholesky factor.

mu

One of the following:

  • A single scalar value for common mean.

  • A vector of length nrow(R) (equivalently nrow(R)) containing means for the MVN.

rand.err

A vector for the random error standard deviation when dist = "gaussian", or thresholding is used to obtain non-Normal draws. Must have length 1 or length N.

dist

The distribution of the scalar outcome.

  • dist = "gaussian" has Y = XB + e, where e ~ N(0, rand.err).

  • dist = "binomial": Y is drawn from a binomial distribution with probability of "success" equal to 1 / (1 + 1 / exp(XB)) using rbinom() when binary.method = "traditional". If binary.method = "gaussian", then simulation is based on a cutoff using binary.cutoff.

  • dist = "multinomial": Y is drawn from sample() using probabilities generated based on Chapter 6.1.3 of Agresti (2007) when length(B) = V - 1 or Friedman (2010) when the length(B) = V. Threshold-based approaches are not currently supported.

  • dist = "poisson": Y is drawn from Poisson(exp(XB)) using rpois().

V

A numeric value stating the number of categories desired when dist = "multinomial".

incl.subjectID

When incl.subjectID = TRUE a column of subject indices is generated.

threshold.method

One of "none", "manual", "percentile", "round". When "none" draws from Binomial or Poisson distributions are taken subject-wise using base R functions. For the remaining options, draws are first taken from a Normal distribution and then thresholded. "manual" uses Y.thresh to manually select a cutoff, "percentile" uses Y.thresh to select percentiles used to bin outcomes, and "round" sets values equal or less than 0 to 0, and rounds all positive values to the nearest whole number.

Y.thresh

A manual value used to threshold when threshold.method = "manual"; values equal or greater than the cutoff are assigned 1 and all others 0. When threshold.method = "percentile", a percentile to use to bin outcomes.

X.categorical

Default is X.categorical = FALSE. If X.categorical = TRUE then thresholds are applied to categorize each predictor/image value.

X.num.categories

A scalar value denoting the number of categories in which to divide the data.

X.category.type

Tells R how to categorize the data. Options are X.category.type = c("percentile", "manual"). If X.category.type = "percentile" then the data are divided into percentiles based on X.num.categories; e.g. if X.num.categories = 4 then the values are divided into quartiles, and values in Q1 equal 0, between Q1 and Q2 equal 1, between Q2 and Q3 equal 2, and greater than Q3 equal 3. If X.category.type = "manual" then specify the cutoff points with X.manual.thresh.

X.manual.thresh

A vector containing the thresholds for categorizing the values; e.g. if X.num.categories = 4 and X.manual.thresh = c(-3, 1, 17), then values less than -3 are set to 0, equal or greater than -3 and less than 1 are set to 1, equal or greater than 1 but less than 17 are set to 2, and equal or greater than 17 are set to 3. Note that length(X.manual.thresh) must always equal X.num.categories - 1.

X.cat.names

A vector of category names. If X.cat.names = NULL then the initial integers assigned are left as the values; the names in X.cat.names are assigned in ascending order.

print.out

If print.out = TRUE then print the following for each subject, indexed y:

  • X[y] %*% B

  • p[y], lambda[y] for Binomial, Poisson, respectively.

This is useful to see the effect of image parameter selection and beta parameter selection on distributional parameters for the outcome of interest.

Value

A data frame where each row consists of a single subject's data. Col 1 is the outcome, Y, and each successive column contains the subject predictor values.

Note

Careful parameter selection, i.e. B, is necessary to ensure that simulated outcomes are reasonable; in particular, counts arising from the Poisson distribution can be unnaturally large.

References

Furrer R, Sain SR (2010). “spam: A Sparse Matrix R Package with Emphasis on MCMC Methods for Gaussian Markov Random Fields.” Journal of Statistical Software, 36(10), 1-25. https://www.jstatsoft.org/v36/i10/.

Ripley BD (1987). Stochastic Simulation. John Wiley & Sons. doi:10.1002/9780470316726.

Rue H (2001). “Fast Sampling of Gaussian Markov Random Fields.” Journal of the Royal Statistical Society B, 63, 325-338. doi:10.1111/1467-9868.00288.

Agresti A (2007). An Introduction to Categorical Analysis, 2nd edition. John Wiley & Sons, Hoboken, New Jersey.

Friedman J, Hastie T, Tibshirani R (2010). “Regularization paths for generalized linear models via coordinate descent.” Journal of Statistical Software, 33, 1-22. doi:10.18637/jss.v033.i01.

Examples

## generate precision matrix and take Cholesky decomposition
Rpre <- chol_s2Dp(im.res = c(3, 3), matrix.type = "prec",
                  use.spam = TRUE, neighborhood = "ar1",
                  triangle = "upper", return.prec = TRUE)
## Generate correlation matrix & take Cholesky decomposition
Rcov <- chol_s2Dp(corr.structure = "ar1", im.res = c(3, 3), 
                  rho = 0.5,
                  triangle = "upper",
                  use.spam = FALSE, neighborhood = "none")

## Define non-zero beta values
Bex <- beta_builder(row.index = c(2, 3), 
                    col.index = c(3, 3),
                    im.res = c(3, 3),
                    B0 = 0, B.values = rep(1, 2),
                    output.indices = FALSE)
## Simulate Datasets
## parameter values
Nex = 100
set.seed(28743)

## with precision matrix
Gauss.exp <- sim_Y_MVN_X(N = Nex, B = Bex,
                         R = Rpre$R, Q = Rpre$Q,
                         dist = "gaussian")
hist(Gauss.exp$Y)

## with covariance matrix
Gauss.exc <- sim_Y_MVN_X(N = Nex, B = Bex,
                         R = Rcov$R, S = Rcov$S,
                         dist = "gaussian")
hist(Gauss.exc$Y)

## direct draws from binomial
Bin.ex <- sim_Y_MVN_X(N = Nex, B = Bex, R = Rcov$R, S = Rcov$S,
                      dist = "binomial", print.out = TRUE)
table(Bin.ex$Y)

## manual cutoff
Bin.ex2 <- sim_Y_MVN_X(N = Nex, B = Bex,
                       R = Rcov$R, S = Rcov$S,
                       dist = "binomial",
                       threshold.method = "manual",
                       Y.thresh = 1.25)
table(Bin.ex2$Y)

## percentile cutoff
Bin.ex3 <- sim_Y_MVN_X(N = Nex, B = Bex,
                       R = Rcov$R, S = Rcov$S,
                       dist = "binomial",
                       threshold.method = "percentile",
                       Y.thresh = 0.75)
table(Bin.ex3$Y)

## Poisson Example - note the large counts
Pois.ex <- sim_Y_MVN_X(N = Nex, B = Bex,
                       R = Rcov$R, S = Rcov$S,
                       dist = "poisson", print.out = TRUE)
mean(Pois.ex$Y)
quantile(Pois.ex$Y, 
         probs = c(0, 0.1, 0.25, 0.45, 0.5,
                   0.75, 0.9, 0.95, 0.99, 1))
hist(Pois.ex$Y)

[Package sim2Dpredictr version 0.1.1 Index]