R: Data simulation for multivariate regression

SimulateRegression {fake}

R Documentation

Data simulation for multivariate regression

Description

Simulates data with outcome(s) and predictors, where only a subset of the predictors actually contributes to the definition of the outcome(s).

Usage

SimulateRegression(
  n = 100,
  pk = 10,
  xdata = NULL,
  family = "gaussian",
  q = 1,
  theta = NULL,
  nu_xy = 0.2,
  beta_abs = c(0.1, 1),
  beta_sign = c(-1, 1),
  continuous = TRUE,
  ev_xy = 0.7
)

Arguments

`n`	number of observations in the simulated dataset. Not used if `xdata` is provided.
`pk`	number of predictor variables. A subset of these variables contributes to the outcome definition (see argument `nu_xy`). Not used if `xdata` is provided.
`xdata`	optional data matrix for the predictors with variables as columns and observations as rows. A subset of these variables contributes to the outcome definition (see argument `nu_xy`).
`family`	type of regression model. Possible values include `"gaussian"` for continuous outcome(s) or `"binomial"` for binary outcome(s).
`q`	number of outcome variables.
`theta`	binary matrix with as many rows as predictors and as many columns as outcomes. A nonzero entry on row `i` and column `j` indicates that predictor `i` contributes to the definition of outcome `j`.
`nu_xy`	vector of length `q` with expected proportion of predictors contributing to the definition of each of the `q` outcomes.
`beta_abs`	vector defining the nonzero regression coefficients in absolute value. If `continuous=TRUE`, `beta_abs` is the range (minimum and maximum) from which absolute coefficient values are sampled. If `continuous=FALSE`, `beta_abs` is the set of possible absolute coefficient values. Note that regression coefficients are re-scaled if `family="binomial"` to ensure that the desired concordance statistic can be achieved (see argument `ev_xy`), so they may fall outside this range.
`beta_sign`	vector of possible signs for regression coefficients. Possible inputs are: `1` for positive coefficients, `-1` for negative coefficients, or `c(-1, 1)` for both positive and negative coefficients.
`continuous`	logical indicating whether to sample the absolute values of regression coefficients from a uniform distribution between the minimum and maximum values in `beta_abs` (if `continuous=TRUE`) or from the discrete set of values listed in `beta_abs` (if `continuous=FALSE`).
`ev_xy`	vector of length `q` with expected goodness of fit measures for each of the `q` outcomes. If `family="gaussian"`, the vector contains expected proportions of variance in each of the `q` outcomes that can be explained by the predictors. If `family="binomial"`, the vector contains expected concordance statistics (i.e. area under the ROC curve) given the true probabilities.
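
To make the roles of `nu_xy`, `beta_abs`, `beta_sign` and `continuous` more concrete, the base R sketch below draws a coefficient matrix following the argument descriptions above. The helper `sample_beta` is hypothetical and is not part of the package; the actual internal implementation may differ (for example, coefficients are re-scaled when `family="binomial"`).

```r
# Hypothetical helper (not part of the 'fake' package): draws theta and beta
# according to the documented meaning of the arguments.
sample_beta <- function(pk = 10, q = 1, nu_xy = 0.2,
                        beta_abs = c(0.1, 1), beta_sign = c(-1, 1),
                        continuous = TRUE) {
  nu_xy <- rep(nu_xy, length.out = q)

  # theta[i, j] = 1 if predictor i contributes to outcome j,
  # drawn with probability nu_xy[j] (assumption: independent Bernoulli draws)
  theta <- sapply(nu_xy, function(nu) rbinom(pk, size = 1, prob = nu))
  theta <- matrix(theta, nrow = pk, ncol = q)

  n_coef <- pk * q
  if (continuous) {
    # beta_abs is a range: uniform draws between its minimum and maximum
    magnitude <- runif(n_coef, min = min(beta_abs), max = max(beta_abs))
  } else {
    # beta_abs is a set: draw from the listed values
    magnitude <- beta_abs[sample.int(length(beta_abs), n_coef, replace = TRUE)]
  }

  # Signs are drawn from beta_sign (1, -1, or both)
  signs <- beta_sign[sample.int(length(beta_sign), n_coef, replace = TRUE)]

  # Coefficients are zero wherever theta is zero
  beta <- theta * matrix(magnitude * signs, nrow = pk, ncol = q)
  list(theta = theta, beta = beta)
}

set.seed(1)
out <- sample_beta(pk = 15, q = 2, nu_xy = c(0.2, 0.5))
colMeans(out$theta) # realised proportions of contributing predictors
range(abs(out$beta[out$beta != 0])) # lies within beta_abs
```

Indexing with `sample.int` rather than calling `sample` on the values directly avoids the special case where `beta_sign` has length one, in which `sample(-1, ...)` would raise an error.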

Value

A list with:

`xdata`	input or simulated predictor data.
`ydata`	simulated outcome data.
`beta`	matrix of true beta coefficients used to generate outcomes in `ydata` from predictors in `xdata`.
`theta`	binary matrix indicating the predictors from `xdata` contributing to the definition of each of the outcome variables in `ydata`.
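
As an illustration of how `beta` and `xdata` relate to `ydata` when `family="gaussian"`, the sketch below generates a single continuous outcome with a chosen expected proportion of explained variance. It uses base R only. The residual variance formula is one standard way of reaching a target explained variance and is an assumption here, not necessarily what the package does internally.

```r
set.seed(1)
n <- 5000
pk <- 10
ev <- 0.7 # target proportion of explained variance (plays the role of ev_xy)

# Independent standard normal predictors
xdata <- matrix(rnorm(n * pk), nrow = n, ncol = pk)

# Only the first two predictors contribute to the outcome
beta <- c(0.8, -0.5, rep(0, pk - 2))

# Linear predictor
eta <- as.vector(xdata %*% beta)

# Residual variance chosen so that var(eta) / (var(eta) + sigma2) = ev
sigma2 <- var(eta) * (1 - ev) / ev
ydata <- eta + rnorm(n, sd = sqrt(sigma2))

# The estimated R-squared should be close to ev for large n
r2 <- summary(lm(ydata ~ xdata))$r.squared
r2
```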

References

Bodinier B, Filippi S, Nost TH, Chiquet J, Chadeau-Hyam M (2021). “Automated calibration for stability selection in penalised regression and graphical models: a multi-OMICs network application exploring the molecular response to tobacco smoking.” https://arxiv.org/abs/2106.02521.

Examples


## Independent predictors

# Univariate continuous outcome
set.seed(1)
simul <- SimulateRegression(pk = 15)
summary(simul)

# Univariate binary outcome
set.seed(1)
simul <- SimulateRegression(pk = 15, family = "binomial")
table(simul$ydata)

# Multiple continuous outcomes
set.seed(1)
simul <- SimulateRegression(pk = 15, q = 3)
summary(simul)


## Blocks of correlated predictors

# Simulation of predictor data
set.seed(1)
xsimul <- SimulateGraphical(pk = rep(5, 3), nu_within = 0.8, nu_between = 0, v_sign = -1)
Heatmap(cor(xsimul$data),
  legend_range = c(-1, 1),
  col = c("navy", "white", "darkred")
)

# Simulation of outcome data
simul <- SimulateRegression(xdata = xsimul$data)
print(simul)
summary(simul)


## Choosing expected proportion of explained variance

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 1000, pk = 15, q = 3, ev_xy = c(0.9, 0.5, 0.2))
summary(simul)

# Comparing with estimated proportion of explained variance
summary(lm(simul$ydata[, 1] ~ simul$xdata))
summary(lm(simul$ydata[, 2] ~ simul$xdata))
summary(lm(simul$ydata[, 3] ~ simul$xdata))


## Choosing expected concordance (AUC)

# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 500, pk = 10,
  family = "binomial", ev_xy = 0.9
)

# Comparing with estimated concordance
fitted <- glm(simul$ydata ~ simul$xdata,
  family = "binomial"
)$fitted.values
Concordance(observed = simul$ydata, predicted = fitted)
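
The concordance statistic used above is the area under the ROC curve. For readers who want to see what it measures, the following base R sketch computes it from ranks (the Mann-Whitney formulation). The function `auc_rank` is a hypothetical name introduced for illustration and is not the package's `Concordance` function, whose implementation may differ.

```r
# Hypothetical helper: area under the ROC curve from ranks.
# Equals the proportion of (case, control) pairs in which the case has the
# higher predicted value, with ties counted as one half.
auc_rank <- function(observed, predicted) {
  n1 <- sum(observed == 1) # number of cases
  n0 <- sum(observed == 0) # number of controls
  r <- rank(predicted) # average ranks are used for ties
  (sum(r[observed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Small check by hand: 3 of the 4 (case, control) pairs are concordant
auc_rank(observed = c(0, 1, 0, 1), predicted = c(0.1, 0.2, 0.8, 0.9))

# Concordance of the true probabilities in a simple logistic model
set.seed(1)
n <- 500
x <- rnorm(n)
prob <- plogis(2 * x)
y <- rbinom(n, size = 1, prob = prob)
auc_true <- auc_rank(observed = y, predicted = prob)
auc_true
```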

[Package fake version 1.4.0 Index]