SimulateRegression {fake}    R Documentation

Data simulation for multivariate regression

Description

Simulates data with outcome(s) and predictors, where only a subset of the predictors actually contributes to the definition of the outcome(s).

Usage

SimulateRegression(
  n = 100,
  pk = 10,
  xdata = NULL,
  family = "gaussian",
  q = 1,
  theta = NULL,
  nu_xy = 0.2,
  beta_abs = c(0.1, 1),
  beta_sign = c(-1, 1),
  continuous = TRUE,
  ev_xy = 0.7
)

Arguments

n

number of observations in the simulated dataset. Not used if xdata is provided.

pk

number of predictor variables. A subset of these variables contributes to the outcome definition (see argument nu_xy). Not used if xdata is provided.

xdata

optional data matrix for the predictors with variables as columns and observations as rows. A subset of these variables contributes to the outcome definition (see argument nu_xy).

family

type of regression model. Possible values include "gaussian" for continuous outcome(s) or "binomial" for binary outcome(s).
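As an illustration of what a binary outcome model can look like, the base R sketch below draws outcomes from a logistic model and computes the concordance (area under the ROC curve) of the true probabilities. The logistic link, the coefficient values and all object names are assumptions made for this example; the sketch is not the package's internal code.

```r
# Hypothetical illustration in base R (does not call the fake package)
set.seed(1)
n <- 500
x <- matrix(rnorm(n * 3), nrow = n)
beta <- c(1.5, -1, 0) # third predictor does not contribute

# Assumption: logistic link between the linear predictor and the probability
prob <- 1 / (1 + exp(-as.vector(x %*% beta)))
y <- rbinom(n, size = 1, prob = prob)

# Concordance (AUC) of the true probabilities, using the rank
# (Mann-Whitney) formula
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(rank(prob)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```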

q

number of outcome variables.

theta

binary matrix with as many rows as predictors and as many columns as outcomes. A nonzero entry on row i and column j indicates that predictor i contributes to the definition of outcome j.

nu_xy

vector of length q with the expected proportion of predictors contributing to the definition of each of the q outcomes.
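The relationship between nu_xy and a binary matrix such as theta can be sketched in base R as follows. This is a minimal sketch assuming independent Bernoulli draws for each predictor-outcome pair; the package may sample theta differently, and all object names here are illustrative.

```r
# Hypothetical illustration in base R (does not call the fake package)
set.seed(1)
pk <- 10             # number of predictors
nu_xy <- c(0.2, 0.5) # expected proportion of contributing predictors per outcome

# Assumption: each predictor contributes to outcome j with probability nu_xy[j]
theta <- sapply(nu_xy, function(nu) rbinom(pk, size = 1, prob = nu))

dim(theta)      # one row per predictor, one column per outcome
colMeans(theta) # realised proportions, which vary around nu_xy
```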

beta_abs

vector defining the range of nonzero regression coefficients in absolute value. If continuous=FALSE, beta_abs is the set of possible absolute coefficient values. If continuous=TRUE, beta_abs is the range of possible absolute coefficient values. Note that regression coefficients are re-scaled if family="binomial" to ensure that the desired concordance statistic can be achieved (see argument ev_xy), so they may fall outside this range.

beta_sign

vector of possible signs for regression coefficients. Possible inputs are: 1 for positive coefficients, -1 for negative coefficients, or c(-1, 1) for both positive and negative coefficients.

continuous

logical indicating whether to sample regression coefficients from a uniform distribution between the minimum and maximum values in beta_abs (if continuous=TRUE) or from proposed values in beta_abs (if continuous=FALSE).
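The two sampling schemes described for continuous can be sketched in base R as below. The helper pick() and the other object names are hypothetical and introduced only for this example; the code mirrors the description above rather than the package's implementation.

```r
# Hypothetical illustration in base R (does not call the fake package)
set.seed(1)
beta_abs <- c(0.1, 1)
beta_sign <- c(-1, 1)
n_coef <- 5

# Index explicitly: sample(x) on a length-one numeric x would sample from 1:x
pick <- function(x, size) {
  x[sample.int(length(x), size = size, replace = TRUE)]
}

# continuous = TRUE: magnitudes uniform between min(beta_abs) and max(beta_abs)
beta_cont <- runif(n_coef, min = min(beta_abs), max = max(beta_abs)) *
  pick(beta_sign, n_coef)

# continuous = FALSE: magnitudes drawn from the values listed in beta_abs
beta_disc <- pick(beta_abs, n_coef) * pick(beta_sign, n_coef)

beta_cont
beta_disc
```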

ev_xy

vector of length q with expected goodness of fit measures for each of the q outcomes. If family="gaussian", the vector contains expected proportions of variance in each of the q outcomes that can be explained by the predictors. If family="binomial", the vector contains expected concordance statistics (i.e. area under the ROC curve) given the true probabilities.
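For family="gaussian", one way to reach a target proportion of explained variance is to choose the residual variance from the variance of the linear predictor. The base R sketch below illustrates this idea under that assumption; the coefficient values and object names are made up for the example, and the package may proceed differently.

```r
# Hypothetical illustration in base R (does not call the fake package)
set.seed(1)
n <- 1000
pk <- 5
ev <- 0.7 # target proportion of explained variance

x <- matrix(rnorm(n * pk), nrow = n)
beta <- c(1, -0.5, 0, 0, 0.8)
signal <- as.vector(x %*% beta)

# Residual variance such that var(signal) / (var(signal) + sigma2) = ev
sigma2 <- var(signal) * (1 - ev) / ev
y <- signal + rnorm(n, sd = sqrt(sigma2))

# The estimated R-squared should be close to ev for large n
r2 <- summary(lm(y ~ x))$r.squared
r2
```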

Value

A list with:

xdata

input or simulated predictor data.

ydata

simulated outcome data.

beta

matrix of true beta coefficients used to generate outcomes in ydata from predictors in xdata.

theta

binary matrix indicating the predictors from xdata contributing to the definition of each of the outcome variables in ydata.

References

Bodinier B, Filippi S, Nost TH, Chiquet J, Chadeau-Hyam M (2021). “Automated calibration for stability selection in penalised regression and graphical models: a multi-OMICs network application exploring the molecular response to tobacco smoking.” https://arxiv.org/abs/2106.02521.

See Also

Other simulation functions: SimulateAdjacency(), SimulateClustering(), SimulateComponents(), SimulateCorrelation(), SimulateGraphical(), SimulateStructural()

Examples


## Independent predictors

# Univariate continuous outcome
set.seed(1)
simul <- SimulateRegression(pk = 15)
summary(simul)

# Univariate binary outcome
set.seed(1)
simul <- SimulateRegression(pk = 15, family = "binomial")
table(simul$ydata)

# Multiple continuous outcomes
set.seed(1)
simul <- SimulateRegression(pk = 15, q = 3)
summary(simul)


## Blocks of correlated predictors

# Simulation of predictor data
set.seed(1)
xsimul <- SimulateGraphical(pk = rep(5, 3), nu_within = 0.8, nu_between = 0, v_sign = -1)
Heatmap(cor(xsimul$data),
  legend_range = c(-1, 1),
  col = c("navy", "white", "darkred")
)

# Simulation of outcome data
simul <- SimulateRegression(xdata = xsimul$data)
print(simul)
summary(simul)


## Choosing expected proportion of explained variance

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 1000, pk = 15, q = 3, ev_xy = c(0.9, 0.5, 0.2))
summary(simul)

# Comparing with estimated proportion of explained variance
summary(lm(simul$ydata[, 1] ~ simul$xdata))
summary(lm(simul$ydata[, 2] ~ simul$xdata))
summary(lm(simul$ydata[, 3] ~ simul$xdata))


## Choosing expected concordance (AUC)

# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 500, pk = 10,
  family = "binomial", ev_xy = 0.9
)

# Comparing with estimated concordance
fitted <- glm(simul$ydata ~ simul$xdata,
  family = "binomial"
)$fitted.values
Concordance(observed = simul$ydata, predicted = fitted)


[Package fake version 1.4.0 Index]