R: Generate simulated data

gen.data {bestridge}

R Documentation

Generate simulated data

Description

Generate data for simulations under the generalized linear model and Cox model.

Usage

gen.data(
  n,
  p,
  k = NULL,
  rho = 0,
  family = c("gaussian", "binomial", "poisson", "cox"),
  beta = NULL,
  cortype = 1,
  snr = 10,
  censoring = TRUE,
  c = 1,
  scal,
  sigma = 1,
  seed = 1
)

Arguments

`n`	The number of observations.
`p`	The number of predictors of interest.
`k`	The number of nonzero coefficients in the underlying regression model. Can be omitted if `beta` is supplied.
`rho`	A parameter used to characterize the pairwise correlation in predictors. Default is `0`.
`family`	The distribution of the simulated data. `"gaussian"` for Gaussian data. `"binomial"` for binary data. `"poisson"` for count data. `"cox"` for survival data.
`beta`	The coefficient values in the underlying regression model.
`cortype`	The correlation structure. `cortype = 1` denotes the exponential structure, where the `(i,j)` entry of the covariance matrix equals `rho^{\|i-j\|}`. `cortype = 2` denotes the constant structure, where the `(i,j)` entry of the covariance matrix is `rho` for every `i \neq j` and 1 elsewhere. `cortype = 3` denotes the moving average structure. Details can be found below.
`snr`	A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as the variance of `x\beta` divided by the variance of the Gaussian noise: `\frac{Var(x\beta)}{\sigma^2}`. The Gaussian noise `\epsilon` has mean 0 and variance `\sigma^2`. The noise is added to the linear predictor `\eta = x\beta`. Default is `snr = 10`. This option is invalid for `cortype = 3`.
`censoring`	Whether data is censored or not. Valid only for `family = "cox"`. Default is `TRUE`.
`c`	The censoring rate. Default is `1`.
`scal`	A parameter used in generating survival times from the Weibull distribution. Only used for the `"cox"` family.
`sigma`	A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance `\sigma^2`. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio. Valid only for `cortype = 3`.
`seed`	Seed to be used in generating the random numbers.

Details

For the exponential and constant structures, we generate an n \times p random Gaussian matrix X with mean 0 and the corresponding covariance matrix. For the exponential structure, the (i,j) entry of the covariance matrix equals rho^{|i-j|}. For the constant structure, the (i,j) entry of the covariance matrix is rho for every i \neq j and 1 elsewhere. For the moving average structure, we first generate an n \times p random Gaussian matrix \bar{X} whose entries are i.i.d. N(0,1) and then normalize its columns to have length \sqrt{n}. The design matrix X is then generated with X_j = \bar{X}_j + \rho(\bar{X}_{j+1}+\bar{X}_{j-1}) for j=2,\dots,p-1.
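The three structures described above can be sketched in base R. This is an illustrative reading of the text, not the package's own code: the helper `draw_x`, the variable names, and the choice to leave the first and last columns of the moving average design unchanged are all assumptions, since the formula is only stated for j = 2, ..., p-1.

```r
# Illustrative sketch only; not the bestridge implementation.
set.seed(1)
n <- 100; p <- 6; rho <- 0.4

# cortype = 1: exponential structure, Sigma[i, j] = rho^|i - j|
Sigma_exp <- rho^abs(outer(1:p, 1:p, "-"))

# cortype = 2: constant structure, rho off the diagonal and 1 on it
Sigma_const <- matrix(rho, p, p)
diag(Sigma_const) <- 1

# Hypothetical helper: draw rows from N(0, Sigma) via the Cholesky factor.
# chol() returns upper-triangular R with t(R) %*% R = Sigma, so Z %*% R
# has covariance Sigma when Z has i.i.d. N(0, 1) entries.
draw_x <- function(n, Sigma) {
  Z <- matrix(rnorm(n * nrow(Sigma)), n, nrow(Sigma))
  Z %*% chol(Sigma)
}
X_exp <- draw_x(n, Sigma_exp)
X_const <- draw_x(n, Sigma_const)

# cortype = 3: moving average structure
X_bar <- matrix(rnorm(n * p), n, p)
# Rescale every column to Euclidean length sqrt(n)
X_bar <- sweep(X_bar, 2, sqrt(colSums(X_bar^2)), "/") * sqrt(n)
X_ma <- X_bar  # assumption: columns 1 and p are left as in X_bar
for (j in 2:(p - 1)) {
  X_ma[, j] <- X_bar[, j] + rho * (X_bar[, j + 1] + X_bar[, j - 1])
}
```

In this sketch `rho` plays two different roles: it is a correlation for the first two structures and a mixing weight for the third.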

For family = "gaussian", the data model is

Y = X \beta + \epsilon.

The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [m, 100m], where m = 5 \sqrt{2\log(p)/n}.
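As a minimal sketch of how the Gaussian model and the `snr` argument fit together, the noise variance can be set to Var(X\beta)/snr. The independent design, the random placement of the nonzero coefficients, and all variable names are assumptions made for illustration; the package may differ in these details.

```r
# Illustrative sketch only; not the bestridge implementation.
set.seed(1)
n <- 200; p <- 20; k <- 5; snr <- 10

X <- matrix(rnorm(n * p), n, p)   # assumption: independent N(0, 1) predictors

# k nonzero coefficients drawn uniformly from [m, 100m]
m <- 5 * sqrt(2 * log(p) / n)
beta <- numeric(p)
support <- sample(p, k)           # assumption: support chosen at random
beta[support] <- runif(k, m, 100 * m)

# Choose sigma^2 so that Var(X beta) / sigma^2 equals snr
eta <- drop(X %*% beta)
sigma2 <- var(eta) / snr
y <- eta + rnorm(n, mean = 0, sd = sqrt(sigma2))
```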

For family = "binomial", the data model is

Prob(Y = 1) = \exp(X \beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).

The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [2m, 10m], where m = 5\sigma \sqrt{2\log(p)/n}.
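The binomial model above can be sketched as a logistic transform of the noisy linear predictor followed by Bernoulli draws. The value of `sigma`, the independent design, and the random support are assumptions for illustration only.

```r
# Illustrative sketch only; not the bestridge implementation.
set.seed(1)
n <- 200; p <- 20; k <- 5
sigma <- 1                         # assumption: noise standard deviation fixed at 1

X <- matrix(rnorm(n * p), n, p)    # assumption: independent N(0, 1) predictors

# k nonzero coefficients drawn uniformly from [2m, 10m]
m <- 5 * sigma * sqrt(2 * log(p) / n)
beta <- numeric(p)
support <- sample(p, k)
beta[support] <- runif(k, 2 * m, 10 * m)

# Prob(Y = 1) = exp(eta) / (1 + exp(eta)), with eta = X beta + epsilon
eta <- drop(X %*% beta) + rnorm(n, mean = 0, sd = sigma)
prob <- plogis(eta)                # numerically stable logistic function
y <- rbinom(n, size = 1, prob = prob)
```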

For family = "poisson", the data is modeled to have an exponential distribution:

Y = Exp(\exp(X \beta + \epsilon)).

For family = "cox", the data model is

T = (-\log(S(t))/\exp(X \beta))^{1/scal}.

The censoring time C is generated from the uniform distribution on [0, c]. We then define the censoring status as \delta = I\{T \leq C\} and the observed time as R = \min\{T, C\}. The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [2m, 10m], where m = 5\sigma \sqrt{2\log(p)/n}. In the above models, \epsilon \sim N(0, \sigma^2), where \sigma^2 is determined by the snr.
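The survival model can be read as inverse transform sampling: replacing S(t) with a uniform draw on (0, 1) yields a Weibull-type survival time. The sketch below follows that reading. The values of `scal`, `c_max` (standing in for the argument `c`) and `beta` are hypothetical and chosen only for illustration.

```r
# Illustrative sketch only; not the bestridge implementation.
set.seed(1)
n <- 200; p <- 5
scal <- 10                         # hypothetical Weibull parameter
c_max <- 1                         # upper bound of the censoring time (argument `c`)

X <- matrix(rnorm(n * p), n, p)    # assumption: independent N(0, 1) predictors
beta <- c(1, -1, 0.5, 0, 0)        # hypothetical coefficients

# T = (-log(S(t)) / exp(X beta))^(1 / scal), with S(t) replaced by U ~ Unif(0, 1)
U <- runif(n)
surv_time <- (-log(U) / exp(drop(X %*% beta)))^(1 / scal)

# Censoring time C ~ Unif(0, c); status delta = I(T <= C); observed R = min(T, C)
cens_time <- runif(n, min = 0, max = c_max)
status <- as.integer(surv_time <= cens_time)
obs_time <- pmin(surv_time, cens_time)
```

Here `status == 1` marks an observed event and `status == 0` a censored one, matching \delta = I\{T \leq C\}.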

Value

`x`	Design matrix of predictors.
`y`	Response variable.
`Tbeta`	The coefficients used in the underlying regression model.

Author(s)

Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.

Examples


library(bestridge)

# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
SNR <- 10
cortype <- 1
seed <- 10
Data <- gen.data(n, p, k, rho, family = "gaussian", cortype = cortype, snr = SNR, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bsrr(x, y, method = "pgsection")

[Package bestridge version 1.0.7 Index]