R: Generate simulated data

gen.data {BeSS}

R Documentation

Generate simulated data

Description

Generate data for simulations under the generalized linear model and Cox model.

Usage

  gen.data(n, p, family, K, rho = 0, sigma = 1, beta = NULL, censoring = TRUE,
           c = 1, scal)

Arguments

`n`	The number of observations.
`p`	The number of predictors of interest.
`family`	The distribution of the simulated data: "`gaussian`" for Gaussian data, "`binomial`" for binary data, "`cox`" for survival data.
`K`	The number of nonzero coefficients in the underlying regression model.
`rho`	A parameter used to characterize the pairwise correlation in predictors. Default is 0.
`sigma`	A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance `\sigma^2`. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio.
`beta`	The coefficient values in the underlying regression model.
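`censoring`	Whether the data are censored or not. Default is TRUE.
`c`	A parameter controlling the censoring rate: censoring times are generated from the uniform distribution on [0, c]. Default is 1.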
`scal`	A parameter in generating survival time based on the Weibull distribution. Only used for the "`cox`" family.

Details

For the design matrix X, we first generate an n \times p random Gaussian matrix \bar{X} whose entries are i.i.d. N(0,1), and then normalize its columns to have length \sqrt{n}. The design matrix X is then formed column by column as X_j = \bar{X}_j + \rho(\bar{X}_{j+1}+\bar{X}_{j-1}) for j=2,\dots,p-1.
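The construction above can be sketched in base R. This is an illustrative sketch only, not the package's own code: `make_design` is a hypothetical helper, and leaving the first and last columns equal to the normalized \bar{X} columns is an assumption, since the text only specifies j=2,\dots,p-1.

```r
# Hypothetical helper illustrating the design-matrix construction;
# not the BeSS implementation.
make_design <- function(n, p, rho) {
  stopifnot(p >= 3)
  # n x p matrix with i.i.d. N(0, 1) entries
  Xbar <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # Rescale each column to Euclidean length sqrt(n)
  Xbar <- sweep(Xbar, 2, sqrt(colSums(Xbar^2) / n), "/")
  X <- Xbar
  # Interior columns add rho times their two neighbours.
  # Assumption: boundary columns 1 and p are left unchanged.
  j <- 2:(p - 1)
  X[, j] <- Xbar[, j] + rho * (Xbar[, j + 1] + Xbar[, j - 1])
  X
}

set.seed(1)
X <- make_design(n = 100, p = 6, rho = 0.2)
dim(X)
```

With `rho = 0` the result is simply the normalized Gaussian matrix, so every column has length \sqrt{n}.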

For "gaussian" family, the data model is

Y = X \beta + \epsilon, where \epsilon \sim N(0, \sigma^2 ).

The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [m, 100m], where m = 5 \sqrt{2\log(p)/n}.
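A minimal base R sketch of the Gaussian model follows. It is not the package's code: the stand-in design matrix, the random choice of which K coefficients are nonzero, and the use of `sigma` as the standard deviation \sigma of \epsilon are all assumptions made for illustration.

```r
set.seed(2)
n <- 200; p <- 10; K <- 3; sigma <- 1

# Stand-in design matrix (assumption: plain i.i.d. N(0, 1) entries)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Lower bound of the coefficient range, m = 5 * sqrt(2 log(p) / n)
m <- 5 * sqrt(2 * log(p) / n)

# K nonzero coefficients drawn uniformly from [m, 100m];
# assumption: their positions are chosen at random
beta <- numeric(p)
support <- sample(p, K)
beta[support] <- runif(K, min = m, max = 100 * m)

# Y = X beta + epsilon, epsilon ~ N(0, sigma^2)
y <- drop(X %*% beta + rnorm(n, mean = 0, sd = sigma))
```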

For "binomial" family, the data model is

Prob(Y = 1) = exp(X \beta)/(1 + exp(X \beta)).

The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [2m, 10m], where m = 5\sigma \sqrt{2\log(p)/n}.
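The logistic model can be sketched the same way in base R. Again this is an illustration under assumptions (stand-in design matrix, randomly placed nonzero coefficients), not the package's implementation.

```r
set.seed(3)
n <- 200; p <- 10; K <- 3; sigma <- 1

# Stand-in design matrix (assumption: plain i.i.d. N(0, 1) entries)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Here m scales with sigma: m = 5 * sigma * sqrt(2 log(p) / n)
m <- 5 * sigma * sqrt(2 * log(p) / n)

# K nonzero coefficients drawn uniformly from [2m, 10m];
# assumption: their positions are chosen at random
beta <- numeric(p)
support <- sample(p, K)
beta[support] <- runif(K, min = 2 * m, max = 10 * m)

# plogis(eta) computes exp(eta) / (1 + exp(eta))
prob <- plogis(drop(X %*% beta))

# Binary responses with Prob(Y = 1) = prob
y <- rbinom(n, size = 1, prob = prob)
```

Because m is proportional to `sigma`, a larger `sigma` gives larger coefficients and therefore a stronger signal, which matches the description of the `sigma` argument.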

For "cox" family, the data model is

T = (-\log(S(t))/\exp(X \beta))^{1/scal}.

The censoring time C is generated from the uniform distribution on [0, c]. We then define the censoring status as \delta = I\{T \le C\} and the observed time as R = \min\{T, C\}. The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [2m, 10m], where m = 5\sigma \sqrt{2\log(p)/n}.
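The survival-time and censoring steps can be sketched in base R as follows. This is a sketch under stated assumptions, not the package's code: S(t) is replaced by a Uniform(0,1) draw (the usual inverse-transform device), the design matrix is a stand-in, and the values of `scal` and `c_max` (standing in for the argument `c`) are arbitrary choices for illustration.

```r
set.seed(4)
n <- 200; p <- 10; K <- 3; sigma <- 1
scal  <- 10   # illustrative Weibull parameter (assumed value)
c_max <- 1    # upper bound of the censoring time, i.e. the argument `c`

# Stand-in design matrix (assumption: plain i.i.d. N(0, 1) entries)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# K nonzero coefficients drawn uniformly from [2m, 10m]
m <- 5 * sigma * sqrt(2 * log(p) / n)
beta <- numeric(p)
beta[sample(p, K)] <- runif(K, min = 2 * m, max = 10 * m)

# T = (-log(S(t)) / exp(X beta))^(1 / scal);
# assumption: S(t) is simulated as U ~ Uniform(0, 1)
u <- runif(n)
event_time <- (-log(u) / exp(drop(X %*% beta)))^(1 / scal)

# Censoring time C ~ Uniform[0, c]
cens_time <- runif(n, min = 0, max = c_max)

# delta = I{T <= C}: 1 if the event is observed, 0 if censored
status <- as.integer(event_time <= cens_time)

# R = min{T, C}: the observed time
obs_time <- pmin(event_time, cens_time)

# Observed proportion of censored records
mean(status == 0)
```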

Value

A list with the following components: x, y, Tbeta.

`x`	Design matrix of predictors.
`y`	Response variable.
`Tbeta`	The coefficients used in the underlying regression model.

Author(s)

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.

Examples


# Generate simulated data
n <- 500
p <- 20
K <- 10
sigma <- 1
rho <- 0.2
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

# Best subset selection
fit <- bess(data$x, data$y, family = "gaussian")

[Package BeSS version 2.0.4 Index]