gen.data {BeSS} R Documentation

## Generate simulated data

### Description

Generate data for simulations under the generalized linear model and Cox model.

### Usage

  gen.data(n, p, family, K, rho = 0, sigma = 1, beta = NULL, censoring = TRUE,
           c = 1, scal)


### Arguments

| Argument | Description |
|---|---|
| n | The number of observations. |
| p | The number of predictors of interest. |
| family | The distribution of the simulated data: "gaussian" for Gaussian data, "binomial" for binary data, "cox" for survival data. |
| K | The number of nonzero coefficients in the underlying regression model. |
| rho | A parameter used to characterize the pairwise correlation in predictors. Default is 0. |
| sigma | A parameter used to control the signal-to-noise ratio. For linear regression, the error variance is \sigma^2. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio. Default is 1. |
| beta | The coefficient values in the underlying regression model. Default is NULL. |
| censoring | Whether the data are censored or not. Default is TRUE. |
| c | A parameter controlling the censoring rate: censoring times are drawn from the uniform distribution on [0, c]. Default is 1. |
| scal | A parameter used in generating survival times based on the Weibull distribution. Only used for the "cox" family. |

### Details

For the design matrix X, we first generate an n x p random Gaussian matrix \bar{X} whose entries are i.i.d. N(0,1), and then normalize its columns to have length \sqrt{n}. The design matrix X is then generated with X_j = \bar{X}_j + \rho(\bar{X}_{j+1} + \bar{X}_{j-1}) for j = 2, \dots, p-1.
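The construction above can be sketched in base R. This is a minimal illustration, not the package's internal code: the variable names are made up here, and leaving the first and last columns equal to \bar{X}_1 and \bar{X}_p is an assumption, since the formula is only stated for j = 2, \dots, p-1.

```r
set.seed(1)
n <- 100; p <- 6; rho <- 0.2

# n x p matrix with i.i.d. N(0, 1) entries
Xbar <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Rescale every column to Euclidean length sqrt(n)
Xbar <- sweep(Xbar, 2, sqrt(colSums(Xbar^2) / n), "/")

# Interior columns add rho times their two neighbours;
# columns 1 and p are copied unchanged (assumption)
X <- Xbar
for (j in 2:(p - 1)) {
  X[, j] <- Xbar[, j] + rho * (Xbar[, j + 1] + Xbar[, j - 1])
}
```

With rho = 0 the loop leaves X identical to the normalized \bar{X}, so the predictors are uncorrelated up to sampling noise.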

For the "gaussian" family, the data model is

Y = X \beta + \epsilon, where \epsilon \sim N(0, \sigma^2).

The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [m, 100m], where m = 5 \sqrt{2\log(p)/n}.
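As a rough sketch of the Gaussian model, the following base R code draws a sparse coefficient vector and a response. The plain Gaussian design matrix is only a stand-in, and choosing the K nonzero positions at random is an assumption made for illustration; the package may place them differently.

```r
set.seed(2)
n <- 200; p <- 20; K <- 5; sigma <- 1

# Stand-in design matrix (the correlated construction is omitted here)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Lower bound of the coefficient range
m <- 5 * sqrt(2 * log(p) / n)

# K nonzero coefficients, uniform on [m, 100m]; random support is an assumption
beta <- numeric(p)
support <- sample(p, K)
beta[support] <- runif(K, min = m, max = 100 * m)

# Y = X beta + epsilon, epsilon ~ N(0, sigma^2)
y <- drop(X %*% beta) + rnorm(n, mean = 0, sd = sigma)
```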

For the "binomial" family, the data model is

Prob(Y = 1) = \exp(X \beta)/(1 + \exp(X \beta)).

The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [2m, 10m], where m = 5\sigma \sqrt{2\log(p)/n}.
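The logistic model can be sketched the same way. Again the design matrix is a stand-in and the random choice of support is an assumption; `plogis` is base R's logistic function, which equals exp(x)/(1 + exp(x)).

```r
set.seed(3)
n <- 200; p <- 20; K <- 5; sigma <- 1

# Stand-in design matrix
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Here m also scales with sigma
m <- 5 * sigma * sqrt(2 * log(p) / n)

# K nonzero coefficients, uniform on [2m, 10m]; random support is an assumption
beta <- numeric(p)
support <- sample(p, K)
beta[support] <- runif(K, min = 2 * m, max = 10 * m)

# Prob(Y = 1) = exp(X beta) / (1 + exp(X beta)), then Bernoulli draws
prob <- plogis(drop(X %*% beta))
y <- rbinom(n, size = 1, prob = prob)
```

Because m is proportional to sigma, a larger sigma gives larger coefficients and fitted probabilities closer to 0 or 1, which matches the description of sigma as raising the signal-to-noise ratio for this family.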

For the "cox" family, the data model is

T = (-\log(S(t))/\exp(X \beta))^{1/scal}.

The censoring time C is generated from the uniform distribution on [0, c]. The censoring status is then defined as \delta = I\{T \le C\} and the observed time as R = \min\{T, C\}. The nonzero entries of the underlying regression coefficient \beta are drawn from the uniform distribution on [2m, 10m], where m = 5\sigma \sqrt{2\log(p)/n}.
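A hedged sketch of the survival-time step follows. It assumes the survival probability S(t) is replaced by a Uniform(0, 1) draw U, which is the usual inverse-transform approach but is not confirmed here as the package's exact implementation. The names `surv_time`, `cens_time`, `c_max`, and the value scal = 2 are hypothetical choices for this example.

```r
set.seed(4)
n <- 200; p <- 20; K <- 5; sigma <- 1
scal <- 2     # hypothetical Weibull parameter
c_max <- 1    # plays the role of the argument c

# Stand-in design matrix and sparse coefficients on [2m, 10m]
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
m <- 5 * sigma * sqrt(2 * log(p) / n)
beta <- numeric(p)
beta[sample(p, K)] <- runif(K, min = 2 * m, max = 10 * m)

# T = (-log(U) / exp(X beta))^(1/scal), with U ~ Uniform(0, 1) standing in for S(t)
U <- runif(n)
surv_time <- (-log(U) / exp(drop(X %*% beta)))^(1 / scal)

# Censoring time C ~ Uniform(0, c), status delta = I(T <= C), observed time R = min(T, C)
cens_time <- runif(n, min = 0, max = c_max)
delta <- as.integer(surv_time <= cens_time)
R <- pmin(surv_time, cens_time)
```

The proportion of censored observations in this sketch is `mean(delta == 0)`; a smaller `c_max` makes censoring times shorter and so censors more observations.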

### Value

A list with the following components: x, y, Tbeta.

| Component | Description |
|---|---|
| x | Design matrix of predictors. |
| y | Response variable. |
| Tbeta | The coefficients used in the underlying regression model. |

### Author(s)

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

### References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.

### Examples


library(BeSS)

# Generate simulated data
n <- 500
p <- 20
K <- 10
sigma <- 1
rho <- 0.2
data <- gen.data(n, p, family = "gaussian", K = K, rho = rho, sigma = sigma)

# Best subset selection
fit <- bess(data$x, data$y, family = "gaussian")



[Package BeSS version 2.0.3 Index]