gen.data {BeSS}    R Documentation

Generate simulated data

Description

Generate data for simulations under the generalized linear model and Cox model.

Usage

  gen.data(n, p, family, K, rho = 0, sigma = 1, beta = NULL, censoring = TRUE,
           c = 1, scal)

Arguments

n

The number of observations.

p

The number of predictors of interest.

family

The distribution of the simulated data: "gaussian" for Gaussian data, "binomial" for binary data, or "cox" for survival data.

K

The number of nonzero coefficients in the underlying regression model.

rho

A parameter used to characterize the pairwise correlation in predictors. Default is 0.

sigma

A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance σ^2. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio.

beta

The coefficient values in the underlying regression model.

censoring

Whether the data are censored or not. Default is TRUE.

c

A parameter controlling the censoring rate: censoring times are generated from the uniform distribution on [0, c]. Default is 1.

scal

A parameter in generating survival time based on the Weibull distribution. Only used for the "cox" family.

Details

For the design matrix X, we first generate an n x p random Gaussian matrix \bar{X} whose entries are i.i.d. N(0, 1), and then normalize its columns to length √n. The design matrix X is then generated as X_j = \bar{X}_j + ρ(\bar{X}_{j+1} + \bar{X}_{j-1}) for j = 2, …, p-1.
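The construction above can be sketched in base R. This is a minimal illustration, not the package's internal code: the sizes are arbitrary, and leaving the first and last columns unchanged is an assumption, since the formula is stated only for j = 2, …, p-1.

```r
set.seed(1)
n <- 100; p <- 8; rho <- 0.2   # illustrative sizes, not package defaults

# n x p matrix with i.i.d. N(0, 1) entries
Xbar <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Rescale every column to Euclidean length sqrt(n)
Xbar <- sweep(Xbar, 2, sqrt(colSums(Xbar^2) / n), "/")

# Interior columns add rho times their two neighbours.
# Assumption: columns 1 and p are copied unchanged.
X <- Xbar
for (j in 2:(p - 1)) {
  X[, j] <- Xbar[, j] + rho * (Xbar[, j + 1] + Xbar[, j - 1])
}

# Neighbouring columns of X are now correlated when rho != 0
round(cor(X)[1:4, 1:4], 2)
```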

For "gaussian" family, the data model is

Y = X β + ε, where ε ~ N(0, σ^2).

The nonzero entries of the underlying regression coefficient β are drawn from the uniform distribution on [m, 100m], where m = 5√(2 log(p)/n).
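A rough base R sketch of the Gaussian model follows. Several choices here are assumptions made for illustration: the design matrix is a plain Gaussian stand-in, the K nonzero positions are picked at random, and sigma is treated as the standard deviation σ of the noise.

```r
set.seed(2)
n <- 200; p <- 10; K <- 3; sigma <- 1   # illustrative values

# Stand-in design matrix (assumption: no column correlation)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Lower bound m of the coefficient range
m <- 5 * sqrt(2 * log(p) / n)

# K nonzero coefficients drawn uniformly from [m, 100m];
# assumption: their positions are chosen at random
beta <- numeric(p)
support <- sample(p, K)
beta[support] <- runif(K, min = m, max = 100 * m)

# Y = X beta + eps, with eps ~ N(0, sigma^2)
# (assumption: sigma is the standard deviation)
y <- drop(X %*% beta) + rnorm(n, mean = 0, sd = sigma)
```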

For "binomial" family, the data model is

Prob(Y = 1) = exp(X β)/(1 + exp(X β))

The nonzero entries of the underlying regression coefficient β are drawn from the uniform distribution on [2m, 10m], where m = 5σ√(2 log(p)/n).
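The logistic model can be sketched the same way in base R. As before, the stand-in design matrix and the random choice of nonzero positions are assumptions for illustration; `plogis(z)` computes exp(z)/(1 + exp(z)).

```r
set.seed(3)
n <- 200; p <- 10; K <- 3; sigma <- 1   # illustrative values

# Stand-in design matrix (assumption: no column correlation)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Here sigma scales the coefficient range, not a noise term
m <- 5 * sigma * sqrt(2 * log(p) / n)

beta <- numeric(p)
support <- sample(p, K)                 # assumption: random positions
beta[support] <- runif(K, min = 2 * m, max = 10 * m)

# Prob(Y = 1) = exp(X beta) / (1 + exp(X beta))
prob <- plogis(drop(X %*% beta))
y <- rbinom(n, size = 1, prob = prob)

table(y)
```

A larger sigma widens the coefficient range, pushing the probabilities toward 0 or 1, which is one way to read "higher signal-to-noise ratio" for this family.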

For "cox" family, the data model is

T = (-log(S(t))/exp(X β))^(1/scal).

The censoring time C is generated from the uniform distribution on [0, c]. We then define the censoring status as δ = I{T <= C} and the observed time as R = min{T, C}. The nonzero entries of the underlying regression coefficient β are drawn from the uniform distribution on [2m, 10m], where m = 5σ√(2 log(p)/n).
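The survival-time recipe can be sketched in base R as follows. This is an illustration under stated assumptions: S(t) is replaced by a Uniform(0, 1) draw (the usual inverse-transform device), the design matrix is a plain Gaussian stand-in, the nonzero positions are random, and the values of `scal` and `cmax` (standing in for the argument c) are hypothetical.

```r
set.seed(4)
n <- 200; p <- 10; K <- 3; sigma <- 1
scal <- 10   # hypothetical Weibull parameter
cmax <- 1    # upper bound c of the censoring time

# Stand-in design matrix and coefficients (assumptions as noted above)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
m <- 5 * sigma * sqrt(2 * log(p) / n)
beta <- numeric(p)
beta[sample(p, K)] <- runif(K, min = 2 * m, max = 10 * m)

# T = (-log(S(t)) / exp(X beta))^(1 / scal), with S(t) ~ Uniform(0, 1)
u <- runif(n)
time <- (-log(u) / exp(drop(X %*% beta)))^(1 / scal)

# Censoring time C ~ Uniform[0, c]
cens <- runif(n, min = 0, max = cmax)

status <- as.integer(time <= cens)   # delta = I{T <= C}
obs <- pmin(time, cens)              # R = min{T, C}

mean(status == 0)                    # empirical share of censored observations
```

Increasing `cmax` makes censoring times larger on average, so fewer observations end up censored.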

Value

A list with the following components: x, y, Tbeta.

x

Design matrix of predictors.

y

Response variable.

Tbeta

The coefficients used in the underlying regression model.
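A short way to inspect the returned list, assuming the BeSS package is installed; the call is skipped otherwise. The variable name `sim` and the sizes are illustrative only.

```r
if (requireNamespace("BeSS", quietly = TRUE)) {
  set.seed(5)
  sim <- BeSS::gen.data(n = 100, p = 10, family = "gaussian", K = 3)

  print(names(sim))             # components of the returned list
  print(dim(sim$x))             # design matrix dimensions
  print(which(sim$Tbeta != 0))  # positions of the nonzero true coefficients
} else {
  message("Package 'BeSS' is not installed; skipping.")
}
```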

Author(s)

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.

Examples


# Generate simulated data
n <- 500
p <- 20
K <- 10
sigma <- 1
rho <- 0.2
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

# Best subset selection
fit <- bess(data$x, data$y, family = "gaussian")



[Package BeSS version 2.0.3 Index]