gen.data {bestridge}    R Documentation

Generate simulated data

Description

Generate data for simulations under the generalized linear model and Cox model.

Usage

gen.data(
  n,
  p,
  k = NULL,
  rho = 0,
  family = c("gaussian", "binomial", "poisson", "cox"),
  beta = NULL,
  cortype = 1,
  snr = 10,
  censoring = TRUE,
  c = 1,
  scal,
  sigma = 1,
  seed = 1
)

Arguments

n

The number of observations.

p

The number of predictors of interest.

k

The number of nonzero coefficients in the underlying regression model. Can be omitted if beta is supplied.

rho

A parameter used to characterize the pairwise correlation in predictors. Default is 0.

family

The distribution of the simulated data: "gaussian" for Gaussian data, "binomial" for binary data, "poisson" for count data, and "cox" for survival data.

beta

The coefficient values in the underlying regression model.

cortype

The correlation structure. cortype = 1 denotes the exponential structure, where the $(i,j)$ entry of the covariance matrix equals $\rho^{|i-j|}$. cortype = 2 denotes the constant structure, where the $(i,j)$ entry of the covariance matrix is $\rho$ for every $i \neq j$ and 1 elsewhere. cortype = 3 denotes the moving average structure. Details can be found below.

snr

A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as the variance of $x\beta$ divided by the variance of the Gaussian noise: $\frac{Var(x\beta)}{\sigma^2}$. The Gaussian noise $\epsilon$ has mean 0 and variance $\sigma^2$ and is added to the linear predictor $\eta = x\beta$. Default is snr = 10. This option is not used when cortype = 3.

censoring

Whether data is censored or not. Valid only for family = "cox". Default is TRUE.

c

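A parameter controlling the censoring rate: censoring times are drawn from the uniform distribution on $[0, c]$. Only used for the "cox" family. Default is 1.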

scal

A parameter in generating survival time based on the Weibull distribution. Only used for the "cox" family.

sigma

A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance $\sigma^2$. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio. Valid only for cortype = 3.

seed

Seed to be used in generating the random numbers. Default is 1.

Details

For the exponential and constant structures, we generate an $n \times p$ random Gaussian matrix $X$ with mean 0 and the corresponding covariance matrix. For the exponential structure, the $(i,j)$ entry of the covariance matrix equals $\rho^{|i-j|}$. For the constant structure, the $(i,j)$ entry of the covariance matrix is $\rho$ for every $i \neq j$ and 1 elsewhere. For the moving average structure, we first generate an $n \times p$ random Gaussian matrix $\bar{X}$ whose entries are i.i.d. $N(0,1)$ and then normalize its columns to length $\sqrt{n}$. The design matrix $X$ is then generated with $X_j = \bar{X}_j + \rho(\bar{X}_{j+1} + \bar{X}_{j-1})$ for $j = 2, \dots, p-1$.
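The three structures can be sketched in base R as follows. This is an illustrative sketch, not the package's internal code: the helper names make_sigma and ma_design are hypothetical, sampling through the Cholesky factor is an assumed choice, and the boundary columns ($j = 1$ and $j = p$), which the description above does not specify, are simply left equal to the normalized columns.

```r
# Hypothetical helper: covariance matrix for cortype 1 (exponential) or 2 (constant).
make_sigma <- function(p, rho, cortype = 1) {
  if (cortype == 1) {
    # (i, j) entry is rho^|i - j|
    rho^abs(outer(seq_len(p), seq_len(p), "-"))
  } else {
    Sigma <- matrix(rho, p, p)  # rho off the diagonal
    diag(Sigma) <- 1            # 1 on the diagonal
    Sigma
  }
}

# Hypothetical helper: moving average design (cortype 3).
ma_design <- function(n, p, rho) {
  Xbar <- matrix(rnorm(n * p), n, p)
  # Rescale every column to Euclidean length sqrt(n).
  Xbar <- sweep(Xbar, 2, sqrt(colSums(Xbar^2) / n), "/")
  X <- Xbar
  if (p >= 3) {
    for (j in 2:(p - 1)) {
      X[, j] <- Xbar[, j] + rho * (Xbar[, j + 1] + Xbar[, j - 1])
    }
  }
  X  # columns 1 and p are left unchanged (assumption)
}

set.seed(1)
n <- 100; p <- 5; rho <- 0.4
Sigma_exp <- make_sigma(p, rho, cortype = 1)
Sigma_const <- make_sigma(p, rho, cortype = 2)
# Rows of X_exp follow N(0, Sigma_exp): Z %*% chol(Sigma) has covariance Sigma.
X_exp <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_exp)
X_ma <- ma_design(n, p, rho)
```

With rho = 0, both covariance helpers return the identity matrix and the moving average design reduces to the normalized matrix, so the three structures coincide up to column scaling.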

For family = "gaussian", the data model is

$$Y = X \beta + \epsilon.$$

The underlying regression coefficient $\beta$ has uniform distribution on $[m, 100m]$, where $m = 5 \sqrt{2\log(p)/n}$.
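As a rough illustration of how snr fixes the noise level in the Gaussian model, the sketch below draws the coefficients and solves $Var(x\beta)/\sigma^2 = snr$ for $\sigma$. It is a sketch under stated assumptions rather than the package's code: the predictors are taken to be independent, the positions of the k nonzero coefficients are chosen at random, and the sample variance stands in for $Var(x\beta)$.

```r
set.seed(1)
n <- 200; p <- 20; k <- 5; snr <- 10

X <- matrix(rnorm(n * p), n, p)        # independent predictors, for simplicity
m <- 5 * sqrt(2 * log(p) / n)

beta <- numeric(p)
support <- sample(p, k)                # assumed: nonzero positions chosen at random
beta[support] <- runif(k, m, 100 * m)  # nonzero coefficients uniform on [m, 100m]

eta <- drop(X %*% beta)                # linear predictor x beta
sigma <- sqrt(var(eta) / snr)          # noise s.d. solving Var(x beta) / sigma^2 = snr
y <- eta + rnorm(n, 0, sigma)          # Y = X beta + epsilon
```

A larger snr shrinks sigma, so the response tracks the linear predictor more closely.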

For family = "binomial", the data model is

$$Prob(Y = 1) = \exp(X \beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).$$

The underlying regression coefficient $\beta$ has uniform distribution on $[2m, 10m]$, where $m = 5\sigma \sqrt{2\log(p)/n}$.
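The binomial model can be sketched in the same way. The code below is illustrative only and assumes independent predictors, a fixed value of sigma, and nonzero coefficients placed in the first k positions; none of these choices is taken from the package source.

```r
set.seed(1)
n <- 200; p <- 20; k <- 5; sigma <- 1

X <- matrix(rnorm(n * p), n, p)           # independent predictors, for simplicity
m <- 5 * sigma * sqrt(2 * log(p) / n)

beta <- numeric(p)
beta[seq_len(k)] <- runif(k, 2 * m, 10 * m)  # assumed: first k coefficients nonzero

eta <- drop(X %*% beta) + rnorm(n, 0, sigma)  # X beta + epsilon
prob <- plogis(eta)                           # exp(eta) / (1 + exp(eta))
y <- rbinom(n, 1, prob)                       # binary response
```

plogis is used instead of writing exp(eta) / (1 + exp(eta)) directly because it avoids overflow when eta is large.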

For family = "poisson", the data is modeled to have an exponential distribution:

$$Y = Exp(\exp(X \beta + \epsilon)).$$

For family = "cox", the data model is

$$T = (-\log(S(t))/\exp(X \beta))^{1/scal}.$$

The censoring time $C$ is generated from the uniform distribution on $[0, c]$. We then define the censoring status as $\delta = I\{T \leq C\}$ and the observed time as $R = \min\{T, C\}$. The underlying regression coefficient $\beta$ has uniform distribution on $[2m, 10m]$, where $m = 5\sigma \sqrt{2\log(p)/n}$.

In the above models, $\epsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is determined by snr.
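The survival-time construction can be sketched as below. This is a minimal illustration under assumptions that are not stated in the text: the survival probability $S(t)$ is replaced by a Uniform(0, 1) draw (the usual inverse-transform device), the coefficient values and the values of scal and c_max are arbitrary, and c_max is a hypothetical name playing the role of the argument c.

```r
set.seed(1)
n <- 200; p <- 20; k <- 5
scal <- 10   # illustrative Weibull parameter
c_max <- 1   # upper bound of the censoring time (role of the argument c)

X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.5, k), rep(0, p - k))  # illustrative coefficients
eta <- drop(X %*% beta)

U <- runif(n)                                   # assumed: S(T) is Uniform(0, 1)
surv_time <- (-log(U) / exp(eta))^(1 / scal)    # T
cens_time <- runif(n, 0, c_max)                 # C ~ Uniform[0, c]

delta <- as.integer(surv_time <= cens_time)     # 1 = event observed, 0 = censored
R <- pmin(surv_time, cens_time)                 # observed time min(T, C)
```

Increasing c_max makes censoring times larger on average, so fewer observations are censored.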

Value

x

Design matrix of predictors.

y

Response variable.

Tbeta

The coefficients used in the underlying regression model.

Author(s)

Liyuan Hu, Kangkang Jiang, Yanhang Zhang, Jin Zhu, Canhong Wen and Xueqin Wang.

See Also

bsrr, predict.bsrr.

Examples


# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
SNR <- 10
cortype <- 1
seed <- 10
Data <- gen.data(n, p, k, rho, family = "gaussian", cortype = cortype, snr = SNR, seed = seed)
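# Use the first 140 observations for fitting and hold out the remaining 60
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
# Candidate lambda values on a log scale (defined here but not passed to bsrr below)
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
# Fit the model on the training portion with method = "pgsection"
lm.bsrr <- bsrr(x, y, method = "pgsection")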

[Package bestridge version 1.0.7 Index]