Gen_Data {SMLE}    R Documentation

Data simulator for high-dimensional GLMs


This function generates synthetic datasets from GLMs with a user-specified correlation structure. Both numerical and categorical features are supported, and the number of features may exceed the sample size.


Gen_Data(
  n = 200,
  p = 1000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.2,
  family = c("gaussian", "binomial", "poisson")
)



n: Sample size, i.e. the number of rows of the feature matrix to be generated.


p: Number of columns of the feature matrix to be generated.


sigma: Parameter for the noise level.


num_ctgidx: The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.


pos_ctgidx: Vector of indices denoting which columns are categorical.


num_truecoef: The number of features (columns) that affect the response. Default is 5.


pos_truecoef: Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information.


level_ctgidx: Vector indicating the number of levels for the categorical features in pos_ctgidx. Default is 2.


effect_truecoef: Effect sizes corresponding to the features in pos_truecoef. If not specified, each effect size is sampled from a uniform distribution and its direction is chosen at random. See Details.


correlation: Correlation structure among features. correlation = "ID" for independent, correlation = "MA" for moving average, correlation = "CS" for compound symmetry, correlation = "AR" for auto-regressive. Default is "ID". For more information see Details.


rho: Parameter controlling the correlation strength. Default is 0.2. See Details.


family: Model type for the response variable. "gaussian" for normally distributed data, "poisson" for non-negative counts, "binomial" for binary (0-1) data.


Simulated data (y_i, x_i), where x_i = (x_{i1}, x_{i2}, ..., x_{ip}), are generated as follows. First, we generate a p by 1 model coefficient vector β with all entries equal to zero, except for the positions specified in pos_truecoef, where the values in effect_truecoef are used. When pos_truecoef is not specified, we randomly choose num_truecoef positions in the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows:

(0.5 + U) Z,

where U is sampled from a uniform distribution on (0, 1), and Z is a random sign with P(Z = 1) = 1/2 and P(Z = -1) = 1/2.
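The coefficient construction described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the variable names simply mirror the arguments, and the values p = 1000 and num_truecoef = 5 follow the documented defaults.

```r
set.seed(1)                      # fixed seed so the sketch is reproducible
p <- 1000                        # number of features
num_truecoef <- 5                # number of relevant features

# Randomly choose which positions carry a nonzero effect.
pos_truecoef <- sort(sample.int(p, num_truecoef))

# Effect size (0.5 + U) * Z with U ~ Uniform(0, 1) and Z a random sign.
U <- runif(num_truecoef)
Z <- sample(c(-1, 1), num_truecoef, replace = TRUE)
effect_truecoef <- (0.5 + U) * Z

# Assemble the p-by-1 coefficient vector: zero except at the chosen positions.
beta <- numeric(p)
beta[pos_truecoef] <- effect_truecoef
```

By construction, every nonzero coefficient has absolute value between 0.5 and 1.5, so relevant features are never assigned a vanishingly small effect.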

Next, we generate an n by p feature matrix X according to the structure selected with the argument correlation, as specified below.

Independent (ID): all features are independently generated from N(0, 1).

Moving average (MA): candidate features x_1, ..., x_p are jointly normal, marginally N(0, 1), with

cov(x_j, x_{j-1}) = ρ, cov(x_j, x_{j-2}) = ρ/2 and cov(x_j, x_h) = 0 for |j - h| ≥ 3.
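As an illustration, the moving-average covariance matrix can be written down directly in base R. The sketch below assumes the covariance is zero at every lag beyond 2, and uses a small hypothetical p so the band structure is easy to inspect. It is not taken from the package source.

```r
p <- 6        # small p so the matrix can be printed and inspected
rho <- 0.2

# First row of the banded matrix: variance 1, lag-1 covariance rho,
# lag-2 covariance rho / 2, and zero beyond lag 2.
first_row <- c(1, rho, rho / 2, rep(0, p - 3))
Sigma_ma <- toeplitz(first_row)

# chol() errors if the matrix is not positive definite, so a successful
# call confirms that Sigma_ma is a valid covariance matrix for this rho.
R_ma <- chol(Sigma_ma)
print(Sigma_ma)
```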

Compound symmetry (CS): candidate features x_1, ..., x_p are jointly normal, marginally N(0, 1), with cov(x_j, x_h) = ρ/2 if j and h are both in the set of important features, and cov(x_j, x_h) = ρ when only one of j or h is in the set of important features.

Auto-regressive (AR): candidate features x_1, ..., x_p are jointly normal, marginally N(0, 1), with

cov(x_j, x_h) = ρ^{|j - h|} for all j and h. The correlation strength ρ is controlled by the argument rho.
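One common way to draw features with the auto-regressive covariance is to multiply independent normal draws by a Cholesky factor. The base R sketch below uses that approach. Whether the package samples this way is an assumption, and n, p, and the object names are chosen for illustration only.

```r
set.seed(1)
n <- 5000; p <- 5; rho <- 0.2

# AR covariance: entry (j, h) equals rho^|j - h|.
Sigma_ar <- rho^abs(outer(seq_len(p), seq_len(p), "-"))

# chol() returns an upper-triangular R with t(R) %*% R = Sigma_ar, so the rows
# of Z %*% R have covariance Sigma_ar when Z has independent N(0, 1) entries.
Z <- matrix(rnorm(n * p), n, p)
X <- Z %*% chol(Sigma_ar)

# Largest gap between the empirical and target covariance; small for large n.
max(abs(cov(X) - Sigma_ar))
```

The same recipe applies to any positive definite covariance matrix, so it can be reused for other correlation structures by swapping in a different Sigma.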

Then, we generate the response variable Y according to its response type, which is controlled by the argument family. For the Gaussian model, y_i = x_i β + ε_i, where ε_i is N(0, 1), for i from 1 to n. For the binary model, let π_i = P(Y = 1 | x_i). We sample y_i from Bernoulli(π_i), where logit(π_i) = x_i β. Finally, for the Poisson model, y_i is generated from the Poisson distribution with mean π_i = exp(x_i β). For more details, see the reference below.
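The three response models can be sketched in base R as follows. The feature matrix and the coefficient vector here are hypothetical stand-ins (independent N(0, 1) features and a hand-picked sparse β), so this shows the sampling step only, not the package's implementation.

```r
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)          # independent N(0, 1) features ("ID" case)
beta <- c(1.2, -0.8, rep(0, p - 2))      # hypothetical sparse coefficients
eta <- drop(X %*% beta)                  # linear predictor x_i beta

y_gaussian <- eta + rnorm(n)                           # y_i = x_i beta + eps_i, eps_i ~ N(0, 1)
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))  # logit(pi_i) = x_i beta
y_poisson  <- rpois(n, lambda = exp(eta))              # Poisson with mean exp(x_i beta)
```

plogis() is the inverse logit, so prob = plogis(eta) is equivalent to solving logit(π_i) = x_i β for π_i.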



The call that produced this object.


Response variable vector of length nn.


Feature matrix or data.frame (a matrix if num_ctgidx = FALSE and a data.frame otherwise).


Vector of column indices of X for the features that affect the response variable (relevant features).


Vector of effects for the features that affect the response variable.


Logical flag whether the model contains categorical features.


Indices of categorical features when categorical = TRUE.

rho, family, and correlation are returned as passed in the function call.


Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.


# Simulate data with a binomial response and an auto-regressive correlation structure.
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")

[Package SMLE version 2.1-1 Index]