Gen_Data {SMLE}    R Documentation

Data simulator for high-dimensional GLMs

Description

This function generates synthetic datasets from GLMs with a user-specified correlation structure. It supports both numerical and categorical features, and the number of features may exceed the sample size.

Usage

Gen_Data(
  n = 200,
  p = 1000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.2,
  family = c("gaussian", "binomial", "poisson")
)

Arguments

n

Sample size, number of rows for the feature matrix to be generated.

p

Number of columns for the feature matrix to be generated.

sigma

Parameter for noise level.

num_ctgidx

The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.

pos_ctgidx

Vector of indices denoting which columns are categorical.

num_truecoef

The number of features (columns) that affect response. Default is 5.

pos_truecoef

Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information.

level_ctgidx

Vector to indicate the number of levels for the categorical features in pos_ctgidx. Default is 2.

effect_truecoef

Effect size corresponding to the features in pos_truecoef. If not specified, effect size is sampled based on a uniform distribution and direction is randomly sampled. See Details.

correlation

Correlation structure among features: correlation = "ID" for independent, correlation = "MA" for moving average, correlation = "CS" for compound symmetry, correlation = "AR" for auto-regressive. Default is "ID". See Details for more information.

rho

Parameter controlling the correlation strength, default is 0.2. See Details.

family

Model type for the response variable: "gaussian" for normally distributed data, "poisson" for non-negative counts, "binomial" for binary (0-1) data.

Details

Simulated data (y_i, x_i), where x_i = (x_{i1}, x_{i2}, ..., x_{ip}), are generated as follows. First, we generate a p by 1 model coefficient vector beta with all entries being zero except at the positions specified in pos_truecoef, where the values in effect_truecoef are used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows:

(0.5+U) Z,

where U is sampled from a uniform distribution on (0, 1), and Z is a random sign with P(Z = 1) = 1/2 and P(Z = -1) = 1/2.
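The coefficient construction described above can be sketched in base R. This is only an illustration of the formula (0.5 + U) Z, not the package's internal code; the values of p and num_truecoef are chosen for the example.

```r
# Sketch only: illustrates the description above, not SMLE internals.
set.seed(1)
p <- 1000            # number of features (example value)
num_truecoef <- 5    # number of nonzero coefficients (example value)

# Randomly choose the positions of the nonzero coefficients.
pos_truecoef <- sort(sample.int(p, num_truecoef))

# Effect sizes (0.5 + U) * Z, with U ~ Uniform(0, 1) and Z a random sign.
U <- runif(num_truecoef, min = 0, max = 1)
Z <- sample(c(-1, 1), num_truecoef, replace = TRUE)
effect_truecoef <- (0.5 + U) * Z

# Sparse coefficient vector of length p: zero except at pos_truecoef.
beta <- numeric(p)
beta[pos_truecoef] <- effect_truecoef
```

Under this scheme every nonzero coefficient has absolute value between 0.5 and 1.5, so no true effect is arbitrarily close to zero.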

Next, we generate an n by p feature matrix X according to the structure selected with correlation, specified as follows.

Independent (ID): all features are independently generated from N( 0, 1).

Moving average (MA): candidate features x_1,..., x_p are joint normal, marginally N( 0, 1), with

cov(x_j, x_{j-1}) = \rho, cov(x_j, x_{j-2}) = \rho/2 and cov(x_j, x_h) = 0 for |j-h| >= 3.

Compound symmetry (CS): candidate features x_1,..., x_p are joint normal, marginally N( 0, 1), with cov(x_j, x_h) = \rho/2 if j and h are both in the set of important features, and cov(x_j, x_h) = \rho when only one of j or h is in the set of important features.

Auto-regressive (AR): candidate features x_1,..., x_p are joint normal, marginally N( 0, 1), with

cov(x_j, x_h) = \rho^{|j-h|} for all j and h. The correlation strength \rho is controlled by the argument rho.
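As an illustration, the AR and MA covariance matrices can be built and sampled from in base R. This is a minimal sketch, not the package's implementation: the helper draw_features is hypothetical, the dimensions are example values, and the MA matrix assumes covariances are zero at lag 3 and beyond.

```r
# Sketch only: draw_features() is a hypothetical helper, not part of SMLE.
set.seed(1)
n <- 200; p <- 10; rho <- 0.2

# Lag matrix: lag[j, h] = |j - h|.
lag <- abs(outer(seq_len(p), seq_len(p), "-"))

# AR: cov(x_j, x_h) = rho^|j - h|.
Sigma_ar <- rho^lag

# MA: unit variance, rho at lag 1, rho / 2 at lag 2,
# and (by assumption) zero at lag 3 and beyond.
Sigma_ma <- diag(p)
Sigma_ma[lag == 1] <- rho
Sigma_ma[lag == 2] <- rho / 2

# Draw n rows from N(0, Sigma) via the Cholesky factor R, where Sigma = R'R.
draw_features <- function(n, Sigma) {
  R <- chol(Sigma)
  matrix(rnorm(n * ncol(Sigma)), nrow = n) %*% R
}

X_ar <- draw_features(n, Sigma_ar)
X_ma <- draw_features(n, Sigma_ma)
```

The Cholesky approach needs a positive definite matrix; both matrices above satisfy this for rho = 0.2, but larger values of rho in the MA case may not.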

Then, we generate the response variable Y according to its response type, which is controlled by the argument family. For the Gaussian model, y_i = x_i \beta + \epsilon_i, where \epsilon_i is N( 0, 1) for i from 1 to n. For the binary model, let \pi_i = P(Y = 1|x_i); we sample y_i from Bernoulli(\pi_i), where logit(\pi_i) = x_i \beta. Finally, for the Poisson model, y_i is generated from the Poisson distribution with mean \pi_i = exp(x_i \beta), i.e. a log link. For more details see the reference below.
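The three response models can be sketched in base R as follows. The feature matrix and coefficient vector here are hypothetical example values, and the code follows the formulas above rather than the package's source.

```r
# Sketch only: X and beta are hypothetical example inputs.
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), nrow = n)   # independent N(0, 1) features ("ID")
beta <- c(1.2, -0.8, rep(0, p - 2))   # sparse coefficient vector
eta <- drop(X %*% beta)               # linear predictor x_i beta

# Gaussian: y_i = x_i beta + epsilon_i, with epsilon_i ~ N(0, 1).
y_gaussian <- eta + rnorm(n)

# Binomial: y_i ~ Bernoulli(pi_i), with logit(pi_i) = x_i beta.
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))

# Poisson: y_i ~ Poisson(exp(x_i beta)).
y_poisson <- rpois(n, lambda = exp(eta))
```

Here plogis is the inverse logit, so prob = plogis(eta) matches logit(\pi_i) = x_i \beta.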

Value

call

The call that produced this object.

Y

Response variable vector of length n.

X

Feature matrix or data.frame (matrix if num_ctgidx = FALSE and data.frame otherwise).

subset_true

Vector of column indices of X for the features that affect the response variable (relevant features).

coef_true

Vector of effects for the features that affect the response variable.

categorical

Logical flag indicating whether the model contains categorical features.

CI

Indices of categorical features when categorical = TRUE.

rho, family, correlation

The values of the corresponding arguments passed in the function call.

References

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.

Examples


# Simulate data with a binomial response and an auto-regressive correlation structure.
set.seed(1)
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")
# Sample correlation among the first five features.
cor(Data$X[, 1:5])
print(Data)



[Package SMLE version 2.1-1 Index]