R: Data simulator for high-dimensional GLMs

Gen_Data {SMLE}

R Documentation

Data simulator for high-dimensional GLMs

Description

This function generates synthetic datasets from GLMs with a user-specified correlation structure. It permits both numerical and categorical features, whose quantity can be larger than the sample size.

Usage

Gen_Data(
  n = 200,
  p = 1000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.2,
  family = c("gaussian", "binomial", "poisson")
)

Arguments

`n`	Sample size, number of rows for the feature matrix to be generated.
`p`	Number of columns for the feature matrix to be generated.
`sigma`	Parameter for noise level.
`num_ctgidx`	The number of features that are categorical. Leave at the default (`NULL` in the signature above) or set to `FALSE` for only numerical features.
`pos_ctgidx`	Vector of indices denoting which columns are categorical.
`num_truecoef`	The number of features (columns) that affect response. Default is 5.
`pos_truecoef`	Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information.
`level_ctgidx`	Vector to indicate the number of levels for the categorical features in `pos_ctgidx`. Default is 2.
`effect_truecoef`	Effect size corresponding to the features in `pos_truecoef`. If not specified, effect size is sampled based on a uniform distribution and direction is randomly sampled. See Details.
`correlation`	Correlation structure among features. `correlation = "ID"` for independent, `correlation = "MA"` for moving average, `correlation = "CS"` for compound symmetry, `correlation = "AR"` for auto-regressive. Default is `"ID"`. For more information see Details.
`rho`	Parameter controlling the correlation strength, default is `0.2`. See Details.
`family`	Model type for the response variable. `"gaussian"` for normally distributed data, `"poisson"` for non-negative counts, `"binomial"` for binary (0-1).

Details

Simulated data (y_i, x_i), where x_i = (x_{i1}, x_{i2}, ..., x_{ip}), are generated as follows. First, we generate a p by 1 model coefficient vector \beta with all entries zero except at the positions specified in pos_truecoef, where the values in effect_truecoef are used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows:

(0.5+U) Z,

where U is sampled from a uniform distribution on (0, 1), and Z is a random sign with P(Z=1) = 1/2 and P(Z=-1) = 1/2.
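The coefficient construction above can be sketched in base R. This is an illustration only, not the package's internal code: the values of `p` and `k` and all variable names are assumptions chosen for the example.

```r
# Illustrative sketch (not the SMLE implementation).
set.seed(1)
p <- 1000   # number of features (assumed value)
k <- 5      # number of nonzero coefficients, playing the role of num_truecoef

pos  <- sort(sample.int(p, k))                # randomly chosen positions
U    <- runif(k)                              # U ~ Uniform(0, 1)
Z    <- sample(c(-1, 1), k, replace = TRUE)   # random sign, each with probability 1/2

beta <- numeric(p)                            # all entries start at zero
beta[pos] <- (0.5 + U) * Z                    # magnitudes lie between 0.5 and 1.5
```

Under this scheme every nonzero coefficient has absolute value between 0.5 and 1.5, so the true signals are bounded away from zero.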

Next, we generate an n by p feature matrix X according to the structure selected with the argument correlation, specified as follows.

Independent (ID): all features are independently generated from N( 0, 1).

Moving average (MA): candidate features x_1,..., x_p are joint normal, marginally N( 0, 1), with

cov(x_j, x_{j-1}) = \rho, cov(x_j, x_{j-2}) = \rho/2 and cov(x_j, x_h) = 0 for |j-h| >= 3.
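As a rough illustration of the moving average structure, the covariance matrix can be built as a banded Toeplitz matrix and sampled through its Cholesky factor. The sketch assumes zero covariance at every lag of 3 or more; the sizes `n` and `p` and the variable names are assumptions for the example, and the package may generate the features differently.

```r
# Illustrative sketch (not the SMLE implementation).
set.seed(1)
n   <- 100   # sample size (assumed value)
p   <- 50    # number of features (assumed value)
rho <- 0.2

# First row of the covariance matrix: lag 0, lag 1, lag 2, then zeros.
Sigma <- toeplitz(c(1, rho, rho / 2, rep(0, p - 3)))

# chol() returns the upper triangular R with t(R) %*% R == Sigma, so
# rows of Z %*% R have covariance Sigma when Z has i.i.d. N(0, 1) entries.
R <- chol(Sigma)
X <- matrix(rnorm(n * p), n, p) %*% R
```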

Compound symmetry (CS): candidate features x_1,..., x_p are joint normal, marginally N( 0, 1), with cov(x_j, x_h) = \rho/2 if j and h are both in the set of important features, and cov(x_j, x_h) = \rho when only one of j or h is in the set of important features.

Auto-regressive (AR): candidate features x_1,..., x_p are joint normal, marginally N( 0, 1), with

cov(x_j, x_h) = \rho^{|j-h|} for all j and h. The correlation strength \rho is controlled by the argument rho.
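The auto-regressive covariance can be written down directly from the formula above. The following base R sketch uses assumed values for `n` and `p` and illustrative variable names; it is not taken from the package source.

```r
# Illustrative sketch (not the SMLE implementation).
set.seed(1)
n   <- 100   # sample size (assumed value)
p   <- 50    # number of features (assumed value)
rho <- 0.2

# Entry (j, h) of the covariance matrix is rho^|j - h|.
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))

# Sample n rows from N(0, Sigma) using the Cholesky factor.
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Sample correlations of neighbouring columns should be near rho.
round(cor(X[, 1:3]), 2)
```

With rho = 0.2 the dependence decays quickly: features two columns apart have covariance 0.04.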

Then, we generate the response variable Y according to its response type, which is controlled by the argument family. For the Gaussian model, y_i = x_i\beta + \epsilon_i where \epsilon_i is N( 0, 1) for i from 1 to n. For the binary model, let \pi_i = P(Y = 1|x_i). We sample y_i from Bernoulli(\pi_i) where logit(\pi_i) = x_i\beta. Finally, for the Poisson model, y_i is generated from the Poisson distribution with the link \pi_i = exp(x_i\beta). For more details see the reference below.
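The three response models described above can be sketched in base R as follows. The design matrix, coefficient vector, and variable names are assumptions made for the example; the sketch follows the formulas in this section rather than the package's internal code.

```r
# Illustrative sketch (not the SMLE implementation).
set.seed(1)
n    <- 200                              # sample size (assumed value)
p    <- 10                               # number of features (assumed value)
X    <- matrix(rnorm(n * p), n, p)       # independent N(0, 1) features
beta <- c(1, -1, rep(0, p - 2))          # hypothetical sparse coefficient vector
eta  <- drop(X %*% beta)                 # linear predictor x_i beta

y_gaussian <- eta + rnorm(n)             # y_i = x_i beta + epsilon_i, epsilon_i ~ N(0, 1)
y_binomial <- rbinom(n, 1, plogis(eta))  # Bernoulli(pi_i) with logit(pi_i) = x_i beta
y_poisson  <- rpois(n, exp(eta))         # Poisson with mean exp(x_i beta)
```

Here `plogis` is the inverse of the logit link, so `plogis(eta)` gives the success probabilities \pi_i.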

Value

`call`	The call that produced this object.
`Y`	Response variable vector of length `n`.
`X`	Feature matrix or data.frame (matrix if `num_ctgidx = FALSE` and data.frame otherwise).
`subset_true`	Vector of column indices of X for the features that affect the response variable (relevant features).
`coef_true`	Vector of effects for the features that affect the response variable.
`categorical`	Logical flag indicating whether the model contains categorical features.
`CI`	Indices of categorical features when `categorical = TRUE`.

`rho`, `family`, and `correlation` are returned as passed in the function call.

References

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.

Examples


# Simulate data with a binomial response and an auto-regressive correlation structure.
set.seed(1)
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")
cor(Data$X[,1:5])
print(Data)

[Package SMLE version 2.1-1 Index]