Gen_Data {SMLE} | R Documentation |
Data simulator for high-dimensional GLMs
Description
This function generates synthetic datasets from GLMs with a user-specified correlation structure. It supports both numerical and categorical features, and the number of features may exceed the sample size.
Usage
Gen_Data(
n = 200,
p = 1000,
sigma = 1,
num_ctgidx = NULL,
pos_ctgidx = NULL,
num_truecoef = NULL,
pos_truecoef = NULL,
level_ctgidx = NULL,
effect_truecoef = NULL,
correlation = c("ID", "AR", "MA", "CS"),
rho = 0.2,
family = c("gaussian", "binomial", "poisson")
)
Arguments
n |
Sample size, number of rows for the feature matrix to be generated. |
p |
Number of columns for the feature matrix to be generated. |
sigma |
Parameter for noise level. |
num_ctgidx |
The number of features that are categorical. Set to |
pos_ctgidx |
Vector of indices denoting which columns are categorical. |
num_truecoef |
The number of features (columns) that affect response. Default is 5. |
pos_truecoef |
Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information. |
level_ctgidx |
Vector to indicate the number of levels for the categorical features in |
effect_truecoef |
Effect size corresponding to the features in |
correlation |
Correlation structure among features. |
rho |
Parameter controlling the correlation strength, default is |
family |
Model type for the response variable.
|
Details
Simulated data (y_i, x_i), where x_i = (x_{i1}, x_{i2}, ..., x_{ip}), are generated as follows:
First, we generate a p by 1 model coefficient vector beta with all entries being zero, except for the positions specified in pos_truecoef, on which effect_truecoef is used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows: (0.5 + U) Z, where U is sampled from a uniform distribution from 0 to 1, and Z is a random sign with P(Z = 1) = 1/2 and P(Z = -1) = 1/2.
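The coefficient step above can be sketched in a few lines of base R. This is only an illustration of the description, not the package's internal code; the values chosen for p and num_truecoef are assumptions made for the example.

```r
# Illustrative sketch only; not the SMLE source code.
set.seed(1)
p <- 1000            # number of features (assumed for this example)
num_truecoef <- 5    # number of nonzero coefficients (assumed)

# Randomly choose the positions of the nonzero coefficients.
pos_truecoef <- sort(sample.int(p, num_truecoef))

# Effect sizes (0.5 + U) * Z, with U ~ Uniform(0, 1) and Z = +1 or -1,
# each sign drawn with probability 1/2.
U <- runif(num_truecoef)
Z <- sample(c(-1, 1), num_truecoef, replace = TRUE)
effect_truecoef <- (0.5 + U) * Z

# Assemble the p by 1 coefficient vector: zero everywhere except
# at the selected positions.
beta <- numeric(p)
beta[pos_truecoef] <- effect_truecoef
```

By construction, every nonzero coefficient has absolute value between 0.5 and 1.5, so no true effect is arbitrarily close to zero.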
Next, we generate an n by p feature matrix X according to the correlation structure selected with the argument correlation, specified as follows.
Independent (ID): all features are independently generated from N(0, 1).
Moving average (MA): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_{j-1}) = \rho, cov(x_j, x_{j-2}) = \rho/2, and cov(x_j, x_h) = 0 for |j-h| \geq 3.
Compound symmetry (CS): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = \rho/2 if j and h are both in the set of important features, and cov(x_j, x_h) = \rho when only one of j or h is in the set of important features.
Auto-regressive (AR): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = \rho^{|j-h|} for all j and h. The correlation strength \rho is controlled by the argument rho.
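As a rough illustration of the AR and MA structures listed above, the covariance matrices can be built directly from the index distance |j-h| and used to draw correlated normal features. This is a minimal base R sketch under assumed values of n, p and rho; the helper draw_features is hypothetical and is not part of SMLE, and the package may construct X differently.

```r
# Illustrative sketch only; not the SMLE source code.
set.seed(1)
n <- 200; p <- 6; rho <- 0.2   # assumed values for the example

# Matrix of index distances |j - h|.
lag <- abs(outer(seq_len(p), seq_len(p), "-"))

# AR: cov(x_j, x_h) = rho^|j - h| (equals 1 on the diagonal).
Sigma_ar <- rho^lag

# MA: rho at lag 1, rho / 2 at lag 2, zero at larger lags, unit variances.
Sigma_ma <- diag(p)
Sigma_ma[lag == 1] <- rho
Sigma_ma[lag == 2] <- rho / 2

# Hypothetical helper: draw n rows from N(0, Sigma). chol() returns the
# upper triangular factor R with t(R) %*% R = Sigma, so rows of Z %*% R
# have covariance Sigma when Z has independent N(0, 1) entries.
draw_features <- function(n, Sigma) {
  Z <- matrix(rnorm(n * nrow(Sigma)), nrow = n)
  Z %*% chol(Sigma)
}

X_ar <- draw_features(n, Sigma_ar)
X_ma <- draw_features(n, Sigma_ma)
```

The Cholesky approach requires a positive definite covariance matrix, which holds for the small rho used here; the sample correlations from cor(X_ar) should be close to Sigma_ar for moderately large n.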
Then, we generate the response variable Y according to its response type, which is controlled by the argument family.
For the Gaussian model, y_i = x_i\beta + \epsilon_i, where \epsilon_i is N(0, 1), for i from 1 to n.
For the binary model, let \pi_i = P(Y = 1 | x_i). We sample y_i from Bernoulli(\pi_i), where logit(\pi_i) = x_i\beta.
Finally, for the Poisson model, y_i is generated from the Poisson distribution with mean \pi_i = exp(x_i\beta), that is, with a log link.
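The three response models above can be sketched with base R random generators. The feature matrix and the coefficient vector below are made-up values for illustration only, and the code follows the formulas as stated in this section rather than the package internals.

```r
# Illustrative sketch only; not the SMLE source code.
set.seed(1)
n <- 200; p <- 10                        # assumed values for the example
X <- matrix(rnorm(n * p), nrow = n)      # independent (ID) features
beta <- c(1, -1, 0.5, rep(0, p - 3))     # hypothetical sparse coefficients

# Linear predictor x_i beta for each observation.
eta <- drop(X %*% beta)

# Gaussian: y_i = x_i beta + epsilon_i, with epsilon_i ~ N(0, 1).
y_gaussian <- eta + rnorm(n)

# Binomial: y_i ~ Bernoulli(pi_i) with logit(pi_i) = x_i beta;
# plogis() is the inverse logit.
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))

# Poisson: y_i ~ Poisson(exp(x_i beta)).
y_poisson <- rpois(n, lambda = exp(eta))
```

Only the last step differs between families: the same linear predictor eta is passed through the identity, inverse logit, or exponential function before sampling.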
For more details see the reference below.
Value
call |
The call that produced this object. |
Y |
Response variable vector of length |
X |
Feature matrix or data.frame (matrix if |
subset_true |
Vector of column indices of X for the features that affect the response variables (relevant features). |
coef_true |
Vector of effects for the features that affect the response variables. |
categorical |
Logical flag indicating whether the model contains categorical features. |
CI |
Indices of categorical features when |
rho, family, and correlation are returned as passed in the function call.
References
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.
Examples
# Simulate data with a binomial response and an auto-regressive correlation structure.
set.seed(1)
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")
cor(Data$X[,1:5])
print(Data)