Gen_Data {SMLE} | R Documentation |
Data simulator for high-dimensional GLMs
Description
This function generates synthetic datasets from GLMs with a user-specified correlation structure. Features may be numerical or categorical, and the number of features may exceed the sample size.
Usage
Gen_Data(
n = 200,
p = 1000,
sigma = 1,
num_ctgidx = NULL,
pos_ctgidx = NULL,
num_truecoef = NULL,
pos_truecoef = NULL,
level_ctgidx = NULL,
effect_truecoef = NULL,
correlation = c("ID", "AR", "MA", "CS"),
rho = 0.2,
family = c("gaussian", "binomial", "poisson")
)
Arguments
n |
Sample size, number of rows for the feature matrix to be generated. |
p |
Number of columns for the feature matrix to be generated. |
sigma |
Parameter for noise level. |
num_ctgidx |
The number of features that are categorical. Set to NULL for only numerical features. |
pos_ctgidx |
Vector of indices denoting which columns are categorical. |
num_truecoef |
The number of features (columns) that affect response. Default is 5. |
pos_truecoef |
Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information. |
level_ctgidx |
Vector to indicate the number of levels for the categorical features in pos_ctgidx. |
effect_truecoef |
Effect size corresponding to the features in pos_truecoef. |
correlation |
Correlation structure among features. |
rho |
Parameter controlling the correlation strength; the default is 0.2. |
family |
Model type for the response variable. |
Details
Simulated data (y_i, x_i), where x_i = (x_{i1}, x_{i2}, ..., x_{ip}), are generated as follows:
First, we generate a p by 1 model coefficient vector beta with all entries being zero, except for the positions specified in pos_truecoef, at which effect_truecoef is used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows:
(0.5 + U) Z,
where U is sampled from a uniform distribution from 0 to 1, and Z is sampled from a binomial distribution with P(Z = 1) = 1/2, P(Z = -1) = 1/2.
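The coefficient step above can be sketched in base R. This is an illustration only, not the package's internal code: the variable names are chosen to mirror the arguments, and the effect-size form (0.5 + U) * Z is an assumption about the default behaviour.

```r
# Illustrative sketch only; not the SMLE implementation.
set.seed(1)
p <- 1000            # number of features
num_truecoef <- 5    # number of nonzero coefficients

beta <- numeric(p)                        # start from an all-zero vector
pos_truecoef <- sample(p, num_truecoef)   # random positions of the true effects

U <- runif(num_truecoef)                             # uniform on (0, 1)
Z <- sample(c(-1, 1), num_truecoef, replace = TRUE)  # sign, each with prob. 1/2
beta[pos_truecoef] <- (0.5 + U) * Z                  # assumed effect-size form
```

Under this assumed form every nonzero coefficient has magnitude between 0.5 and 1.5, so no true effect is arbitrarily close to zero.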
Next, we generate an n by p feature matrix X according to the model selected with correlation and specified as follows.
Independent (ID): all features are independently generated from N(0, 1).
Moving average (MA): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_{j-1}) = rho, cov(x_j, x_{j-2}) = rho/2, and cov(x_j, x_h) = 0 for |j - h| >= 3.
Compound symmetry (CS): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = rho/2 if j, h are both in the set of important features and cov(x_j, x_h) = rho when only one of j or h is in the set of important features.
Auto-regressive (AR): candidate features x_1, ..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = rho^{|j - h|} for all j and h. The correlation strength rho is controlled by the argument rho.
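As a rough illustration of the correlation structures, the sketch below builds a covariance matrix in base R and draws correlated normal features through a Cholesky factor. The helper make_sigma is hypothetical and not part of SMLE. The MA band values (rho and rho/2) are an assumption, and the CS structure is omitted because it depends on which features are important.

```r
# Hypothetical helper, not part of SMLE: p x p covariance for ID, AR, or MA.
make_sigma <- function(p, rho, type = c("ID", "AR", "MA")) {
  type <- match.arg(type)
  lag <- abs(outer(seq_len(p), seq_len(p), "-"))   # matrix of |j - h|
  switch(type,
    ID = diag(p),                                  # independent features
    AR = rho^lag,                                  # cov = rho^|j - h|
    MA = (lag == 0) + rho * (lag == 1) + (rho / 2) * (lag == 2)  # assumed bands
  )
}

set.seed(1)
n <- 200; p <- 10; rho <- 0.2
Sigma <- make_sigma(p, rho, "AR")

# If Z has i.i.d. N(0, 1) entries and R = chol(Sigma), so that t(R) %*% R = Sigma,
# then each row of Z %*% R is N(0, Sigma).
Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
X <- Z %*% chol(Sigma)
```

Calling cor(X[, 1:3]) on the result gives a quick check that neighbouring columns are more strongly correlated than distant ones.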
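Then, we generate the response variable Y according to its response type, which is controlled by the argument family.
For the Gaussian model, y_i = x_i beta + epsilon_i, where epsilon_i is mean-zero normal noise whose level is set by sigma, for i from 1 to n.
For the binary model, let pi_i = P(Y = 1 | x_i). We sample y_i from Bernoulli(pi_i), where logit(pi_i) = x_i beta.
Finally, for the Poisson model, y_i is generated from the Poisson distribution with mean mu_i given by the link mu_i = exp(x_i beta).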
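The three response types can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: the coefficient values are made up, the features are taken as independent for brevity, and sigma is assumed to act as the standard deviation of the Gaussian noise.

```r
# Illustrative sketch only; not the SMLE implementation.
set.seed(1)
n <- 200; p <- 10; sigma <- 1
X <- matrix(rnorm(n * p), nrow = n, ncol = p)   # independent N(0, 1) features
beta <- c(1.2, -0.8, rep(0, p - 2))             # made-up sparse coefficients
eta <- drop(X %*% beta)                         # linear predictor x_i beta

y_gaussian <- eta + rnorm(n, sd = sigma)                # assumes sigma is the noise sd
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))   # logit(pi_i) = x_i beta
y_poisson  <- rpois(n, lambda = exp(eta))               # log link: mean = exp(x_i beta)
```

Here plogis is the inverse logit, so prob = plogis(eta) matches logit(pi_i) = x_i beta.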
For more details see the reference below.
Value
call |
The call that produced this object. |
Y |
Response variable vector of length n. |
X |
Feature matrix or data.frame (matrix if categorical = FALSE, data.frame otherwise). |
subset_true |
Vector of column indices of X for the features that affect the response (relevant features). |
coef_true |
Vector of effects for the features that affect the response. |
categorical |
Logical flag indicating whether the model contains categorical features. |
CI |
Indices of categorical features when categorical = TRUE. |
rho, family, correlation: the values of these arguments as passed in the function call.
References
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.
Examples
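# Simulate data with a binomial response and an auto-regressive (AR) correlation structure.
set.seed(1)
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")
# Sample correlation among the first five features.
cor(Data$X[, 1:5])
# Print the simulated data object.
print(Data)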