simulateSIRS {VariableScreening} | R Documentation |
Simulate a dataset for demonstrating the performance of screenIID with the SIRS method
Description
Simulates a dataset that can be used to demonstrate variable screening for ultrahigh-dimensional regression with the SIRS option in screenIID. The simulated dataset has p numerical predictors X and a categorical Y-response. The data-generating scenario is a simplified version of Example 1 of Zhu, Li, Li and Zhu (2011). Specifically, the X covariates are normally distributed with mean zero and variance one, and may be correlated if the argument rho is set to a nonzero value. The response Y is generated as Y = c*X1 + 0.8*c*X2 + 0.6*c*X3 + 0.4*c*X4 + 0.5*c*X5 + sigma*e. where c is the argument SignalStrength, e is either a standard normal distribution (if HeavyTailedResponse==FALSE) or t distribution with 1 degree of freedom (if HeavyTailedResponse==TRUE). sigma is either sqrt(6.83) if heteroskedastic==FALSE, or else exp(X20+X21+X22) if heteroskedastic=TRUE.
Usage
simulateSIRS(
n = 200,
p = 5000,
rho = 0,
HeavyTailedResponse = TRUE,
heteroskedastic = TRUE,
SignalStrength = 1
)
Arguments
n |
Number of subjects in the dataset to be simulated. It will also equal to the number of rows in the dataset to be simulated, because it is assumed that each row represents a different independent and identically distributed subject. |
p |
Number of predictor variables (covariates) in the simulated dataset. These covariates will be the features screened by DC-SIS. |
rho |
The correlation between adjacent covariates in the simulated matrix X. The within-subject covariance matrix of X is assumed to has the same form as an AR(1) autoregressive covariance matrix, although this is not meant to imply that the X covariates for each subject are in fact a time series. Instead, it is just used as an example of a parsimonious but nontrivial covariance structure. If rho is left at the default of zero, the X covariates will be independent and the simulation will run faster. |
HeavyTailedResponse |
If this is true, Y residuals will be generated to have much heavier tails (more unusually high or low values) then a normal distribution would have. |
heteroskedastic |
Whether the error variance should be allowed to depend on one of the predictor variables. |
SignalStrength |
A constant used in the simulation to increase or decrease the signal-to-noise ratio; it was set to 0.5, 1, or 2 for weaker, medium or stronger signal. |
Value
A list with following components: X Matrix of predictors to be screened. It will have n rows and p columns. Y Vector of responses. It will have length n.
References
Zhu, L.-P., Li, L., Li, R., & Zhu, L.-X. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464-1475. <DOI:10.1198/jasa.2011.tm10563>
Examples
set.seed(12345678)
results <- simulateSIRS()