sieve_preprocess {Sieve} | R Documentation |
Preprocess the original data for sieve estimation.
Description
Generate the design matrix for the downstream lasso-type penalized model fitting.
Usage
sieve_preprocess(
X,
basisN = NULL,
maxj = NULL,
type = "cosine",
interaction_order = 3,
index_matrix = NULL,
norm_feature = TRUE,
norm_para = NULL
)
Arguments
X |
a data frame containing the original features. The (i,j)-th element is the j-th dimension of the i-th sample's feature vector, so the number of rows equals the sample size and the number of columns equals the feature dimension. |
basisN |
the number of sieve basis functions. It is in general larger than the dimension of the original features.
Default is 50*dimension of original feature. A larger value gives a smaller approximation error, but the model is harder to estimate.
The computational time/memory requirement should scale linearly to |
maxj |
a number. The maximum index product of the basis functions. A larger value means a larger basisN. If basisN is already specified, there is no need to provide a value for this argument. |
type |
a string. It specifies what kind of basis functions are used. The default is (aperiodic) cosine basis functions, which are suitable for most purposes. |
interaction_order |
a number. It also controls the model complexity. 1 means fitting an additive model, 2 means fitting a model that allows interaction terms between 2 dimensions of the feature, 3 means allowing interaction terms between 3 dimensions of the feature, etc. The default is 3. For large-sample, lower-dimensional problems, try a larger value (it needs to be smaller than the dimension of the original features); for smaller-sample, higher-dimensional problems, try setting it to a smaller value (1 or 2). |
index_matrix |
a matrix. A pre-generated index matrix. The default is NULL, meaning sieve_preprocess will generate one for the user. |
norm_feature |
a logical variable. Default is TRUE, meaning sieve_preprocess will rescale each dimension of the features to lie between 0 and 1. Only set it to FALSE when the features have already been manually rescaled to lie between 0 and 1. |
norm_para |
a matrix. It specifies how the features are normalized. For training data, use the default value NULL. |
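The interplay between norm_feature and norm_para can be illustrated with a minimal base-R sketch. It assumes min-max rescaling of each column to [0, 1]; the helper name rescale01 and the two-row layout of its norm_para are hypothetical, chosen only for illustration, and are not necessarily what sieve_preprocess does internally.

```r
# Hypothetical helper (not part of Sieve): min-max rescale each column to [0, 1].
# If norm_para is NULL, the column minima and maxima are computed from X
# (training data); otherwise the supplied values are reused (testing data).
# Assumes no column is constant, otherwise the span would be zero.
rescale01 <- function(X, norm_para = NULL) {
  X <- as.matrix(X)
  if (is.null(norm_para)) {
    norm_para <- rbind(min = apply(X, 2, min), max = apply(X, 2, max))
  }
  span <- norm_para["max", ] - norm_para["min", ]
  X_scaled <- sweep(sweep(X, 2, norm_para["min", ], "-"), 2, span, "/")
  list(X = X_scaled, norm_para = norm_para)
}

set.seed(1)
train_X <- data.frame(x1 = runif(20, -3, 3), x2 = rnorm(20))
test_X  <- data.frame(x1 = runif(5, -3, 3),  x2 = rnorm(5))

# Training data: learn the rescaling from the data itself.
train_scaled <- rescale01(train_X)
# Testing data: reuse the training rescaling so both share one scale.
test_scaled <- rescale01(test_X, norm_para = train_scaled$norm_para)
```

Reusing the training parameters means rescaled test features can fall slightly outside [0, 1] when a test value lies beyond the training range.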
Value
A list containing the information needed for the next model-fitting step. Typically, the list is used as the main input of Sieve::sieve_solver.
Phi |
a matrix. This is the design matrix used directly by the next model-fitting step. The (i,j)-th element of this matrix is the j-th basis function evaluated at the i-th sample's feature vector. The dimension of this matrix is sample size x basisN. |
X |
a matrix. This is the rescaled original feature/predictor matrix. |
type |
a string. The type of basis function. |
index_matrix |
a matrix. It specifies which product basis functions are used when constructing the design matrix Phi. Its dimension is basisN x dimension of original features. Each row has at most interaction_order elements that are not equal to 1. |
basisN |
a number. Number of sieve basis functions. |
norm_para |
a matrix. It records how each dimension of the feature/predictor is rescaled, which is useful when rescaling the testing sample's predictors. |
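To make the relationship between index_matrix and Phi concrete, the sketch below builds a small design matrix by hand in base R. It assumes the usual aperiodic cosine basis on [0, 1], with psi_1(x) = 1 and psi_j(x) = sqrt(2) * cos((j - 1) * pi * x) for j >= 2, and assumes each row of the index matrix holds one univariate index per feature dimension. The helper names cosine_basis and build_phi are hypothetical, and the package's internal implementation may differ.

```r
# Univariate aperiodic cosine basis on [0, 1] (assumed form):
# index 1 is the constant function, index j >= 2 is sqrt(2) * cos((j - 1) * pi * x).
cosine_basis <- function(x, j) {
  if (j == 1) rep(1, length(x)) else sqrt(2) * cos((j - 1) * pi * x)
}

# Hypothetical helper: evaluate every product basis function at every sample.
# X is an n x d matrix with entries in [0, 1]; index_matrix is basisN x d.
build_phi <- function(X, index_matrix) {
  Phi <- matrix(1, nrow = nrow(X), ncol = nrow(index_matrix))
  for (k in seq_len(nrow(index_matrix))) {
    for (dim_id in seq_len(ncol(X))) {
      # Multiply in the univariate factor for this feature dimension.
      Phi[, k] <- Phi[, k] * cosine_basis(X[, dim_id], index_matrix[k, dim_id])
    }
  }
  Phi
}

# Toy index matrix for 2 features: constant, two additive terms, one interaction.
index_matrix <- rbind(c(1, 1), c(2, 1), c(1, 2), c(2, 2))

set.seed(1)
X <- matrix(runif(10), nrow = 5, ncol = 2)
Phi <- build_phi(X, index_matrix)
dim(Phi)  # 5 samples by 4 basis functions
```

An index of 1 contributes the constant factor 1, so a row such as c(2, 1) is a function of the first feature only, while c(2, 2) is a genuine two-way interaction term.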
Examples
xdim <- 1 #1 dimensional feature
#generate 1000 training samples
TrainData <- GenSamples(s.size = 1000, xdim = xdim)
#use 50 cosine basis functions
type <- 'cosine'
basisN <- 50
sieve.model <- sieve_preprocess(X = TrainData[,2:(xdim+1)],
basisN = basisN, type = type)
#sieve.model$Phi #Phi is the design matrix
xdim <- 5 #5 dimensional feature
#generate 1000 training samples
#only the first two dimensions are truly associated with the outcome
TrainData <- GenSamples(s.size = 1000, xdim = xdim,
frho = 'additive', frho.para = 2)
#use 1000 basis functions
#each of them is a product of univariate cosine functions.
type <- 'cosine'
basisN <- 1000
sieve.model <- sieve_preprocess(X = TrainData[,2:(xdim+1)],
basisN = basisN, type = type)
#sieve.model$Phi #Phi is the design matrix
#fit a nonparametric additive model by setting interaction_order = 1
sieve.model <- sieve_preprocess(X = TrainData[,2:(xdim+1)],
basisN = basisN, type = type,
interaction_order = 1)
#sieve.model$index_matrix #for each row, there is at most one entry >= 2.
#this means no basis function varies in more than one dimension
#that is, we are fitting additive models without interaction between features.
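The way maxj and interaction_order restrict the rows of an index matrix can be sketched in base R by brute-force enumeration. The helper enumerate_indices below is hypothetical and is not the generator used by sieve_preprocess; it lists every index vector whose product is at most maxj and that has at most interaction_order entries different from 1. Its cost grows exponentially in xdim, so it is only meant for tiny illustrations.

```r
# Hypothetical helper: list all index vectors whose product is at most maxj
# and that have at most interaction_order entries different from 1.
enumerate_indices <- function(xdim, maxj, interaction_order) {
  # All candidate index vectors with entries in 1..maxj.
  grid <- as.matrix(expand.grid(rep(list(seq_len(maxj)), xdim)))
  index_product <- apply(grid, 1, prod)
  n_active <- rowSums(grid != 1)
  keep <- index_product <= maxj & n_active <= interaction_order
  kept <- grid[keep, , drop = FALSE]
  # Order rows by increasing index product, so the constant row comes first.
  kept <- kept[order(index_product[keep]), , drop = FALSE]
  unname(kept)
}

additive <- enumerate_indices(xdim = 3, maxj = 4, interaction_order = 1)
pairwise <- enumerate_indices(xdim = 3, maxj = 4, interaction_order = 2)

nrow(additive)  # 10: the constant row plus indices 2, 3, 4 in each of 3 dimensions
nrow(pairwise)  # 13: the additive rows plus three rows with two entries equal to 2
```

With interaction_order = 1 every row has at most one entry >= 2, which matches the additive structure described in the last example.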