GeneSelectR {GeneSelectR}    R Documentation
Gene Selection and Evaluation with GeneSelectR
Description
This function performs gene selection using different methods on a given training set and evaluates their performance using cross-validation. Optionally, it also calculates permutation feature importances.
Usage
GeneSelectR(
X,
y,
pipelines = NULL,
custom_fs_methods = NULL,
selected_methods = NULL,
custom_fs_grids = NULL,
classifier = NULL,
classifier_grid = NULL,
preprocessing_steps = NULL,
testsize = 0.2,
validsize = 0.2,
scoring = "accuracy",
njobs = -1,
n_splits = 5,
search_type = "random",
n_iter = 10,
max_features = 50,
calculate_permutation_importance = FALSE,
perform_test_split = FALSE,
random_state = NULL
)
Arguments
X: A matrix or data frame with features as columns and observations as rows.

y: A vector of labels corresponding to the rows of X.

pipelines: An optional list of pre-defined pipelines to use for fitting and evaluation. If this argument is provided, the feature selection methods and preprocessing steps are ignored.

custom_fs_methods: An optional list of feature selection methods to use for fitting and evaluation. If not provided, a default set of feature selection methods is used.

selected_methods: An optional vector of names of feature selection methods to use from the default set. If provided, only the specified methods are used.

custom_fs_grids: An optional list of hyperparameter grids for the feature selection methods. Each element should be a named list of parameters for a specific feature selection method, and the element names must match the names of the feature selection methods. If provided, the function performs hyperparameter tuning for the specified feature selection methods in addition to the final estimator.

classifier: An optional sklearn classifier. If NULL, sklearn's RandomForestClassifier is used. See the sketch after this list for supplying a custom classifier.

classifier_grid: An optional named list of classifier parameters. If none are provided, a default grid is used (see the vignette for the exact parameters).

preprocessing_steps: An optional named list of sklearn preprocessing procedures. If none are provided, defaults are used (see the vignette for the exact steps).

testsize: The proportion of the data held out as the test set during evaluation.

validsize: The proportion of the data held out as the validation set during evaluation.

scoring: A string giving the scoring metric used for hyperparameter tuning. Default is 'accuracy'.

njobs: Number of jobs to run in parallel; -1 (the default) uses all available cores.

n_splits: Number of train/test splits.

search_type: A string indicating the type of hyperparameter search: 'grid' for GridSearchCV or 'random' for RandomizedSearchCV. Default is 'random'.

n_iter: An integer giving the number of parameter settings sampled by RandomizedSearchCV. Only applies when search_type is 'random'.

max_features: Maximum number of features to be selected by the default feature selection methods. Cannot exceed the total number of features in the dataset.

calculate_permutation_importance: A boolean indicating whether to calculate permutation feature importance. Default is FALSE.

perform_test_split: Whether to split off a held-out test set so that performance can be evaluated on unseen data. Default is FALSE.

random_state: An integer setting the random seed for the feature selection algorithms and the cross-validation procedure. Defaults to NULL, so a different seed is used on each run; fix it for reproducibility, or leave it NULL for an unbiased estimate.
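A minimal sketch of supplying a custom classifier and grid via reticulate, given a feature matrix X and labels y (assumptions: a configured Python environment with scikit-learn, and grid names that map directly onto the classifier's arguments; the package may expect prefixed names as in sklearn pipelines, so check the vignette for the exact convention):

sklearn_ensemble <- reticulate::import("sklearn.ensemble")
custom_clf <- sklearn_ensemble$GradientBoostingClassifier()
# Assumption: plain sklearn argument names; the vignette documents the exact
# naming (e.g. whether a 'classifier__' prefix is required)
custom_grid <- list("n_estimators" = c(100L, 200L),
                    "max_depth" = c(3L, 5L))
results <- GeneSelectR(X, y,
                       classifier = custom_clf,
                       classifier_grid = custom_grid)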
Value
Returns an object of class PipelineResults with the following fields:

best_pipeline: A list of the best-fitted pipelines for each feature selection method and data split.

cv_results: A list containing cross-validation results for each pipeline, including scores and other metrics.

inbuilt_feature_importance: A list of the inbuilt feature importance scores for each pipeline, aggregated across all data splits.

test_metrics: A data frame summarizing test metrics (precision, recall, F1 score, accuracy) for each pipeline, if a test split was performed.

cv_mean_score: A data frame summarizing the mean cross-validation scores for each pipeline across all data splits.

permutation_importance: A list of permutation importance scores for each pipeline, if permutation importance calculation was enabled.

This comprehensive return structure allows for in-depth analysis of the feature selection methods and model performance.
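A minimal sketch of inspecting the returned object, given a feature matrix X and labels y (assumption: PipelineResults is an S4 class whose fields are slots accessed with @; if it is list-like, use $ instead):

results <- GeneSelectR(X, y)
results@cv_mean_score               # mean CV score per pipeline
results@inbuilt_feature_importance  # aggregated inbuilt importances
results@test_metrics                # populated only when perform_test_split = TRUE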
Examples
if (GeneSelectR:::check_python_modules_available(c("numpy", "pandas", "sklearn", "boruta"))) {
# Create a mock dataset with 100 feature columns and 1 binary label column
set.seed(123) # for reproducibility
n_rows <- 10
n_features <- 100
# Randomly generate feature data
X <- as.data.frame(matrix(rnorm(n_rows * n_features), nrow = n_rows, ncol = n_features))
# Ensure each feature has a variance greater than 0.85
for(i in 1:ncol(X)) {
while(var(X[[i]]) <= 0.85) {
X[[i]] <- X[[i]] * 1.1
}
}
colnames(X) <- paste0("Feature", 1:n_features)
# Create a mock binary label column
y <- factor(sample(c("Class1", "Class2"), n_rows, replace = TRUE))
# Set up the environment
GeneSelectR::configure_environment()
GeneSelectR::set_reticulate_python()
# Run GeneSelectR
results <- GeneSelectR(X, y)
# Perform gene selection and evaluation using user-defined methods
fs_methods <- list("Lasso" = select_model(lasso(penalty = 'l1',
C = 0.1,
solver = 'saga'),
threshold = 'median'))
custom_fs_grids <- list("Lasso" = list('C' = c(0.1, 1, 10)))
results <- GeneSelectR(X,
y,
max_features = 15,
custom_fs_methods = fs_methods,
custom_fs_grids = custom_fs_grids)
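# Optional flags, shown as a sketch: hold out a test set for evaluation on
# unseen data and compute permutation feature importance (both default FALSE)
results <- GeneSelectR(X,
                       y,
                       perform_test_split = TRUE,
                       calculate_permutation_importance = TRUE)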
} else {
message("Skipping example as not all required Python modules are available.")
}