GeneSelectR {GeneSelectR}    R Documentation

Gene Selection and Evaluation with GeneSelectR

Description

This function performs gene selection using different methods on a given training set and evaluates their performance using cross-validation. Optionally, it also calculates permutation feature importances.

Usage

GeneSelectR(
  X,
  y,
  pipelines = NULL,
  custom_fs_methods = NULL,
  selected_methods = NULL,
  custom_fs_grids = NULL,
  classifier = NULL,
  classifier_grid = NULL,
  preprocessing_steps = NULL,
  testsize = 0.2,
  validsize = 0.2,
  scoring = "accuracy",
  njobs = -1,
  n_splits = 5,
  search_type = "random",
  n_iter = 10,
  max_features = 50,
  calculate_permutation_importance = FALSE,
  perform_test_split = FALSE,
  random_state = NULL
)

Arguments

X

A matrix or data frame with features as columns and observations as rows.

y

A vector of labels corresponding to the rows of X.

pipelines

An optional list of pre-defined pipelines to use for fitting and evaluation. If this argument is provided, the feature selection methods and preprocessing steps will be ignored.

custom_fs_methods

An optional list of feature selection methods to use for fitting and evaluation. If this argument is not provided, a default set of feature selection methods will be used.

selected_methods

An optional vector of names of feature selection methods to use from the default set. If this argument is provided, only the specified methods will be used.

custom_fs_grids

An optional list of hyperparameter grids for the feature selection methods. Each element of the list should be a named list of parameters for a specific feature selection method. The names of the elements should match the names of the feature selection methods. If this argument is provided, the function will perform hyperparameter tuning for the specified feature selection methods in addition to the final estimator.

classifier

An optional sklearn classifier. If NULL, sklearn's RandomForestClassifier is used.

classifier_grid

An optional named list of classifier parameters. If none are provided, the default grid is used (see the vignette for the exact parameters).

preprocessing_steps

An optional named list of sklearn preprocessing procedures. If none are provided, the defaults are used (see the vignette for the exact parameters).

testsize

The size of the test set used in the evaluation. Default is 0.2.

validsize

The size of the validation set used in the evaluation. Default is 0.2.

scoring

A string naming the scoring metric to use for hyperparameter tuning. Default is 'accuracy'.

njobs

Number of jobs to run in parallel. Default is -1 (in sklearn's n_jobs convention, all available processors).

n_splits

Number of train/test splits. Default is 5.

search_type

A string indicating the type of search to use. 'grid' for GridSearchCV and 'random' for RandomizedSearchCV. Default is 'random'.

n_iter

An integer indicating the number of parameter settings that are sampled in RandomizedSearchCV. Only applies when search_type is 'random'.

max_features

Maximum number of features to be selected by default feature selection methods. Max features cannot exceed the total number of features in a dataset.

calculate_permutation_importance

A boolean indicating whether to calculate permutation feature importance. Default is FALSE.

perform_test_split

A boolean indicating whether to split the data into training and test sets so that performance is also evaluated on an unseen test set. Default is FALSE.

random_state

An integer seed for the feature selection algorithms and the cross-validation procedure. The default, NULL, uses a different random seed on every run. Set it to a fixed integer for reproducibility; leave it as NULL for an unbiased estimate.
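
Details

Several of the constraints stated under Arguments can be checked in plain R before the function is called. The following is a minimal sketch only: check_geneselectr_inputs is a hypothetical helper that is not part of GeneSelectR, and it covers just three of the documented constraints (y has one label per row of X, max_features does not exceed the number of features, and the names in custom_fs_grids match the names in custom_fs_methods).

```r
# Hypothetical helper, not part of GeneSelectR: basic sanity checks on the
# arguments described above before calling GeneSelectR().
check_geneselectr_inputs <- function(X, y, max_features = 50,
                                     custom_fs_methods = NULL,
                                     custom_fs_grids = NULL) {
  stopifnot(is.matrix(X) || is.data.frame(X))

  # y must supply one label per observation (row) of X
  if (length(y) != nrow(X)) {
    stop("length(y) must equal nrow(X)")
  }

  # max_features cannot exceed the number of feature columns
  if (max_features > ncol(X)) {
    stop("max_features exceeds the number of features in X")
  }

  # Grid names should match the names of the feature selection methods
  if (!is.null(custom_fs_grids) && !is.null(custom_fs_methods)) {
    unmatched <- setdiff(names(custom_fs_grids), names(custom_fs_methods))
    if (length(unmatched) > 0) {
      stop("No feature selection method named: ",
           paste(unmatched, collapse = ", "))
    }
  }

  invisible(TRUE)
}

# Usage with mock data: 10 observations (rows) and 100 features (columns)
X <- matrix(rnorm(10 * 100), nrow = 10, ncol = 100)
y <- factor(rep(c("Class1", "Class2"), each = 5))
check_geneselectr_inputs(X, y, max_features = 15)
```

The grid-name check runs only when both custom_fs_methods and custom_fs_grids are supplied, since this page does not list the names of the default methods.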

Value

Returns an object of class PipelineResults.

Examples


if (GeneSelectR:::check_python_modules_available(c("numpy", "pandas", "sklearn", 'boruta'))) {
  # Create a mock dataset with 100 feature columns and a binary label vector
  set.seed(123) # for reproducibility
  n_rows <- 10
  n_features <- 100

  # Randomly generate feature data
  X <- as.data.frame(matrix(rnorm(n_rows * n_features), nrow = n_rows, ncol = n_features))
  # Ensure each feature has a variance greater than 0.85
  for (i in seq_len(ncol(X))) {
    while (var(X[[i]]) <= 0.85) {
      X[[i]] <- X[[i]] * 1.1
    }
  }
  colnames(X) <- paste0("Feature", 1:n_features)

  # Create a mock binary label column
  y <- factor(sample(c("Class1", "Class2"), n_rows, replace = TRUE))

  # Set up the environment
  GeneSelectR::configure_environment()
  GeneSelectR::set_reticulate_python()

  # Run GeneSelectR
  results <- GeneSelectR(X, y)

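  # Perform gene selection and evaluation using user-defined methods.
  # Assumption: select_model and lasso are not defined elsewhere in this
  # example, so they are bound here to sklearn's SelectFromModel and
  # LogisticRegression via reticulate.
  select_model <- reticulate::import("sklearn.feature_selection")$SelectFromModel
  lasso <- reticulate::import("sklearn.linear_model")$LogisticRegression
  fs_methods <- list("Lasso" = select_model(lasso(penalty = 'l1',
                                                  C = 0.1,
                                                  solver = 'saga'),
                                            threshold = 'median'))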
  custom_fs_grids <- list("Lasso" = list('C' = c(0.1, 1, 10)))
  results <- GeneSelectR(X,
                         y,
                         max_features = 15,
                         custom_fs_methods = fs_methods,
                         custom_fs_grids = custom_fs_grids)
} else {
  message("Skipping example as not all required Python modules are available.")
}
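
The search_type and n_iter arguments together determine how many parameter settings are tried. The sketch below needs no Python and only counts candidates. It assumes the usual behaviour of sklearn's GridSearchCV (every combination) and RandomizedSearchCV (at most n_iter sampled settings). The grid itself is illustrative and is not a GeneSelectR default.

```r
# Illustrative grid in the same named-list shape used for custom_fs_grids
# entries and classifier_grid; parameter names and values are examples only.
param_grid <- list(
  C = c(0.01, 0.1, 1, 10),
  max_iter = c(100, 500, 1000),
  penalty = c("l1", "l2")
)

# search_type = "grid": every combination of values is a candidate
n_grid_candidates <- prod(lengths(param_grid))

# search_type = "random": at most n_iter settings are sampled
n_iter <- 10
n_random_candidates <- min(n_iter, n_grid_candidates)

cat("grid search candidates:  ", n_grid_candidates, "\n")
cat("random search candidates:", n_random_candidates, "\n")
```

With a grid this size, a random search with n_iter = 10 tries fewer than half of the combinations, which is the trade-off to weigh when choosing between the two search types.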


[Package GeneSelectR version 1.0.1 Index]