precompute_feature_info {familiar}		R Documentation

Pre-compute feature information

Description

Creates data assignment and subsequently extracts feature information such as normalisation and clustering parameters.

Usage

precompute_feature_info(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  experimental_design = "fs+mb",
  verbose = TRUE,
  ...
)

Arguments

formula

An R formula. The formula can only contain feature names and dot (.). The * and +1 operators are not supported as these refer to columns that are not present in the data set.

Use of the formula interface is optional.
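As a sketch, the formula interface could be used as follows; the data object and column names are hypothetical, not part of familiar:

```r
# Hypothetical data set with an outcome column and feature columns.
# The dot selects all remaining columns as features.
info <- precompute_feature_info(
  formula = outcome ~ .,
  data = my_data,
  outcome_type = "binomial"
)
```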

data

A data.table object, a data.frame object, a list containing multiple data.table or data.frame objects, or paths to data files.

data should be provided if no file paths are provided to the data_files argument. If both are provided, only data will be used.

All data is expected to be in wide format, and ideally has a sample identifier (see sample_id_column), batch identifier (see cohort_column) and outcome columns (see outcome_column).

In case paths are provided, the data should be stored as csv, rds or RData files. See documentation for the data_files argument for more information.
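A minimal wide-format data set with the recommended identifier and outcome columns might be set up as follows; all column names and values are illustrative:

```r
# Hypothetical wide-format data: batch, sample and outcome columns,
# plus two feature columns.
my_data <- data.frame(
  cohort = c("study_A", "study_A", "study_B", "study_B"),
  subject = c("s1", "s2", "s3", "s4"),
  outcome = c("a", "b", "a", "b"),
  feature_1 = c(0.5, 1.2, 0.7, 2.1),
  feature_2 = c(10.0, 14.0, 9.0, 16.0)
)

info <- precompute_feature_info(
  data = my_data,
  batch_id_column = "cohort",
  sample_id_column = "subject",
  outcome_column = "outcome",
  outcome_type = "binomial"
)
```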

experiment_data

Experimental data may be provided in the form of an experimentData object, for example one returned by a previous call to this function. Providing such an object allows the experiment to be warm-started.

cl

Cluster created using the parallel package. This cluster is then used to speed up computation through parallelisation. When a cluster is not provided, parallelisation is performed by setting up a cluster on the local machine.

This parameter has no effect if the parallel argument is set to FALSE.
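Passing an externally created cluster might look like the following sketch; my_data is a hypothetical data set, and since the cluster is created outside familiar, stopping it is the caller's responsibility:

```r
# Create a cluster with 2 nodes using the parallel package, pass it to
# familiar, and stop it afterwards.
cl <- parallel::makeCluster(2)
info <- precompute_feature_info(data = my_data, cl = cl)
parallel::stopCluster(cl)
```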

experimental_design

(required) Defines what the experiment looks like, e.g. cv(bt(fs,20)+mb,3,2) for 2 times repeated 3-fold cross-validation with nested feature selection on 20 bootstraps and model-building. The basic workflow components are:

  • fs: (required) feature selection step.

  • mb: (required) model building step.

  • ev: (optional) external validation. If validation batches or cohorts are present in the dataset (data), these should be indicated in the validation_batch_id argument.

The different components are linked using +.

Different subsampling methods can be used in conjunction with the basic workflow components:

  • bs(x,n): (stratified) .632 bootstrap, with n the number of bootstraps. In contrast to bt, feature pre-processing parameters and hyperparameter optimisation are conducted on individual bootstraps.

  • bt(x,n): (stratified) .632 bootstrap, with n the number of bootstraps. Unlike bs and other subsampling methods, no separate pre-processing parameters or optimised hyperparameters will be determined for each bootstrap.

  • cv(x,n,p): (stratified) n-fold cross-validation, repeated p times. Pre-processing parameters are determined for each iteration.

  • lv(x): leave-one-out-cross-validation. Pre-processing parameters are determined for each iteration.

  • ip(x): imbalance partitioning for addressing class imbalances on the data set. Pre-processing parameters are determined for each partition. The number of partitions generated depends on the imbalance correction method (see the imbalance_correction_method parameter).

As shown in the example above, sampling algorithms can be nested.

Though precompute_feature_info neither determines variable importance nor learns models, the corresponding elements are still required to prevent issues when using the resulting experimentData object to warm-start experiments.

The simplest valid experimental design is fs+mb. This is the default in precompute_feature_info, and will determine feature parameters over the entire dataset.

This argument is ignored if the experiment_data argument is set.
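A few illustrative design strings, combining the workflow components and subsampling methods described above (my_data is a hypothetical data set):

```r
# Default: feature selection and model building over the entire data set.
design_default <- "fs+mb"

# The example from above: 2 times repeated 3-fold cross-validation,
# with feature selection on 20 bootstraps nested in each iteration.
design_cv <- "cv(bt(fs,20)+mb,3,2)"

# Development plus external validation; the validation batches should
# be named in validation_batch_id.
design_ev <- "fs+mb+ev"

info <- precompute_feature_info(
  data = my_data,
  experimental_design = design_default
)
```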

verbose

Indicates verbosity of the results. Default is TRUE, and all messages and warnings are returned.

...

Arguments passed on to .parse_experiment_settings, .parse_setup_settings, .parse_preprocessing_settings

batch_id_column

(recommended) Name of the column containing batch or cohort identifiers. This parameter is required if more than one dataset is provided, or if external validation is performed.

In familiar any row of data is organised by four identifiers:

  • The batch identifier batch_id_column: This denotes the group to which a set of samples belongs, e.g. patients from a single study, samples measured in a batch, etc. The batch identifier is used for batch normalisation, as well as selection of development and validation datasets.

  • The sample identifier sample_id_column: This denotes the sample level, e.g. data from a single individual. Subsets of data, e.g. bootstraps or cross-validation folds, are created at this level.

  • The series identifier series_id_column: Indicates measurements on a single sample that may not share the same outcome value, e.g. a time series, or the number of cells in a view.

  • The repetition identifier: Indicates repeated measurements in a single series where any feature values may differ, but the outcome does not. Repetition identifiers are always implicitly set when multiple entries for the same series of the same sample in the same batch that share the same outcome are encountered.
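The identifier levels can be illustrated with a small hypothetical data set, in which sample s1 forms a series of two time points:

```r
# Hypothetical data: one batch, two samples, with sample "s1"
# measured at two time points (a series).
my_data <- data.frame(
  cohort = c("batch_1", "batch_1", "batch_1"),
  subject = c("s1", "s1", "s2"),
  timepoint = c(1, 2, 1),
  outcome = c(1.2, 1.8, 0.4),
  feature_1 = c(0.5, 0.6, 1.1)
)

info <- precompute_feature_info(
  data = my_data,
  batch_id_column = "cohort",
  sample_id_column = "subject",
  series_id_column = "timepoint",
  outcome_column = "outcome",
  outcome_type = "continuous"
)
```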

sample_id_column

(recommended) Name of the column containing sample or subject identifiers. See batch_id_column above for more details.

If unset, every row will be identified as a single sample.

series_id_column

(optional) Name of the column containing series identifiers, which distinguish between measurements that are part of a series for a single sample. See batch_id_column above for more details.

If unset, rows which share the same batch and sample identifiers but have a different outcome are assigned unique series identifiers.

development_batch_id

(optional) One or more batch or cohort identifiers to constitute data sets for development. Defaults to all, or all minus the identifiers in validation_batch_id for external validation. Required if external validation is performed and validation_batch_id is not provided.

validation_batch_id

(optional) One or more batch or cohort identifiers to constitute data sets for external validation. Defaults to all data sets except those in development_batch_id for external validation, or none if not. Required if development_batch_id is not provided.

outcome_name

(optional) Name of the modelled outcome. This name will be used in figures created by familiar.

If not set, the column name in outcome_column will be used for binomial, multinomial, count and continuous outcomes. For other outcomes (survival and competing_risk) no default is used.

outcome_column

(recommended) Name of the column containing the outcome of interest. May be identified from a formula, if a formula is provided as an argument. Otherwise an error is raised. Note that survival and competing_risk outcomes require two columns that indicate the time-to-event (or time of last follow-up) and the event status.

outcome_type

(recommended) Type of outcome found in the outcome column. The outcome type determines many aspects of the overall process, e.g. the available feature selection methods and learners, but also the type of assessments that can be conducted to evaluate the resulting models. Implemented outcome types are:

  • binomial: categorical outcome with 2 levels.

  • multinomial: categorical outcome with 2 or more levels.

  • count: Poisson-distributed numeric outcomes.

  • continuous: general continuous numeric outcomes.

  • survival: survival outcome for time-to-event data.

If not provided, the algorithm will attempt to obtain outcome_type from the contents of the outcome column. This may lead to unexpected results, and we therefore advise providing this information manually.

Note that competing_risk survival analyses are not fully supported, and competing_risk is currently not a valid choice for outcome_type.

class_levels

(optional) Class levels for binomial or multinomial outcomes. This argument can be used to specify the ordering of levels for categorical outcomes. These class levels must exactly match the levels present in the outcome column.

event_indicator

(recommended) Indicator for events in survival and competing_risk analyses. familiar will automatically recognise 1, true, t, y and yes as event indicators, including different capitalisations. If this parameter is set, it replaces the default values.

censoring_indicator

(recommended) Indicator for right-censoring in survival and competing_risk analyses. familiar will automatically recognise 0, false, f, n, no as censoring indicators, including different capitalisations. If this parameter is set, it replaces the default values.

competing_risk_indicator

(recommended) Indicator for competing risks in competing_risk analyses. There are no default values, and if unset, all values other than those specified by the event_indicator and censoring_indicator parameters are considered to indicate competing risks.

signature

(optional) One or more names of feature columns that are considered part of a specific signature. Features specified here will always be used for modelling. Ranking from feature selection has no effect for these features.

novelty_features

(optional) One or more names of feature columns that should be included for the purpose of novelty detection.

exclude_features

(optional) Feature columns that will be removed from the data set. Cannot overlap with features in signature, novelty_features or include_features.

include_features

(optional) Feature columns that are specifically included in the data set. By default all features are included. Cannot overlap with exclude_features, but may overlap signature. Features in signature and novelty_features are always included. If both exclude_features and include_features are provided, include_features takes precedence, provided that there is no overlap between the two.

reference_method

(optional) Method used to set reference levels for categorical features. There are several options:

  • auto (default): Categorical features that are not explicitly set by the user, i.e. columns containing boolean values or characters, use the most frequent level as reference. Categorical features that are explicitly set, i.e. as factors, are used as is.

  • always: Both automatically detected and user-specified categorical features have the reference level set to the most frequent level. Ordinal features are not altered, but are used as is.

  • never: User-specified categorical features are used as is. Automatically detected categorical features are simply sorted, and the first level is then used as the reference level. This was the behaviour prior to familiar version 1.3.0.

imbalance_correction_method

(optional) Type of method used to address class imbalances. Available options are:

  • full_undersampling (default): All data will be used in an ensemble fashion. The full minority class will appear in each partition, but majority classes are undersampled until all data have been used.

  • random_undersampling: Randomly undersamples majority classes. This is useful in cases where full undersampling would lead to the formation of many models due to major overrepresentation of the largest class.

This parameter is only used in combination with imbalance partitioning in the experimental design, and ip should therefore appear in the string that defines the design.

imbalance_n_partitions

(optional) Number of times random undersampling should be repeated. 10 undersampled subsets with balanced classes are formed by default.

parallel

(optional) Enable parallel processing. Defaults to TRUE. When set to FALSE, this disables all parallel processing, regardless of specific parameters such as parallel_preprocessing. However, when parallel is TRUE, parallel processing of different parts of the workflow can be disabled by setting respective flags to FALSE.

parallel_nr_cores

(optional) Number of cores available for parallelisation. Defaults to 2. This setting does nothing if parallelisation is disabled.

restart_cluster

(optional) Restart nodes used for parallel computing to free up memory prior to starting a parallel process. Note that it does take time to set up the clusters. Therefore setting this argument to TRUE may impact processing speed. This argument is ignored if parallel is FALSE or the cluster was initialised outside of familiar. Default is FALSE, which causes the clusters to be initialised only once.

cluster_type

(optional) Selection of the cluster type for parallel processing. Available types are the ones supported by the parallel package that is part of the base R distribution: psock (default), fork, mpi, nws, sock. In addition, none is available, which also disables parallel processing.

backend_type

(optional) Selection of the backend for distributing copies of the data. This backend ensures that only a single master copy is kept in memory. This limits memory usage during parallel processing.

Several backend options are available, notably socket_server, and none (default). socket_server is based on the callr package and R sockets, comes with familiar and is available for any OS. none uses the package environment of familiar to store data, and is available for any OS. However, none requires copying of data to any parallel process, and has a larger memory footprint.

server_port

(optional) Integer indicating the port on which the socket server or RServe process should communicate. Defaults to port 6311. Note that ports 0 to 1024 and 49152 to 65535 cannot be used.

feature_max_fraction_missing

(optional) Numeric value between 0.0 and 0.95 that determines the maximum fraction of missing values that still allows a feature to be included in the data set. All features with a missing value fraction over this threshold are not processed further. The default value is 0.30.

sample_max_fraction_missing

(optional) Numeric value between 0.0 and 0.95 that determines the maximum fraction of missing values that still allows a sample to be included in the data set. All samples with a missing value fraction over this threshold are excluded and not processed further. The default value is 0.30.

filter_method

(optional) One or more methods used to reduce dimensionality of the data set by removing irrelevant or poorly reproducible features.

Several methods are available:

  • none (default): None of the features will be filtered.

  • low_variance: Features with a variance below the low_var_minimum_variance_threshold are filtered. This can be useful to filter, for example, genes that are not differentially expressed.

  • univariate_test: Features undergo a univariate regression using an outcome-appropriate regression model. The p-value of the model coefficient is collected. Features with coefficient p or q-value above the univariate_test_threshold are subsequently filtered.

  • robustness: Features that are not sufficiently robust according to the intraclass correlation coefficient are filtered. Use of this method requires that repeated measurements are present in the data set, i.e. there should be entries for which the sample and cohort identifiers are the same.

More than one method can be used simultaneously. Features with singular values are always filtered, as these do not contain information.

univariate_test_threshold

(optional) Numeric value between 0.0 and 1.0 that determines which features are irrelevant and will be filtered by the univariate_test method. The p or q-values are compared against this threshold. All features with values above the threshold are filtered. The default value is 0.20.

univariate_test_threshold_metric

(optional) Metric whose values are compared against the univariate_test_threshold. The following metrics can be chosen:

  • p_value (default): The unadjusted p-value of each feature is used to filter features.

  • q_value: The q-value (Storey, 2002) is used to filter features. Some data sets may have insufficient samples to compute the q-value. The qvalue package must be installed from Bioconductor to use this method.

univariate_test_max_feature_set_size

(optional) Maximum size of the feature set after the univariate test. P or q values of features are compared against the threshold, but if the resulting data set would be larger than this setting, only the most relevant features up to the desired feature set size are selected.

The default value is NULL, which causes features to be filtered based on their relevance only.

low_var_minimum_variance_threshold

(required, if used) Numeric value that determines which features will be filtered by the low_variance method. The variance of each feature is computed and compared to the threshold. If it is below the threshold, the feature is removed.

This parameter has no default value and should be set if low_variance is used.

low_var_max_feature_set_size

(optional) Maximum size of the feature set after filtering features with a low variance. All features are first compared against low_var_minimum_variance_threshold. If the resulting feature set would be larger than specified, only the most strongly varying features will be selected, up to the desired size of the feature set.

The default value is NULL, which causes features to be filtered based on their variance only.

robustness_icc_type

(optional) String indicating the type of intraclass correlation coefficient (1, 2 or 3) that should be used to compute robustness for features in repeated measurements. These types correspond to the types in Shrout and Fleiss (1979). The default value is 1.

robustness_threshold_metric

(optional) String indicating which specific intraclass correlation coefficient (ICC) metric should be used to filter features. This should be one of:

  • icc: The estimated ICC value itself.

  • icc_low (default): The estimated lower limit of the 95% confidence interval of the ICC, as suggested by Koo and Li (2016).

  • icc_panel: The estimated ICC value over the panel average, i.e. the ICC that would be obtained if all repeated measurements were averaged.

  • icc_panel_low: The estimated lower limit of the 95% confidence interval of the panel ICC.

robustness_threshold_value

(optional) The intraclass correlation coefficient value that is used as the threshold. The default value is 0.70.

transformation_method

(optional) The transformation method used to change the distribution of the data to be more normal-like. The following methods are available:

  • none: This disables transformation of features.

  • yeo_johnson (default): Transformation using the Yeo-Johnson transformation (Yeo and Johnson, 2000). The algorithm tests various lambda values and selects the lambda that maximises the log-likelihood.

  • yeo_johnson_trim: As yeo_johnson, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.

  • yeo_johnson_winsor: As yeo_johnson, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.

  • yeo_johnson_robust: A robust version of yeo_johnson after Raymaekers and Rousseeuw (2021). This method is less sensitive to outliers.

  • box_cox: Transformation using the Box-Cox transformation (Box and Cox, 1964). Unlike the Yeo-Johnson transformation, the Box-Cox transformation requires that all data are positive. Features that contain zero or negative values cannot be transformed using this transformation. The algorithm tests various lambda values and selects the lambda that maximises the log-likelihood.

  • box_cox_trim: As box_cox, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.

  • box_cox_winsor: As box_cox, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.

  • box_cox_robust: A robust version of box_cox after Raymaekers and Rousseeuw (2021). This method is less sensitive to outliers.

Only features that contain numerical data are transformed. Transformation parameters obtained in development data are stored within featureInfo objects for later use with validation data sets.

normalisation_method

(optional) The normalisation method used to improve the comparability between numerical features that may have very different scales. The following normalisation methods can be chosen:

  • none: This disables feature normalisation.

  • standardisation: Features are normalised by subtraction of their mean values and division by their standard deviations. This causes every feature to have a center value of 0.0 and a standard deviation of 1.0.

  • standardisation_trim: As standardisation, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.

  • standardisation_winsor: As standardisation, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.

  • standardisation_robust (default): A robust version of standardisation that relies on computing Huber's M-estimators for location and scale.

  • normalisation: Features are normalised by subtraction of their minimum values and division by their ranges. This maps all feature values to a [0, 1] interval.

  • normalisation_trim: As normalisation, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.

  • normalisation_winsor: As normalisation, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.

  • quantile: Features are normalised by subtraction of their median values and division by their interquartile range.

  • mean_centering: Features are centered by subtracting the mean, but do not undergo rescaling.

Only features that contain numerical data are normalised. Normalisation parameters obtained in development data are stored within featureInfo objects for later use with validation data sets.

batch_normalisation_method

(optional) The method used for batch normalisation. Available methods are:

  • none (default): This disables batch normalisation of features.

  • standardisation: Features within each batch are normalised by subtraction of the mean value and division by the standard deviation in each batch.

  • standardisation_trim: As standardisation, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.

  • standardisation_winsor: As standardisation, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.

  • standardisation_robust: A robust version of standardisation that relies on computing Huber's M-estimators for location and scale within each batch.

  • normalisation: Features within each batch are normalised by subtraction of their minimum values and division by their range in each batch. This maps all feature values in each batch to a [0, 1] interval.

  • normalisation_trim: As normalisation, but based on the set of feature values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.

  • normalisation_winsor: As normalisation, but based on the set of feature values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.

  • quantile: Features in each batch are normalised by subtraction of the median value and division by the interquartile range of each batch.

  • mean_centering: Features in each batch are centered on 0.0 by subtracting the mean value in each batch, but are not rescaled.

  • combat_parametric: Batch adjustments using parametric empirical Bayes (Johnson et al, 2007). combat_p leads to the same method.

  • combat_non_parametric: Batch adjustments using non-parametric empirical Bayes (Johnson et al, 2007). combat_np and combat lead to the same method. Note that we reduced complexity from O(n^2) to O(n) by only computing batch adjustment parameters for each feature on a subset of 50 randomly selected features, instead of all features.

Only features that contain numerical data are normalised using batch normalisation. Batch normalisation parameters obtained in development data are stored within featureInfo objects for later use with validation data sets, in case the validation data is from the same batch.

If validation data contains data from unknown batches, normalisation parameters are separately determined for these batches.

Note that for both empirical Bayes methods, the batch effect is assumed to manifest across the features. This is often true for data such as gene expression, but the assumption may not hold generally.

When performing batch normalisation, it is moreover important to check that differences between batches or cohorts are not related to the studied endpoint.

imputation_method

(optional) Method used for imputing missing feature values. Two methods are implemented:

  • simple: Simple replacement of a missing value by the median value (for numeric features) or the modal value (for categorical features).

  • lasso: Imputation of missing value by lasso regression (using glmnet) based on information contained in other features.

simple imputation precedes lasso imputation to ensure that any missing values in predictors required for lasso regression are resolved. The lasso estimate is then used to replace the missing value.

The default value depends on the number of features in the dataset. If the number is lower than 100, lasso is used by default, and simple otherwise.

Only single imputation is performed. Imputation models and parameters are stored within featureInfo objects for later use with validation data sets.

cluster_method

(optional) Clustering is performed to identify and replace redundant features, for example those that are highly correlated. Such features do not carry much additional information and may be removed or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).

The cluster method determines the algorithm used to form the clusters. The following cluster methods are implemented:

  • none: No clustering is performed.

  • hclust (default): Hierarchical agglomerative clustering. If the fastcluster package is installed, fastcluster::hclust is used (Muellner 2013), otherwise stats::hclust is used.

  • agnes: Hierarchical clustering using agglomerative nesting (Kaufman and Rousseeuw, 1990). This algorithm is similar to hclust, but uses the cluster::agnes implementation.

  • diana: Divisive analysis hierarchical clustering. This method uses divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990). cluster::diana is used.

  • pam: Partitioning around medoids. This partitions the data into k clusters around medoids (Kaufman and Rousseeuw, 1990). k is selected using the silhouette metric. pam is implemented using the cluster::pam function.

Clusters and cluster information is stored within featureInfo objects for later use with validation data sets. This enables reproduction of the same clusters as formed in the development data set.

cluster_linkage_method

(optional) Linkage method used for agglomerative clustering in hclust and agnes. The following linkage methods can be used:

  • average (default): Average linkage.

  • single: Single linkage.

  • complete: Complete linkage.

  • weighted: Weighted linkage, also known as McQuitty linkage.

  • ward: Linkage using Ward's minimum variance method.

diana and pam do not require a linkage method.

cluster_cut_method

(optional) The method used to define the actual clusters. The following methods can be used:

  • silhouette: Clusters are formed based on the silhouette score (Rousseeuw, 1987). The average silhouette score is computed from 2 to n clusters, with n the number of features. Clusters are only formed if the average silhouette exceeds 0.50, which indicates reasonable evidence for structure. This procedure may be slow if the number of features is large (>100s).

  • fixed_cut: Clusters are formed by cutting the hierarchical tree at the point indicated by the cluster_similarity_threshold, e.g. where features in a cluster have an average Spearman correlation of 0.90. fixed_cut is only available for agnes, diana and hclust.

  • dynamic_cut: Dynamic cluster formation using the cutting algorithm in the dynamicTreeCut package. This package should be installed to select this option. dynamic_cut can only be used with agnes and hclust.

The default options are silhouette for partitioning around medoids (pam) and fixed_cut otherwise.

cluster_similarity_metric

(optional) Clusters are formed based on feature similarity. All features are compared in a pair-wise fashion to compute similarity, for example correlation. The resulting similarity grid is converted into a distance matrix that is subsequently used for clustering. The following metrics are supported to compute pairwise similarities:

  • mutual_information (default): normalised mutual information.

  • mcfadden_r2: McFadden's pseudo R-squared (McFadden, 1974).

  • cox_snell_r2: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).

  • nagelkerke_r2: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).

  • spearman: Spearman's rank order correlation.

  • kendall: Kendall rank correlation.

  • pearson: Pearson product-moment correlation.

The pseudo R-squared metrics can be used to assess similarity between mixed pairs of numeric and categorical features, as these are based on the log-likelihood of regression models. In familiar, the more informative feature is used as the predictor and the other feature as the response variable. In numeric-categorical pairs, the numeric feature is considered to be more informative and is thus used as the predictor. In categorical-categorical pairs, the feature with the most levels is used as the predictor.

In case any of the classical correlation coefficients (pearson, spearman and kendall) are used with (mixed) categorical features, the categorical features are one-hot encoded and the mean correlation over all resulting pairs is used as similarity.

cluster_similarity_threshold

(optional) The threshold level for pair-wise similarity that is required to form clusters using fixed_cut. This should be a numerical value between 0.0 and 1.0. Note however, that a reasonable threshold value depends strongly on the similarity metric. The following are the default values used:

  • mcfadden_r2 and mutual_information: 0.30

  • cox_snell_r2 and nagelkerke_r2: 0.75

  • spearman, kendall and pearson: 0.90

Alternatively, if the fixed_cut method is not used, this value determines whether any clustering should be performed, because the data may not contain highly similar features. The default values in this situation are:

  • mcfadden_r2 and mutual_information: 0.25

  • cox_snell_r2 and nagelkerke_r2: 0.40

  • spearman, kendall and pearson: 0.70

The threshold value is converted to a distance (1-similarity) prior to cutting hierarchical trees.
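The threshold-to-distance conversion can be sketched in plain R; the tree object below is hypothetical:

```r
# Cutting a hierarchical tree at a height corresponding to the
# similarity threshold: distance = 1 - similarity.
similarity_threshold <- 0.90  # default for spearman, kendall, pearson
cut_height <- 1 - similarity_threshold

# With a previously computed hclust tree (hypothetical):
# clusters <- stats::cutree(hclust_tree, h = cut_height)
```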

cluster_representation_method

(optional) Method used to determine how the information of co-clustered features is summarised and used to represent the cluster. The following methods can be selected:

  • best_predictor (default): The feature with the highest importance according to univariate regression with the outcome is used to represent the cluster.

  • medioid: The feature closest to the cluster center, i.e. the feature that is most similar to the remaining features in the cluster, is used to represent the cluster.

  • mean: A meta-feature is generated by averaging the feature values for all features in a cluster. Features are first aligned so that they are positively correlated prior to averaging. Should a cluster contain one or more categorical features, the medioid method will be used instead, as averaging is not possible. Note that if this method is chosen, the normalisation_method parameter should be one of standardisation, standardisation_trim, standardisation_winsor or quantile.

If the pam cluster method is selected, only the medioid method can be used. In that case, a single medioid is used by default.

parallel_preprocessing

(optional) Enable parallel processing for the preprocessing workflow. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while preprocessing, regardless of the settings of the parallel parameter. parallel_preprocessing is ignored if parallel=FALSE.

Details

This is a thin wrapper around summon_familiar that behaves like it, but automatically skips the computation of variable importance, learning, and subsequent evaluation steps.

The function returns an experimentData object, which can be used to warm-start other experiments by providing it to the experiment_data argument.
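Warm-starting might then look as follows; my_data is a hypothetical data set, and the fs_method and learner values are illustrative choices rather than defaults:

```r
# Pre-compute feature information once ...
feature_info <- precompute_feature_info(
  data = my_data,
  outcome_column = "outcome",
  outcome_type = "binomial"
)

# ... and reuse it to warm-start a full experiment.
summon_familiar(
  data = my_data,
  experiment_data = feature_info,
  outcome_column = "outcome",
  outcome_type = "binomial",
  fs_method = "mrmr",  # illustrative feature selection method
  learner = "glm"      # illustrative learner
)
```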

Value

An experimentData object.


[Package familiar version 1.4.6 Index]