R: Optimal Cross-Validated Number of Embedding Dimensions

lol.xval.optimal_dimselect {lolR}

R Documentation

Optimal Cross-Validated Number of Embedding Dimensions

Description

Performs cross-validation (leave-one-out by default) for a given embedding model, allowing users to determine the optimal number of embedding dimensions for their algorithm of choice. The function produces fold-wise cross-validated misclassification rates for standard embedding techniques across a specified selection of embedding dimensions. The optimal embedding dimension is the one with the lowest average misclassification rate across all folds. Users can optionally supply custom embedding techniques by configuring the alg.* parameters and hyperparameters. Any classifier implementing the S3 predict function can be used for classification; classifier hyperparameters are specified through the classifier.* parameters.
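
The selection rule described above can be sketched in base R. This is only an illustration of "lowest average misclassification rate across folds", not the package's internal code; the data frame and its column names (`fold`, `r`, `lhat`) are assumptions made up for the example.

```r
# Hypothetical fold-wise misclassification rates for three candidate
# embedding dimensions. Column names are assumptions for this sketch.
folds <- data.frame(
  fold = rep(1:3, times = 3),
  r    = rep(c(5, 10, 15), each = 3),
  lhat = c(0.30, 0.25, 0.35,   # r = 5
           0.10, 0.20, 0.15,   # r = 10
           0.20, 0.20, 0.25)   # r = 15
)

# Average the misclassification rate over folds for each dimension r.
fold_means <- aggregate(lhat ~ r, data = folds, FUN = mean)

# The optimal dimension is the one with the lowest average rate.
best <- fold_means[which.min(fold_means$lhat), ]
optimal_r <- best$r
optimal_lhat <- best$lhat
```

Here `r = 10` has the lowest fold-averaged rate, so it would be the selected dimension.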

Usage

lol.xval.optimal_dimselect(
  X,
  Y,
  rs,
  alg,
  sets = NULL,
  alg.dimname = "r",
  alg.opts = list(),
  alg.embedding = "A",
  alg.structured = TRUE,
  classifier = lda,
  classifier.opts = list(),
  classifier.return = "class",
  k = "loo",
  rank.low = FALSE,
  ...
)

Arguments

`X`	`[n, d]` the data with `n` samples in `d` dimensions.
`Y`	`[n]` the labels of the samples with `K` unique labels.
`rs`	`[r.n]` the embedding dimensions to investigate over, where `max(rs) <= d`.
`alg`	the algorithm to use for embedding. Should be a function that accepts inputs `X` and `Y` and embedding dimension `r` if `alg` is supervised, or just `X` and embedding dimension `r` if `alg` is unsupervised. This algorithm should return a list containing a matrix that embeds from `d` to `r < d` dimensions.
`sets`	a user-defined cross-validation set. Defaults to `NULL`. If `is.null(sets)`, the inputs `X` and `Y` are randomly partitioned into training and testing sets. If `!is.null(sets)`, the user-defined partitioning of `X` and `Y` into training and testing sets is used. It should be in the format of the outputs from `lol.xval.split`: a `list` with each element containing `X.train`, an `[n-k, d]` subset of data to train on; `Y.train`, an `[n-k]` subset of class labels for `X.train`; `X.test`, a `[k, d]` subset of data to test the model on; and `Y.test`, a `[k]` subset of class labels for `X.test`.
`alg.dimname`	the name of the parameter accepted by `alg` for indicating the embedding dimensionality desired. Defaults to `"r"`.
`alg.opts`	the hyper-parameter options to pass to your algorithm as a keyworded list. Defaults to `list()`, or no hyper-parameters. This should not include the number of embedding dimensions, `r`, which is passed separately in the `rs` vector.
`alg.embedding`	the attribute returned by `alg` containing the embedding matrix. Defaults to assuming that `alg` returns an embedding matrix as `"A"`. If `!is.nan(alg.embedding)`, assumes that `alg` will return a list containing an attribute, `alg.embedding`, a `[d, r]` matrix that embeds `[n, d]` data from `[d]` to `[r < d]` dimensions. If `is.nan(alg.embedding)`, assumes that `alg` returns a `[d, r]` matrix that embeds `[n, d]` data from `[d]` to `[r < d]` dimensions.
`alg.structured`	a boolean to indicate whether the embedding matrix is structured. Provides a performance increase by not having to compute the embedding matrix `xv` times if unnecessary. Defaults to `TRUE`. `TRUE` assumes that if `Ar: R^d -> R^r` embeds from `d` to `r` dimensions and `Aq: R^d -> R^q` from `d` to `q > r` dimensions, then `Aq[, 1:r] == Ar`. `FALSE` assumes that if `Ar: R^d -> R^r` embeds from `d` to `r` dimensions and `Aq: R^d -> R^q` from `d` to `q > r` dimensions, then `Aq[, 1:r] != Ar`.
`classifier`	the classifier to use for assessing performance. The classifier should accept `X`, an `[n, d]` array, and `Y`, an `[n]` array of labels, as its first 2 arguments. The class should implement a predict function, `predict.classifier`, that is compatible with the `stats::predict` `S3` method. Defaults to `MASS::lda`.
`classifier.opts`	any extraneous options to be passed to the classifier function, as a list. Defaults to an empty list.
`classifier.return`	if the return type is a list, `class` encodes the attribute containing the prediction labels from `stats::predict`. Defaults to the return type of `MASS::lda`, `class`. If `!is.nan(classifier.return)`, assumes that `predict.classifier` will return a list containing an attribute, `classifier.return`, that encodes the predicted labels. If `is.nan(classifier.return)`, assumes that `predict.classifier` returns an `[n]` vector/array containing the prediction labels for `[n, d]` inputs.
`k`	the cross-validation method to perform. Defaults to `'loo'`. If `sets` is provided, this option is ignored. See `lol.xval.split` for details. If `'loo'`, performs leave-one-out cross-validation. If `isinteger(k)`, performs `k`-fold cross-validation with `k` as the number of folds.
`rank.low`	whether to force the training set to low-rank. Defaults to `FALSE`. If `sets` is provided, this option is ignored. See `lol.xval.split` for details. If `rank.low == FALSE`, uses the default cross-validation method with standard `k`-fold validation: training sets are `k-1` folds and testing sets are `1` fold, where the fold held out for testing is rotated to ensure no dependence of potential downstream inference in the cross-validated misclassification rates. If `rank.low == TRUE`, uses a cross-validation method with `ntrain = min((k-1)/k*n, d)` sample training sets, where `d` is the number of dimensions in `X`. This ensures that the training data is always low-rank, `ntrain < d + 1`. Note that the resulting training sets may have `ntrain < (k-1)/k*n`, but the resulting testing sets will always be properly rotated, with `ntest = n/k`, to ensure no dependencies in fold-wise testing.
`...`	trailing args.
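
As an illustration of the `sets` format described above, the following base-R sketch builds a 2-fold partition by hand. It assumes each list element carries exactly the fields `X.train`, `Y.train`, `X.test` and `Y.test`; whether `lol.xval.split` attaches anything further is not covered here, and the toy data is invented for the example.

```r
# Toy data: n = 10 samples in d = 3 dimensions with two classes.
set.seed(7)
n <- 10; d <- 3; k <- 2
X <- matrix(rnorm(n * d), nrow = n)
Y <- rep(c(0, 1), length.out = n)

# Randomly assign each sample to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

# One list element per fold; the held-out fold is rotated so that
# every sample is tested exactly once.
sets <- lapply(seq_len(k), function(i) {
  test <- fold_id == i
  list(
    X.train = X[!test, , drop = FALSE],  # training data
    Y.train = Y[!test],                  # training labels
    X.test  = X[test, , drop = FALSE],   # held-out data
    Y.test  = Y[test]                    # held-out labels
  )
})
```

A list built this way has the same shape as the one the `sets` argument is documented to accept, in which case `k` and `rank.low` are ignored.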

Value

Returns a list containing:

`folds.data`	the results, as a data-frame, of the per-fold misclassification rate.
`foldmeans.data`	the results, as a data-frame, of the average misclassification rate for each `r`.
`optimal.lhat`	the classification error of the optimal `r`.
`optimal.r`	the optimal number of embedding dimensions from `rs`.
`model`	the model trained on all of the data at the optimal number of embedding dimensions.
`classifier`	the classifier trained on all of the data at the optimal number of embedding dimensions.

Details

For more details see the help vignette: vignette("xval", package = "lolR")

For extending cross-validation techniques shown here to arbitrary embedding algorithms, see the vignette: vignette("extend_embedding", package = "lolR")
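
As a rough idea of what a custom embedding function might look like, the sketch below implements an unsupervised PCA-style embedding in base R that follows the interface described in Arguments: it takes `X` and a dimension `r`, and returns a list whose `A` element is a `[d, r]` matrix. The name `pca_embed` is hypothetical, and the trailing `...` is an assumption added so that extra arguments are tolerated; the vignette remains the authoritative guide.

```r
# Hypothetical unsupervised embedding: the top-r principal directions of X.
# Returns list(A = <[d, r] matrix>), matching alg.embedding = "A".
pca_embed <- function(X, r, ...) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center each column
  V <- svd(Xc, nu = 0, nv = r)$v                # [d, r] right singular vectors
  list(A = V)
}

set.seed(1)
X <- matrix(rnorm(50 * 6), nrow = 50, ncol = 6)

A2 <- pca_embed(X, r = 2)$A
A4 <- pca_embed(X, r = 4)$A

# For this embedding the first columns of the larger matrix coincide with
# the smaller one, which is the property alg.structured = TRUE relies on.
structured <- isTRUE(all.equal(A4[, 1:2], A2))

# Embed the data from d = 6 to r = 2 dimensions.
Xr <- X %*% A2
```

Because the dimension argument is named `r` and the matrix is returned as `A`, such a function would line up with the defaults of `alg.dimname` and `alg.embedding`.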

For extending cross-validation techniques shown here to arbitrary classification algorithms, see the vignette: vignette("extend_classification", package = "lolR")
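
A custom classifier needs a fitting function taking `X` and `Y` as its first two arguments plus an S3 `predict` method. The following is a minimal base-R sketch of a nearest-mean classifier written to that contract; the class name `nearest_mean` is made up for this example and is not part of lolR. Its `predict` method returns a list with a `class` element, mirroring the default `classifier.return = "class"`.

```r
# Hypothetical classifier: stores one centroid per class.
nearest_mean <- function(X, Y, ...) {
  classes <- sort(unique(Y))
  centroids <- do.call(rbind, lapply(classes, function(cl) {
    colMeans(X[Y == cl, , drop = FALSE])
  }))
  structure(list(centroids = centroids, classes = classes),
            class = "nearest_mean")
}

# S3 predict method: assign each row of X to the closest centroid.
predict.nearest_mean <- function(object, X, ...) {
  d2 <- sapply(seq_len(nrow(object$centroids)), function(i) {
    rowSums(sweep(X, 2, object$centroids[i, ], "-")^2)
  })
  d2 <- matrix(d2, nrow = nrow(X))  # keep [n, K] shape even when n = 1
  list(class = object$classes[max.col(-d2, ties.method = "first")])
}

# Two well-separated toy classes.
X <- rbind(cbind(c(-3, -2, -4), c(-3, -4, -2)),
           cbind(c( 3,  2,  4), c( 3,  4,  2)))
Y <- c(1, 1, 1, 2, 2, 2)

fit <- nearest_mean(X, Y)
Yhat <- predict(fit, X)$class
err <- mean(Yhat != Y)  # training misclassification rate
```

A classifier whose `predict` method returned the label vector directly, rather than a list, would correspond to the `classifier.return = NaN` case described in Arguments.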

Author(s)

Eric Bridgeford

Examples

# train model and analyze with loo validation using the nearestCentroid classifier
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
# run leave-one-out cross-validation with the nearestCentroid classifier;
# its predict method returns only the prediction labels, so we
# specify classifier.return as NaN
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol,
                          classifier=lol.classify.nearestCentroid,
                          classifier.return=NaN, k='loo')

# train model and analyze with 5-fold validation using lda classifier
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, k=5)

# pass in existing cross-validation sets
sets <- lol.xval.split(X, Y, k=2)
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, sets=sets)

[Package lolR version 2.1 Index]