opdisDownsampling {opdisDownsampling}R Documentation

Optimal Distribution Preserving Down-Sampling of Bio-Medical Data

Description

The package provides the necessary functions for optimal distribution-preserving down-sampling of large (bio-medical) data sets.

Usage

opdisDownsampling(Data, Cls, Size, Seed, nTrials = 1000,
TestStat = "ad", MaxCores = getOption("mc.cores", 2L), PCAimportance = FALSE)

Arguments

Data

the (numerical!) data as a vector, matrix or data frame.

Cls

the class information, if any, as a vector of similar length as instances in the data.

Size

the total number of instances across all classes to be drawn.

Seed

a predefined seed to modify the results.

nTrials

how many samples to choose from should be randomly drawn.

TestStat

statistical criterion for similarity judgment.

MaxCores

maximum number of cpu cores to use for parallel computing.

PCAimportance

PCA based feature selection; only variables important in PCA projection are considered.

Value

Returns a list of data containing the drawn samples and the omitted data.

ReducedData

the selected sample data and class information.

ReducedData

the not-selected sample data and class information.

ReducedInstances

the instance numbers of the selected sample data.

Author(s)

Jorn Lotsch

References

Lotsch, J., Malkusch, S., Ultsch, A. (2021): Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling). PLoS One. 2021 Aug 5;16(8):e0255838. doi: 10.1371/journal.pone.0255838. eCollection 2021.

Examples

## example 1
data(iris)
Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
  Size = 50, MaxCores = 1)

[Package opdisDownsampling version 1.0.1 Index]