ovun.sample {ROSE}

R Documentation

Over-sampling, under-sampling, combination of over- and under-sampling.

Description

Creates possibly balanced samples by randomly over-sampling minority examples, under-sampling majority examples, or a combination of over- and under-sampling.
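As a rough illustration of the two resampling ideas named above, the following base R sketch balances a toy data set by hand. It is not the package's implementation: the data frame `toy` and all variable names are made up for this example, and the exact way `ovun.sample` draws its samples may differ.

```r
# Toy data (hypothetical): 95 majority (0) and 5 minority (1) examples.
set.seed(1)
toy <- data.frame(x = rnorm(100),
                  cls = factor(rep(c(0, 1), times = c(95, 5))))

idx_major <- which(toy$cls == 0)
idx_minor <- which(toy$cls == 1)

# Random over-sampling: keep every majority row and draw minority rows
# with replacement until both classes have the same count.
over_idx <- c(idx_major,
              sample(idx_minor, length(idx_major), replace = TRUE))
toy_over <- toy[over_idx, ]

# Random under-sampling: keep every minority row and draw majority rows
# without replacement down to the minority count.
under_idx <- c(sample(idx_major, length(idx_minor)), idx_minor)
toy_under <- toy[under_idx, ]

table(toy_over$cls)
table(toy_under$cls)
```

Over-sampling repeats minority rows, so the result contains duplicates. Under-sampling discards majority rows, so the result is smaller than the original data.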

Usage

ovun.sample(formula, data, method="both", N, p=0.5, 
            subset=options("subset")$subset,
            na.action=options("na.action")$na.action, seed)

Arguments

`formula`	An object of class `formula` (or one that can be coerced to that class). See `ROSE` for information about interaction among predictors or their transformations.
`data`	An optional data frame, list or environment (or object coercible to a data frame by `as.data.frame`) in which to preferentially interpret `formula`. If not specified, the variables are taken from `environment(formula)`.
`method`	One among `c("over", "under", "both")` to perform over-sampling of minority examples, under-sampling of majority examples, or a combination of over- and under-sampling, respectively.
`N`	The desired sample size of the resulting data set. If missing and `method` is either `"over"` or `"under"`, the sample size is determined by over-sampling or, respectively, under-sampling examples so that the minority class occurs approximately in proportion `p`. When `method = "both"`, the default value is the length of the vectors specified in `formula`.
`p`	The probability of resampling from the rare class. If missing and `method` is either `"over"` or `"under"`, this proportion is determined by over-sampling or, respectively, under-sampling examples so that the sample size is equal to `N`. When `method = "both"`, the default value is 0.5.
`subset`	An optional vector specifying a subset of observations to be used in the sampling process. The default is set by the `subset` setting of `options`.
`na.action`	A function which indicates what should happen when the data contain 'NA's. The default is set by the `na.action` setting of `options`.
`seed`	A single value, interpreted as an integer. Specifying a seed is recommended so that the sample can be traced and reproduced.
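The interplay between `N` and `method` described above can be tried out as follows. This is a minimal sketch that assumes the `hacide` data shipped with ROSE; the helper variables `n_major` and `n_minor` and the chosen values of `N` are illustrative only.

```r
library(ROSE)
data(hacide)

# class counts in the training set
n_major <- max(table(hacide.train$cls))
n_minor <- min(table(hacide.train$cls))

# over-sampling to a requested total size N
over_N <- ovun.sample(cls ~ ., data = hacide.train, method = "over",
                      N = 2 * n_major, seed = 1)$data

# under-sampling to a requested total size N
under_N <- ovun.sample(cls ~ ., data = hacide.train, method = "under",
                       N = 2 * n_minor, seed = 1)$data

# resulting sample sizes and class distributions
nrow(over_N)
table(over_N$cls)
nrow(under_N)
table(under_N$cls)
```

Passing `p` instead of `N` leaves the sample size to be determined by the function, so the class counts are then only approximately in the requested proportion.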

Value

The value is an object of class `ovun.sample` with the following components:

`Call`	The matched call.
`method`	The method used to balance the sample. Possible choices are `c("over", "under", "both")`.
`data`	The resulting new data set.
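The components listed above can be inspected on a returned object, for instance as in the sketch below. It assumes the `hacide` data shipped with ROSE; the object name `res` is arbitrary.

```r
library(ROSE)
data(hacide)

# keep the whole returned object instead of extracting $data directly
res <- ovun.sample(cls ~ ., data = hacide.train, method = "both",
                   N = nrow(hacide.train), p = 0.5, seed = 1)

class(res)        # class of the returned object
names(res)        # names of its components
res$method        # the balancing method that was used
head(res$data)    # first rows of the resampled data set
```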

Examples


# 2-dimensional example
# load the package and the example data
library(ROSE)
data(hacide)

# imbalance on training set
table(hacide.train$cls)

# balanced data set with both over and under sampling
data.balanced.ou <- ovun.sample(cls~., data=hacide.train,
                                N=nrow(hacide.train), p=0.5, 
                                seed=1, method="both")$data

table(data.balanced.ou$cls)

# balanced data set with over-sampling
data.balanced.over <- ovun.sample(cls~., data=hacide.train, 
                                  p=0.5, seed=1, 
                                  method="over")$data

table(data.balanced.over$cls)
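The remaining value of `method`, `"under"`, can be used in the same way. The sketch below also repeats the call with the same `seed` to check that the sample is reproduced; the object names are illustrative.

```r
library(ROSE)
data(hacide)

# balanced data set with under-sampling
data.balanced.under <- ovun.sample(cls~., data=hacide.train,
                                   p=0.5, seed=1,
                                   method="under")$data

table(data.balanced.under$cls)

# repeating the call with the same seed should return the same sample
data.balanced.under2 <- ovun.sample(cls~., data=hacide.train,
                                    p=0.5, seed=1,
                                    method="under")$data

identical(data.balanced.under, data.balanced.under2)
```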