rfCDF {rgnoisefilt}

R Documentation

Covering Distance Filtering for Regression

Description

Applies the rfCDF noise filtering method to a regression dataset.

Usage

## Default S3 method:
rfCDF(x, y, subsets = 5, VCdim = 0.1 * nrow(x), prob = 0.05, ...)

## S3 method for class 'formula'
rfCDF(formula, data, ...)

Arguments

`x`	a data frame of input attributes.
`y`	a double vector with the output regressand of each sample.
`subsets`	an integer with the number of subsets to be used (default: 5).
`VCdim`	an integer specifying the VC-dimension (default: 0.1*`nrow(x)`).
`prob`	a double with the probability used in the filtering process (default: 0.05).
`...`	other options to pass to the function.
`formula`	a formula with the output regressand and, at least, one input attribute.
`data`	a data frame in which to interpret the variables in the formula.
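
As a minimal sketch of how the arguments above fit together, the following passes each documented argument explicitly on the `rock` dataset. The specific values are arbitrary and chosen only for illustration, not recommended settings. The documented default for `VCdim`, 0.1 * `nrow(x)`, evaluates to 4.8 for the 48-row `rock` data, so it is not necessarily a whole number. The call is guarded so the snippet also runs when rgnoisefilt is not installed.

```r
# Illustrative only: explicit values for the documented arguments.
data(rock)
x <- rock[, -ncol(rock)]   # input attributes (area, peri, shape)
y <- rock[, ncol(rock)]    # output regressand (perm)

# The documented default for VCdim is 0.1 * nrow(x).
default_vcdim <- 0.1 * nrow(x)

if (requireNamespace("rgnoisefilt", quietly = TRUE)) {
  set.seed(9)
  out <- rgnoisefilt::rfCDF(x = x, y = y,
                            subsets = 5,   # documented default, made explicit
                            VCdim = 5,     # arbitrary illustrative value
                            prob = 0.1)    # arbitrary illustrative value
  print(out$param)  # argument values recorded in the result
}
```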

Details

CDF splits the dataset into two subsets, Din and Dout, containing the samples inside and outside the covering interval, respectively. Samples in Din are considered to have low noise and are kept in the final clean set. The noise of each sample is then estimated with the Covering Distance function. Samples in Dout are candidates for removal: they are removed one at a time in decreasing order of absolute noise, and an objective function is evaluated after each removal. Removal stops at the point where the objective function reaches its maximum.
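
The ordered-removal step described above can be sketched in base R as follows. This is a hypothetical illustration, not the rgnoisefilt implementation: the function `greedy_removal`, the toy noise values, the rule used to define the covering interval, and the toy objective are all assumptions introduced here. The actual Covering Distance function and objective are defined in the referenced paper.

```r
# Hypothetical sketch of ordered removal; NOT the package's implementation.
# noise:       estimated noise of each sample
# in_interval: TRUE for samples inside the covering interval (Din)
# objective:   function(removed_ids) returning a score to maximise
greedy_removal <- function(noise, in_interval, objective) {
  # Candidates are the samples outside the covering interval (Dout),
  # ordered so that the largest absolute noise is removed first.
  out_ids <- which(!in_interval)
  ord <- out_ids[order(abs(noise[out_ids]), decreasing = TRUE)]

  # Objective after removing the first k candidates, for k = 0, 1, ..., |Dout|.
  scores <- vapply(0:length(ord),
                   function(k) objective(ord[seq_len(k)]),
                   numeric(1))

  # Stop at the number of removals that maximises the objective.
  k_best <- which.max(scores) - 1L
  removed <- ord[seq_len(k_best)]
  list(idnoise = sort(removed),
       idclean = setdiff(seq_along(noise), removed),
       scores  = scores)
}

# Toy data (assumed): six samples with made-up noise estimates.
noise <- c(0.1, -3, 0.2, 2, -0.5, 0.05)
in_interval <- abs(noise) <= 0.3   # assumed stand-in for the covering interval

# Toy objective (assumed): penalise remaining squared noise and each removal.
toy_objective <- function(removed) {
  kept <- setdiff(seq_along(noise), removed)
  -sum(noise[kept]^2) - length(removed)
}

res <- greedy_removal(noise, in_interval, toy_objective)
res$idnoise   # samples 2 and 4 are removed under this toy objective
```

With these toy values, sample 5 lies outside the assumed interval but is kept, because removing it would lower the toy objective. This mirrors the idea that not every sample in Dout has to be discarded.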

Value

The regression filter removes the samples it identifies as noisy and returns a reduced dataset containing the remaining clean samples. The function returns an object of class rfdata, a list that describes the noise filtering process through the following elements:

`xclean`	a data frame with the input attributes of clean samples (without errors).
`yclean`	a double vector with the output regressand of clean samples (without errors).
`numclean`	an integer with the number of clean samples.
`idclean`	an integer vector with the indices of clean samples.
`xnoise`	a data frame with the input attributes of noisy samples (with errors).
`ynoise`	a double vector with the output regressand of noisy samples (with errors).
`numnoise`	an integer with the number of noisy samples.
`idnoise`	an integer vector with the indices of noisy samples.
`filter`	the full name of the noise filter used.
`param`	a list of the argument values.
`call`	the function call.

Note that objects of the class rfdata support print.rfdata, summary.rfdata and plot.rfdata methods.
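
As a sketch of how the returned elements might be used downstream, the helper below rebuilds a modelling data frame from `xclean` and `yclean`. The helper `clean_frame` is hypothetical and not part of rgnoisefilt. It only assumes a list with the documented `xclean` and `yclean` elements. The guarded part runs only when the package is installed.

```r
# Hypothetical helper (not part of rgnoisefilt): combine the clean inputs
# and the clean output of an rfdata-like list into one data frame.
clean_frame <- function(res, yname = "y") {
  stopifnot(nrow(res$xclean) == length(res$yclean))
  df <- res$xclean
  df[[yname]] <- res$yclean
  df
}

if (requireNamespace("rgnoisefilt", quietly = TRUE)) {
  data(rock)
  set.seed(9)
  out <- rgnoisefilt::rfCDF(formula = perm ~ ., data = rock)

  print(out)                                   # dispatches to print.rfdata
  rock_clean <- clean_frame(out, yname = "perm")
  fit <- lm(perm ~ ., data = rock_clean)       # model fitted on clean samples only
}
```

Keeping `idclean` and `idnoise` alongside the rebuilt frame makes it possible to trace each retained or removed sample back to its row in the original data.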

References

G. Jiang, W. Wang, Y. Qian, J. Liang, A Unified Sample Selection Framework for Output Noise Filtering: An Error-Bound Perspective. Journal of Machine Learning Research, 22:1–65, 2021.

Examples

# load the dataset
data(rock)

# usage of the default method
set.seed(9)
out.def <- rfCDF(x = rock[,-ncol(rock)], y = rock[,ncol(rock)])

# show results
summary(out.def, showid = TRUE)

# usage of the method for class formula
set.seed(9)
out.frm <- rfCDF(formula = perm ~ ., data = rock)

# check the match of noisy indices
all(out.def$idnoise == out.frm$idnoise)

[Package rgnoisefilt version 1.1.2 Index]