discretization {abn} | R Documentation |

## Discretization of a Possibly Continuous Data Frame of Random Variables Based on Their Distribution

### Description

This function discretizes a data frame of possibly continuous random variables using a set of discretization rules. The discretization algorithms are unsupervised and univariate. See Details for the complete list of discretization rules (the number of states of each random variable can also be provided directly).

### Usage

```
discretization(data.df = NULL,
data.dists = NULL,
discretization.method = "sturges",
nb.states = FALSE)
```

### Arguments

`data.df` |
a data frame containing the data to discretize. Binary and multinomial variables must be declared as factors; all other variables must be numeric vectors. The data frame must be named. |

`data.dists` |
a named list giving the distribution for each node in the network. |

`discretization.method` |
a character vector giving the discretization method to use; see details. If a number is provided, the variable will be discretized by equal binning. |

`nb.states` |
logical variable to select the output. If set to |

### Details

`fd`

Freedman-Diaconis rule. `IQR()` stands for the interquartile range.
The number of bins is given by

`\frac{range(x) * n^{1/3}}{2 * IQR(x)}`

The Freedman-Diaconis rule is known to be less sensitive to outliers than Scott's rule, because it relies on the interquartile range rather than the standard deviation.
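As an illustration, the Freedman-Diaconis formula can be evaluated in base R. This is a minimal sketch, not the package's implementation: `fd_bins` is a hypothetical helper, and rounding up to a whole number of bins is an assumption.

```r
# Hypothetical helper (not part of abn): Freedman-Diaconis bin count,
# range(x) * n^(1/3) / (2 * IQR(x)).
fd_bins <- function(x) {
  n <- length(x)
  span <- diff(range(x))               # range(x) as a single number
  k <- span * n^(1/3) / (2 * IQR(x))   # raw, possibly fractional, bin count
  ceiling(k)                           # assumption: round up to whole bins
}

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
fd_bins(x)
```

Because the IQR ignores the tails of the sample, a single extreme value changes `range(x)` but leaves the denominator almost untouched.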

`doane`

Doane's rule.
The number of bins is given by

`1 + \log_{2}(n) + \log_{2}\left(1 + \frac{|g|}{\sigma_{g}}\right)`

This is a modification of Sturges' formula, which attempts to improve its performance with non-normal data.
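The symbols `g` and `\sigma_{g}` are not defined on this page. The sketch below assumes the usual definitions for Doane's formula: `g` is the moment-based sample skewness and `\sigma_{g} = \sqrt{6(n-2) / ((n+1)(n+3))}`. `doane_bins` is a hypothetical helper, and the package may use a different skewness estimator.

```r
# Hypothetical helper (not part of abn): Doane's bin count.
# Assumes g is the moment-based sample skewness and
# sigma_g = sqrt(6 * (n - 2) / ((n + 1) * (n + 3))).
doane_bins <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  g <- mean(d^3) / mean(d^2)^(3 / 2)                  # sample skewness
  sigma_g <- sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))  # its standard error
  1 + log2(n) + log2(1 + abs(g) / sigma_g)            # left unrounded here
}

set.seed(1)
skewed <- rexp(100)   # right-skewed sample
doane_bins(skewed)    # larger than Sturges' 1 + log2(100)
```

For perfectly symmetric data `g` is zero, the last term vanishes, and the result reduces to Sturges' formula.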

`sqrt`

The number of bins is given by:

`\sqrt{n}`

`cencov`

Cencov's rule.
The number of bins is given by:

`n^{1/3}`

`rice`

Rice's rule.
The number of bins is given by:

`2 n^{1/3}`

`terrell-scott`

Terrell-Scott's rule.
The number of bins is given by:

`(2 n)^{1/3}`

The Cencov, Rice, and Terrell-Scott rules are known to over-estimate the number of bins k compared to other rules, owing to their simplicity.

`sturges`

Sturges's rule.
The number of bins is given by:

`1 + \log_2(n)`
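The rules that depend only on the sample size `n` (`sqrt`, `cencov`, `rice`, `terrell-scott`, `sturges`) can be compared side by side. The snippet below simply evaluates the formulas as printed for `n = 100`; it does not call the package, and any rounding the package applies is not shown.

```r
# Bin counts from the n-only rules, evaluated for a sample of size 100.
n <- 100
bins <- c(
  sqrt            = sqrt(n),          # 10
  cencov          = n^(1/3),          # about 4.64
  rice            = 2 * n^(1/3),      # about 9.28
  `terrell-scott` = (2 * n)^(1/3),    # about 5.85
  sturges         = 1 + log2(n)       # about 7.64
)
round(bins, 2)
```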

`scott`

Scott's rule.
The number of bins is given by:

`\frac{range(x)}{\sigma(x) n^{-1/3}}`
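Once a rule has produced a bin count, the variable still has to be cut into that many intervals. The sketch below shows one way to do equal-width binning in base R with `cut()`. `equal_width_bins` is a hypothetical helper, and rounding the bin count up is an assumption; how `discretization()` performs this step internally is not stated here.

```r
# Hypothetical helper (not part of abn): equal-width binning into k bins.
equal_width_bins <- function(x, k) {
  k <- max(2L, as.integer(ceiling(k)))   # cut() needs at least 2 intervals
  cut(x, breaks = k, include.lowest = TRUE)
}

set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)
k <- 1 + log2(length(x))                 # Sturges' rule, about 7.64
binned <- equal_width_bins(x, k)

nlevels(binned)                          # 8 bins after rounding up
table(binned)                            # counts per bin
```

The table of counts is the kind of frequency summary that an entropy computation would take as input.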

### Value

The discretized data frame, or a list containing the table of counts for each bin of the discretized data frame.

### References

Garcia, S., et al. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. *IEEE Transactions on Knowledge and Data Engineering*, 25.4, 734-750.

Cebeci, Z. and Yildiz, F. (2017). Unsupervised Discretization of Continuous Variables in a Chicken Egg Quality Traits Dataset. *Turkish Journal of Agriculture-Food Science and Technology*, 5.4, 315-320.

### Examples

```
## Load the package providing discretization() and entropyData()
library(abn)

## Generate random variable
rv <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list("gaussian")
names(dist) <- c("rv")

## Compute the entropy through discretization
entropyData(freqs.table = discretization(data.df = rv, data.dists = dist,
                                         discretization.method = "sturges",
                                         nb.states = FALSE))
```

[Package *abn* version 3.1.1]