genpolygon {kpeaks}    R Documentation

Generate the Classes to Build a Frequency Polygon

Description

Constructs the histogram of a feature using a selected binning rule and returns the middle values and frequencies of the classes for further work on the frequency polygon.

Usage

genpolygon(x, binrule, nbins, disp = FALSE)

Arguments

x

a numeric vector containing the observations for a feature.

binrule

name of the rule used to compute the number of bins for the histogram.

nbins

an integer giving the number of bins, computed with the selected binning rule. The default rule is sturges. Depending on the selected rule, nbins equals (in the formulae, n is the number of observations):

  • floor(sqrt(n)) if the rule is sqr,

  • ceiling(1+log(n, 2)) if the rule is sturges,

  • ceiling(1+3.332*log(n, 10)) if the rule is huntsberger,

  • ceiling(5*log(n, 10)) if the rule is bc,

  • ceiling(n^(1/3)) if the rule is cencov,

  • ceiling(2*n^(1/3)) if the rule is rice,

  • ceiling((2*n)^(1/3)) if the rule is ts,

  • ceiling(((max(x)-min(x))/(3.5*sqrt(var(x))*n^(-1/3)))) if the rule is scott,

  • ceiling(((max(x)-min(x))/(2*IQR(x)*n^(-1/3)))) if the rule is fd,

  • ceiling(1+log(n,2)+log(1+abs(skewness(x))/(6*(n-2)/((n+1)*(n+3))^0.5),2)) if the rule is doane,

  • ceiling(log(n)/2*pi) if the rule is cebeci,

  • a user-specified integer if the rule is usr.

disp

a logical value; set it to TRUE to display the histogram. The default is FALSE.

Details

According to Hyndman (1995), Sturges's rule was the first rule for calculating k, the number of classes used to build a histogram. Most statistical packages use this simple rule to determine the number of classes when constructing histograms. Brooks & Carruthers (1953) proposed a rule that uses log_{10} instead of log_{2} and always gives a larger k than Sturges's rule. The rule by Huntsberger (1962) yields results nearly equal to those of Sturges's rule. These two rules work well if n is less than 200. Scott (1992) argued that Sturges's rule generates oversmoothed histograms when n is large. In his rule, Cencov (1962) simply used the cube root of n. This rule was followed by its extensions, i.e., the Rice rule and the Terrell & Scott (1985) rule. Compared to the others, the square root rule produces a larger k (Davies & Goldsmith, 1980).
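As a minimal sketch of how the n-only rules compare, the following base R code evaluates the expressions listed under nbins for a hypothetical sample size of n = 150. The vector name k is illustrative only and is not part of kpeaks.

```r
# Hypothetical sample size chosen for illustration only
n <- 150

# n-only rules, transcribed from the formulae listed under 'nbins'
k <- c(
  sqr         = floor(sqrt(n)),
  sturges     = ceiling(1 + log(n, 2)),
  huntsberger = ceiling(1 + 3.332 * log(n, 10)),
  bc          = ceiling(5 * log(n, 10)),
  cencov      = ceiling(n^(1/3)),
  rice        = ceiling(2 * n^(1/3)),
  ts          = ceiling((2 * n)^(1/3))
)
print(k)
```

For this n, the sturges and huntsberger rules give the same k, while bc gives a larger one and sqr gives the largest.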

Most of the rules take only n as input. In contrast, rules that use the variation and shape of the data distribution can provide better k values. For instance, Doane (1976) extended Sturges's rule by adding the standardized skewness to address the problem that non-normal distributions need more classes. To estimate optimal k values, Scott (1979) added the standard deviation to his formula. Freedman and Diaconis (1981) proposed using the interquartile range (IQR), which is less sensitive to outliers than the standard deviation. In a study on unsupervised discretization methods, Cebeci & Yildiz (2017) tested a binning rule based on the base-10 logarithm of n divided by 2*pi. They also argued that the Freedman-Diaconis and Doane rules performed slightly better than the other rules, based on training model accuracies on a chicken egg quality traits dataset. Therefore, using these rules may be more effective in determining the peaks of a frequency polygon.
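The data-dependent rules can be sketched in base R as follows, assuming the scott and fd expressions listed under nbins and using the second feature of the built-in iris dataset. The variable names are illustrative, and this is not necessarily how genpolygon computes the values internally.

```r
# Second feature of iris (Sepal.Width), as used in the examples on this page
x <- iris[, 2]
n <- length(x)
rng <- max(x) - min(x)

# Scott: bin width based on the standard deviation
k_scott <- ceiling(rng / (3.5 * sqrt(var(x)) * n^(-1/3)))

# Freedman-Diaconis: bin width based on the interquartile range
k_fd <- ceiling(rng / (2 * IQR(x) * n^(-1/3)))

print(c(scott = k_scott, fd = k_fd))
```

On this feature the IQR-based bin width is narrower than the one based on the standard deviation, so fd yields more bins than scott.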

Value

xm

a numeric vector containing the middle values of bins.

xc

an integer vector containing the frequencies of the bins.

nbins

an integer giving the number of bins used to build the histogram.
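To illustrate what the returned components represent, the following sketch derives comparable middle values and frequencies with base R's hist(). It assumes equal-width bins spanning the range of x; the names nb, brk and h and the way the breaks are built are assumptions for illustration and may differ from the package's internal implementation.

```r
set.seed(1)                       # reproducible illustration data
x  <- rnorm(100, mean = 5, sd = 0.5)
nb <- 8                           # hypothetical number of bins

# Equal-width breaks; an explicit vector makes hist() use exactly nb bins
brk <- seq(min(x), max(x), length.out = nb + 1)
h   <- hist(x, breaks = brk, plot = FALSE)

xm <- h$mids                      # middle values of the bins (analogous to xm)
xc <- h$counts                    # frequencies of the bins (analogous to xc)

# Frequency polygon: join the (middle value, frequency) pairs with lines
plot(xm, xc, type = "b", xlab = "x", ylab = "Frequency")
```

Passing an explicit vector of breaks matters here because a single number given to breaks is treated by hist() only as a suggestion.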

Author(s)

Zeynel Cebeci, Cagatay Cebeci

References

Brooks C E P & Carruthers N (1953). Handbook of statistical methods in meteorology. H M Stationery Office, London.

Cebeci Z & Yildiz F (2017). Unsupervised discretization of continuous variables in a chicken egg quality traits dataset. Turk. J Agriculture-Food Sci. & Tech. 5(4): 315-320. doi: 10.24925/turjaf.v5i4.315-320.1056.

Cebeci Z & Cebeci C (2018). "A novel technique for fast determination of K in partitioning cluster analysis", Journal of Agricultural Informatics, 9(2), 1-11. doi: 10.17700/jai.2018.9.2.442.

Cebeci Z & Cebeci C (2018). "kpeaks: An R package for quick selection of k for cluster analysis", In 2018 Int. Conf. on Artificial Intelligence and Data Processing (IDAP), IEEE. doi: 10.1109/IDAP.2018.8620896.

Cencov N N (1962). Evaluation of an unknown distribution density from observations. Soviet Mathematics 3: 1559-1562.

Davies O L & Goldsmith P L (1980). Statistical methods in research and production. 4th edn, Longman: London.

Doane D P (1976). Aesthetic frequency classification. American Statistician 30(4):181-183.

Freedman D & Diaconis P (1981). On the histogram as a density estimator: L2 Theory. Zeit. Wahr. ver. Geb. 57(4):453-476.

Hyndman R J (1995). The problem with Sturges rule for constructing histograms. url:http://robjhyndman.com/papers/sturges.pdf.

Huntsberger D V (1962). Elements of statistical inference. London: Prentice-Hall.

Scott D W (1992). Multivariate density estimation: Theory, Practice and Visualization. John Wiley & Sons: New York.

Sturges H (1926). The choice of a class-interval. J Amer. Statist. Assoc. 21(153):65-66.

Terrell G R & Scott D W (1985). Oversmoothed nonparametric density estimates. J Amer. Statist. Assoc. 80(389):209-214.

See Also

findk, findpolypeaks, plotpolygon

Examples

library(kpeaks)
set.seed(123)  # make the random sample reproducible
x <- rnorm(n=100, mean=5, sd=0.5)
# Construct the histogram of x according to the Sturges rule with no display
hvals <- genpolygon(x, binrule = "sturges")
print(hvals)

# Plot the histogram of x by using the user-specified number of classes
hvals <- genpolygon(x, binrule = "usr", nbins = 20, disp = TRUE)
print(hvals)

# Plot the histogram of the second feature in iris dataset 
# by using the Freedman-Diaconis (fd) rule
data(iris)
hvals <- genpolygon(iris[,2], binrule = "fd", disp = TRUE)
print(hvals)

[Package kpeaks version 1.1.0]