R: Obtain Zipf's Distribution Parameters from Data

obtain_zipfs_parameters {latentFactoR}

R Documentation

Obtain Zipf's Distribution Parameters from Data

Description

Zipf's distribution is commonly found in text data. Closely related to the Pareto and power-law distributions, it produces highly skewed data. This function estimates the best-fitting Zipf's distribution parameters from data.

Usage

obtain_zipfs_parameters(data)

Arguments

data

Numeric vector, matrix, or data frame. Numeric data from which to estimate Zipf's distribution parameters.

Details

The parameters are estimated by minimizing the absolute difference between the original frequencies and the frequencies implied by the beta and alpha parameters in the following formula (Piantadosi, 2014):

f(r) proportional to 1 / (r + beta)^alpha

where f(r) is the frequency of the rth most frequent value, r is the rank order of the data, beta is a shift in the rank (following Mandelbrot, 1953, 1962), and alpha is the power of the rank. Greater values of alpha imply a steeper drop from the largest frequency to the next largest, and so forth down the ranks.
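The fitting idea described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names zipf_mandelbrot and absolute_loss, the toy data, the starting values, and the bounds passed to optim are all assumptions made for illustration.

```r
# Hypothetical helper: expected relative frequencies under
# f(r) proportional to 1 / (r + beta)^alpha
zipf_mandelbrot <- function(ranks, beta, alpha) {
  unnormalized <- 1 / (ranks + beta)^alpha
  unnormalized / sum(unnormalized) # normalize so frequencies sum to 1
}

# Toy data (assumption): 5000 draws over 20 values with
# Zipf-Mandelbrot probabilities (beta = 2.7, alpha = 1)
set.seed(1)
values <- sample(
  1:20, size = 5000, replace = TRUE,
  prob = zipf_mandelbrot(1:20, beta = 2.7, alpha = 1)
)

# Observed relative frequencies, ordered from most to least frequent
observed <- sort(as.vector(table(values)), decreasing = TRUE)
observed <- observed / sum(observed)
ranks <- seq_along(observed)

# Objective: total absolute difference between observed and
# expected frequencies for a candidate (beta, alpha) pair
absolute_loss <- function(par) {
  expected <- zipf_mandelbrot(ranks, beta = par[1], alpha = par[2])
  sum(abs(observed - expected))
}

# Bounded search (bounds are assumptions) keeps r + beta positive
fit <- optim(
  par = c(beta = 1, alpha = 1),
  fn = absolute_loss,
  method = "L-BFGS-B",
  lower = c(0, 0.01),
  upper = c(50, 10)
)

fit$par # estimated beta and alpha for the toy data
```

Because beta and alpha trade off against each other, different starting values or bounds may give somewhat different estimates on the same data.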

Value

Returns a vector containing the estimated beta and alpha parameters. The vector also contains zipfs_sse, the sum of squared errors between the frequencies implied by the estimated parameter values and the original data frequencies.

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>, Hudson Golino <hfg9s@virginia.edu>, Luis Eduardo Garrido <luisgarrido@pucmm.edu>

References

Mandelbrot, B. (1953). An informational theory of the statistical structure of language. Communication Theory, 84, 486–502.

Mandelbrot, B. (1962). On the theory of word frequencies and on related Markovian models of discourse. Structure of Language and its Mathematical Aspects, 190–219.

Piantadosi, S. T. (2014). Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112–1130.

Examples

# Generate factor data
two_factor <- simulate_factors(
  factors = 2, # factors = 2
  variables = 6, # variables per factor = 6
  loadings = 0.55, # loadings between 0.45 and 0.65
  cross_loadings = 0.05, # cross-loadings N(0, 0.05)
  correlations = 0.30, # correlation between factors = 0.30
  sample_size = 1000 # number of cases = 1000
)

# Transform data to Mandelbrot's Zipf's
two_factor_zipfs <- data_to_zipfs(
  lf_object = two_factor,
  beta = 2.7,
  alpha = 1
)

# Obtain Zipf's distribution parameters
obtain_zipfs_parameters(two_factor_zipfs$data)

[Package latentFactoR version 0.0.6 Index]