obtain_zipfs_parameters {latentFactoR} | R Documentation |
Obtain Zipf's Distribution Parameters from Data
Description
Zipf's distribution is commonly found in text data. Closely related to the Pareto and power-law distributions, Zipf's distribution produces highly skewed data. This function obtains the best-fitting parameters of Zipf's distribution for the supplied data.
Usage
obtain_zipfs_parameters(data)
Arguments
data: Numeric vector, matrix, or data frame. Numeric data from which to estimate the parameters of Zipf's distribution
Details
The best parameters are found by minimizing the absolute difference between the original frequencies and the frequencies implied by the beta and alpha parameters in the following formula (Piantadosi, 2014):
f(r) proportional to 1 / (r + beta)^alpha
where f(r) is the frequency of the rth most frequent value, r is the rank order of the data, beta is a shift in the rank (following Mandelbrot, 1953, 1962), and alpha is the power of the rank, with greater values suggesting greater differences between the largest frequency and the next, and so forth.
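The fitting idea described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `zipf_mandelbrot()` and `fit_zipf_sketch()` are hypothetical helper names, frequencies are normalized to proportions, and the optimizer (`optim()` with `L-BFGS-B`), starting values, and bounds are choices made only for this example.

```r
# Expected relative frequencies under f(r) proportional to 1 / (r + beta)^alpha
zipf_mandelbrot <- function(ranks, beta, alpha) {
  weights <- 1 / (ranks + beta)^alpha
  weights / sum(weights) # normalize so the frequencies sum to 1
}

# Hypothetical helper (not part of latentFactoR): estimate beta and alpha by
# minimizing the summed absolute difference to the observed frequencies
fit_zipf_sketch <- function(values) {
  # Observed relative frequencies, ordered from most to least frequent
  counts <- sort(as.vector(table(values)), decreasing = TRUE)
  observed <- counts / sum(counts)
  ranks <- seq_along(observed)

  # Objective: total absolute difference for a candidate (beta, alpha)
  objective <- function(par) {
    sum(abs(observed - zipf_mandelbrot(ranks, par[1], par[2])))
  }

  # Bounds are assumptions; they keep (r + beta) positive and alpha above zero
  fit <- optim(
    par = c(1, 1),
    fn = objective,
    method = "L-BFGS-B",
    lower = c(0, 0.01),
    upper = c(50, 10)
  )

  fitted <- zipf_mandelbrot(ranks, fit$par[1], fit$par[2])
  c(
    beta = fit$par[1],
    alpha = fit$par[2],
    sse = sum((observed - fitted)^2) # sum of squared errors of the fit
  )
}

# Toy data drawn with probabilities that follow the formula
# (beta = 2.7, alpha = 1)
set.seed(1)
toy <- sample(
  1:20, size = 5000, replace = TRUE,
  prob = zipf_mandelbrot(1:20, beta = 2.7, alpha = 1)
)
fit_zipf_sketch(toy)
```

Because beta and alpha can partly compensate for each other, the estimates from a finite sample may differ from the generating values even when the fitted frequencies are close to the observed ones.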
Value
Returns a vector containing the estimated beta and alpha parameters. Also contains zipfs_sse, which corresponds to the sum of squared errors between the frequencies implied by the estimated parameter values and the original data frequencies
Author(s)
Alexander P. Christensen <alexpaulchristensen@gmail.com>, Hudson Golino <hfg9s@virginia.edu>, Luis Eduardo Garrido <luisgarrido@pucmm.edu>
References
Mandelbrot, B. (1953). An informational theory of the statistical structure of language. Communication Theory, 84, 486–502.
Mandelbrot, B. (1962). On the theory of word frequencies and on related Markovian models of discourse. Structure of Language and its Mathematical Aspects, 190–219.
Piantadosi, S. T. (2014). Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112–1130.
Examples
# Load the package that provides the functions used below
library(latentFactoR)

# Generate factor data
two_factor <- simulate_factors(
  factors = 2, # factors = 2
  variables = 6, # variables per factor = 6
  loadings = 0.55, # loadings between = 0.45 to 0.65
  cross_loadings = 0.05, # cross-loadings N(0, 0.05)
  correlations = 0.30, # correlation between factors = 0.30
  sample_size = 1000 # number of cases = 1000
)

# Transform data to Mandelbrot's Zipf's distribution
two_factor_zipfs <- data_to_zipfs(
  lf_object = two_factor,
  beta = 2.7,
  alpha = 1
)

# Obtain Zipf's distribution parameters
obtain_zipfs_parameters(two_factor_zipfs$data)