R: Transforms 'simulate_factors' Data to Zipf's Distribution

data_to_zipfs {latentFactoR}

R Documentation

Transforms `simulate_factors` Data to Zipf's Distribution

Description

Zipf's distribution is commonly observed in text data. Closely related to the Pareto and power-law distributions, Zipf's distribution produces highly skewed data. This transformation is intended to mirror the data-generating process of Zipf's law seen in semantic network and topic modeling data.

Usage

data_to_zipfs(lf_object, beta = 2.7, alpha = 1, dichotomous = FALSE)

Arguments

`lf_object`	Data object from `simulate_factors`
`beta`	Numeric (length = 1). Sets the shift in rank. Defaults to `2.7`
`alpha`	Numeric (length = 1). Sets the power of the rank. Defaults to `1`
`dichotomous`	Boolean (length = 1). Whether data should be dichotomized rather than returned as frequencies (e.g., for semantic network analysis). Defaults to `FALSE`

Details

The formula used to transform data is (Piantadosi, 2014):

f(r) proportional to 1 / (r + beta)^alpha

where f(r) is the frequency of the rth most frequent item, r is the rank order of the data, beta is a shift in the rank (following Mandelbrot, 1953, 1962), and alpha is the power of the rank, with greater values producing greater differences between the largest frequency and the next largest, and so forth.
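To make the roles of beta and alpha concrete, the formula above can be evaluated directly. The following is a minimal sketch in base R, not the package's implementation; `zipf_mandelbrot` is a hypothetical helper, and normalizing the weights to sum to 1 is an assumption made here for illustration.

```r
# Hypothetical helper (not part of latentFactoR): relative frequencies
# with f(r) proportional to 1 / (r + beta)^alpha
zipf_mandelbrot <- function(n_ranks, beta = 2.7, alpha = 1) {
  ranks <- seq_len(n_ranks)            # r = 1, 2, ..., n_ranks
  weights <- 1 / (ranks + beta)^alpha  # unnormalized f(r)
  weights / sum(weights)               # assumption: normalize to sum to 1
}

default_freq <- zipf_mandelbrot(10)             # beta = 2.7, alpha = 1
steeper_freq <- zipf_mandelbrot(10, alpha = 2)  # larger power of the rank

# Ratio of the largest frequency to the next largest
default_freq[1] / default_freq[2]  # equals (2 + 2.7) / (1 + 2.7)
steeper_freq[1] / steeper_freq[2]  # equals ((2 + 2.7) / (1 + 2.7))^2
```

Raising alpha widens the gap between adjacent ranks, while raising beta flattens the differences among the top ranks.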

The function transforms the continuous data output from `simulate_factors`. See the examples to get started.

Value

Returns a list containing:

`data`	Simulated data that have been transformed to follow Zipf's distribution
`RMSE`	A vector of root mean square errors: one between the transformed data and data assumed to follow the theoretical Zipf's distribution, and one between Spearman's correlation matrix of the transformed data and the original population correlation matrix
`spearman_correlation`	Spearman's correlation matrix of the transformed data
`original_correlation`	Original population correlation matrix before the data were transformed
`original_results`	Original `lf_object` input into function
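As an illustration of the correlation-based comparison described for `RMSE`, the sketch below computes a root mean square error between a Spearman correlation matrix and a population correlation matrix using base R only. It is not the package's code: the `rmse` helper, the stand-in data, the `exp` transformation (used only as an example of a skewing, rank-preserving transformation), and the choice to compare off-diagonal elements are all assumptions.

```r
set.seed(1)  # reproducible illustration

# Hypothetical helper (not part of latentFactoR)
rmse <- function(x, y) sqrt(mean((x - y)^2))

# Assumed population correlation matrix for two variables
population <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

# Stand-in continuous data with that population correlation
n <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + sqrt(1 - 0.5^2) * rnorm(n)
original <- cbind(x1, x2)

# Stand-in skewing transformation; strictly increasing, so ranks are kept
skewed <- exp(original)

# Spearman's correlation depends only on ranks
spearman_skewed <- cor(skewed, method = "spearman")

# RMSE over the off-diagonal (lower-triangle) elements
lower <- lower.tri(population)
correlation_rmse <- rmse(spearman_skewed[lower], population[lower])
```

Because Spearman's correlation is rank-based, a strictly increasing transformation leaves it unchanged, which is why it is a natural choice for comparing highly skewed data against the original correlation structure.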

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>, Hudson Golino <hfg9s@virginia.edu>, Luis Eduardo Garrido <luisgarrido@pucmm.edu>

References

Mandelbrot, B. (1953). An informational theory of the statistical structure of language. Communication Theory, 84, 486–502.

Mandelbrot, B. (1962). On the theory of word frequencies and on related Markovian models of discourse. Structure of Language and its Mathematical Aspects, 190–219.

Piantadosi, S. T. (2014). Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112–1130.

Zipf, G. (1936). The psychobiology of language. London, UK: Routledge.

Zipf, G. (1949). Human behavior and the principle of least effort. New York, NY: Addison-Wesley.

Examples

# Generate factor data
two_factor <- simulate_factors(
  factors = 2, # factors = 2
  variables = 6, # variables per factor = 6
  loadings = 0.55, # loadings between 0.45 and 0.65
  cross_loadings = 0.05, # cross-loadings N(0, 0.05)
  correlations = 0.30, # correlation between factors = 0.30
  sample_size = 1000 # number of cases = 1000
)

# Transform data to Mandelbrot's Zipf's
two_factor_zipfs <- data_to_zipfs(
  lf_object = two_factor,
  beta = 2.7,
  alpha = 1
)

# Transform data to Mandelbrot's Zipf's (dichotomous)
two_factor_zipfs_binary <- data_to_zipfs(
  lf_object = two_factor,
  beta = 2.7,
  alpha = 1,
  dichotomous = TRUE
)

[Package latentFactoR version 0.0.6 Index]
