weighted_quantile {ggdist}

R Documentation

Weighted sample quantiles

Description

A variation of quantile() that can be applied to weighted samples.

Usage

weighted_quantile(
  x,
  probs = seq(0, 1, 0.25),
  weights = NULL,
  n = NULL,
  na.rm = FALSE,
  names = TRUE,
  type = 7,
  digits = 7
)

weighted_quantile_fun(x, weights = NULL, n = NULL, na.rm = FALSE, type = 7)

Arguments

`x`	numeric vector: sample values
`probs`	numeric vector: probabilities in `[0, 1]`
`weights`	Weights for the sample. One of:
- numeric vector of same length as `x`: weights for corresponding values in `x`, which will be normalized to sum to 1.
- `NULL`: indicates no weights are provided, so unweighted sample quantiles (equivalent to `quantile()`) are returned.
`n`	Presumed effective sample size. If this is greater than 1 and continuous quantiles (`type >= 4`) are requested, flat regions may be added to the approximation to the inverse CDF in areas where the normalized weight exceeds `1/n` (i.e., regions of high density). This can be used to ensure that if a sample of size `n` with duplicate `x` values is summarized into a weighted sample without duplicates, the result of `weighted_quantile(..., n = n)` on the weighted sample is equal to the result of `quantile()` on the original sample. One of:
- `NULL`: do not make a sample size adjustment.
- numeric: presumed effective sample size.
- function or name of function (as a string): a function applied to `weights` (prior to normalization) to determine the sample size. Some useful values may be:
  - `"length"`: use the number of elements in `weights` (equivalently in `x`) as the effective sample size.
  - `"sum"`: use the sum of the unnormalized `weights` as the sample size. Useful if the provided `weights` is unnormalized so that its sum represents the true sample size.
`na.rm`	logical: if `TRUE`, corresponding entries in `x` and `weights` are removed if either is `NA`.
`names`	logical: If `TRUE`, add names to the output giving the input `probs` formatted as a percentage.
`type`	integer between 1 and 9: determines the type of quantile estimator to be used. Types 1 to 3 are for discontinuous quantiles, types 4 to 9 are for continuous quantiles. See Details.
`digits`	numeric: the number of digits to use to format percentages when `names` is `TRUE`.
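
The role of `n` described above can be illustrated with a small sketch. It assumes the ggdist package is installed (the call is skipped otherwise); the sample values and the variable names `x_unique`, `w`, `adjusted`, and `unadjusted` are made up for illustration. A sample with duplicates is summarized into unique values and counts, and the weighted quantiles with and without the sample size adjustment are printed next to `quantile()` on the original sample.

```r
# Original sample with duplicated values (illustrative data)
x <- c(1, 2, 2, 3, 3, 3)

# Summarize into unique values, with counts used as unnormalized weights
counts <- table(x)
x_unique <- as.numeric(names(counts))
w <- as.numeric(counts)

probs <- c(0.1, 0.5, 0.9)

if (requireNamespace("ggdist", quietly = TRUE)) {
  # n = "sum" uses sum(w), here 6, as the presumed sample size
  adjusted <- ggdist::weighted_quantile(x_unique, probs, weights = w, n = "sum")
  # n = NULL (the default) makes no sample size adjustment
  unadjusted <- ggdist::weighted_quantile(x_unique, probs, weights = w)
  print(rbind(original = quantile(x, probs), adjusted, unadjusted))
}
```

According to the description of `n`, the `adjusted` row is the one expected to agree with `quantile()` on the original sample.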

Details

Calculates weighted quantiles by generalizing the quantile types defined in quantile() to weighted samples.

Type 1–3 (discontinuous) quantiles are directly a function of the inverse CDF as a step function, and so can be directly translated to the weighted case using the natural definition of the weighted ECDF as the cumulative sum of the normalized weights.
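
As a minimal base R sketch of this idea (not ggdist's implementation), the type 1 quantile can be read off the weighted ECDF as the smallest value whose cumulative normalized weight reaches the requested probability. The helper name `weighted_quantile_type1` and the data are hypothetical.

```r
# Hypothetical helper: type 1 quantile as the inverse of the weighted ECDF.
weighted_quantile_type1 <- function(x, probs, weights = rep(1, length(x))) {
  ord <- order(x)
  x_sorted <- x[ord]
  w <- weights[ord] / sum(weights)  # normalize weights to sum to 1
  cdf <- cumsum(w)                  # weighted ECDF at each sorted value
  # Smallest sorted value whose ECDF is >= p; the small tolerance guards
  # against floating-point error in the cumulative sum.
  vapply(probs, function(p) x_sorted[which(cdf >= p - 1e-12)[1]], numeric(1))
}

x <- c(3, 1, 4, 1.5, 9)
probs <- c(0.1, 0.25, 0.5, 0.9)

# Equal weights: agrees with quantile(x, probs, type = 1)
weighted_quantile_type1(x, probs)

# Putting most of the weight on the value 4 pulls the middle quantiles towards it
weighted_quantile_type1(x, probs, weights = c(1, 1, 5, 1, 1))
```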

Type 4–9 (continuous) quantiles require some translation from the definitions in quantile(). quantile() defines continuous estimators in terms of x_k, which is the kth order statistic, and p_k, which is a function of k and n (the sample size). In the weighted case, we instead take x_k as the kth smallest value of x in the weighted sample (not necessarily an order statistic, because of the weights). Then we can re-write the formulas for p_k in terms of F(x_k) (the empirical CDF at x_k, i.e. the cumulative sum of normalized weights) and f(x_k) (the normalized weight at x_k), by using the fact that, in the unweighted case, k = F(x_k) \cdot n and 1/n = f(x_k):

Type 4: p_k = \frac{k}{n} = F(x_k)
Type 5: p_k = \frac{k - 0.5}{n} = F(x_k) - \frac{f(x_k)}{2}
Type 6: p_k = \frac{k}{n + 1} = \frac{F(x_k)}{1 + f(x_k)}
Type 7: p_k = \frac{k - 1}{n - 1} = \frac{F(x_k) - f(x_k)}{1 - f(x_k)}
Type 8: p_k = \frac{k - 1/3}{n + 1/3} = \frac{F(x_k) - f(x_k)/3}{1 + f(x_k)/3}
Type 9: p_k = \frac{k - 3/8}{n + 1/4} = \frac{F(x_k) - f(x_k) \cdot 3/8}{1 + f(x_k)/4}
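
The identities above can be checked numerically. The sketch below defines a hypothetical helper `p_k()` that evaluates the right-hand sides in terms of F(x_k) and f(x_k), then compares them with the classical formulas in k and n for an unweighted sample, where F(x_k) = k/n and f(x_k) = 1/n.

```r
# Hypothetical helper: p_k for continuous types 4 to 9, written in terms of
# F_k (cumulative normalized weight) and f_k (normalized weight).
p_k <- function(F_k, f_k, type = 7) {
  switch(as.character(type),
    "4" = F_k,
    "5" = F_k - f_k / 2,
    "6" = F_k / (1 + f_k),
    "7" = (F_k - f_k) / (1 - f_k),
    "8" = (F_k - f_k / 3) / (1 + f_k / 3),
    "9" = (F_k - f_k * 3 / 8) / (1 + f_k / 4),
    stop("type must be between 4 and 9")
  )
}

# Unweighted case: k = 1..n, F(x_k) = k / n, f(x_k) = 1 / n
n <- 5
k <- seq_len(n)
F_k <- k / n
f_k <- rep(1 / n, n)

all.equal(p_k(F_k, f_k, type = 7), (k - 1) / (n - 1))
all.equal(p_k(F_k, f_k, type = 8), (k - 1/3) / (n + 1/3))
```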

Then the quantile function (inverse CDF) is the piece-wise linear function defined by the points (p_k, x_k).
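
A minimal base R sketch of this construction for type 7, using `approx()` to interpolate linearly through the points (p_k, x_k). The helper name `weighted_quantile_type7` is hypothetical, and the sketch assumes strictly positive weights and at least two values. It ignores the `n` adjustment and other edge cases, so it is not a substitute for `weighted_quantile()`.

```r
# Hypothetical helper: weighted type 7 quantile by linear interpolation
# through the points (p_k, x_k).
weighted_quantile_type7 <- function(x, probs, weights = rep(1, length(x))) {
  ord <- order(x)
  x_k <- x[ord]
  f_k <- weights[ord] / sum(weights)  # normalized weights
  F_k <- cumsum(f_k)                  # weighted ECDF at each x_k
  p_k <- (F_k - f_k) / (1 - f_k)      # type 7 formula in terms of F and f
  # rule = 2 clamps probabilities outside range(p_k) to the nearest x_k
  approx(p_k, x_k, xout = probs, rule = 2)$y
}

x <- c(3, 1, 4, 1.5, 9)
probs <- c(0.1, 0.25, 0.5, 0.9)

# Equal weights: agrees with quantile(x, probs, type = 7)
all.equal(weighted_quantile_type7(x, probs),
          unname(quantile(x, probs, type = 7)))

# Unequal weights: the interpolation points p_k shift with the weights
weighted_quantile_type7(x, probs, weights = c(1, 1, 5, 1, 1))
```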

Value

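weighted_quantile() returns a numeric vector of length(probs) containing the estimated quantile for each probability in probs.

weighted_quantile_fun() returns a function that takes a single argument, a vector of probabilities, and returns the corresponding quantile estimates. It may be useful when quantiles need to be computed repeatedly for the same sample, as some pre-computation can be re-used.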