edf {timeplyr}    R Documentation

Grouped empirical cumulative distribution function applied to data

Description

Like dplyr::cume_dist(x) and ecdf(x)(x), but with added support for groups and frequency weights.
By supplying frequency weights, you can calculate the empirical distribution of x directly from aggregated data. The data are never expanded to one element per observation, which makes the function efficient for aggregated data. Plotting is a common application.
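
To illustrate what "no expansion" means, here is a minimal base R sketch. The helper weighted_ecdf() is hypothetical and written only for this illustration; it is not part of timeplyr and makes no claim about how edf() is implemented. It shows that a cumulative sum of frequency weights reproduces the result of expanding the data with rep() and calling ecdf().

```r
# Hypothetical helper for illustration only; not timeplyr's implementation.
# For each x[i], returns (total weight of values <= x[i]) / (total weight).
weighted_ecdf <- function(x, wt = rep(1, length(x))) {
  ux <- sort(unique(x))
  # Total weight carried by each distinct value of x
  w <- vapply(ux, function(v) sum(wt[x == v]), numeric(1))
  # Cumulative proportion at each distinct value
  cum <- cumsum(w) / sum(w)
  # Map the cumulative proportions back to the original positions of x
  cum[match(x, ux)]
}

# Aggregated data: four distinct values and how often each occurred
values <- c(1, 2, 3, 4)
counts <- c(2, 5, 1, 2)

# ECDF from the aggregated data, without expanding it
agg <- weighted_ecdf(values, wt = counts)

# The same quantity after expanding to one element per observation
expanded <- rep(values, times = counts)
full <- ecdf(expanded)(values)

all.equal(agg, full)
```

The aggregated version touches only 4 elements, whereas the expanded version needs all 10; the gap grows with the size of the counts.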

Usage

edf(x, g = NULL, wt = NULL)

Arguments

x

Numeric vector.

g

Numeric vector of group IDs.

wt

Frequency weights.

Value

A numeric vector the same length as x.
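
As a rough picture of the grouped return value, the base R sketch below computes a per-group ECDF with ave(). This is an assumption-laden illustration that relies only on the description above (a grouped analogue of ecdf(x)(x)); it does not call timeplyr and is not its implementation. The point is that the result has one element per element of x, in the original order, regardless of how the groups are interleaved.

```r
set.seed(42)
x <- sample(seq(0, 5, 0.5), size = 20, replace = TRUE)
g <- sample(c("a", "b", "c"), size = 20, replace = TRUE)

# ave() applies the function within each group and writes the results
# back at the original positions, so the output lines up with x.
grouped <- ave(x, g, FUN = function(v) ecdf(v)(v))

# One value per element of x
length(grouped) == length(x)

# Within a single group, the values match an ungrouped ECDF on that subset
in_a <- g == "a"
all.equal(grouped[in_a], ecdf(x[in_a])(x[in_a]))
```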

Examples

library(timeplyr)
library(dplyr)
library(ggplot2)

set.seed(9123812)
x <- sample(seq(-10, 10, 0.5), size = 10^2, replace = TRUE)
plot(sort(edf(x)))
all.equal(edf(x), ecdf(x)(x))
all.equal(edf(x), cume_dist(x))

# Manual ECDF plot using only aggregate data
y <- rnorm(100, 10)
start <- floor(min(y) / 0.1) * 0.1
grid <- time_span(y, time_by = 0.1, from = start)
counts <- time_countv(y, time_by = 0.1, from = start, complete = TRUE)$n
edf <- edf(grid, wt = counts)
# Trivial check: with one count per sorted grid point,
# the weighted ECDF is just the cumulative proportion of counts
all.equal(unname(cumsum(counts)/sum(counts)), edf)

# Full ECDF of y from the raw data
tibble(y) %>%
  ggplot(aes(x = y)) +
  stat_ecdf()
# Approximation using only the aggregate data
tibble(grid, edf) %>%
  ggplot(aes(x = grid, y = edf)) +
  geom_step()

# Grouped example
g <- sample(letters[1:3], size = 10^2, replace = TRUE)

edf1 <- tibble(x, g) %>%
  mutate(edf = cume_dist(x),
         .by = g) %>%
  pull(edf)
edf2 <- edf(x, g = g)
all.equal(edf1, edf2)


[Package timeplyr version 0.8.1 Index]