R: Perform repeated sampling

rep_sample_n {infer}

R Documentation

Perform repeated sampling

Description

These functions extend the functionality of dplyr::sample_n() and dplyr::slice_sample() by allowing for repeated sampling of data. This operation is especially helpful when creating sampling distributions (see the examples below).

Usage

rep_sample_n(tbl, size, replace = FALSE, reps = 1, prob = NULL)

rep_slice_sample(
  .data,
  n = NULL,
  prop = NULL,
  replace = FALSE,
  weight_by = NULL,
  reps = 1
)

Arguments

`tbl`, `.data`	Data frame of population from which to sample.
`size`, `n`, `prop`	`size` and `n` refer to the sample size of each sample. The `size` argument to `rep_sample_n()` is required, while in `rep_slice_sample()` sample size defaults to 1 if not specified. `prop`, an argument to `rep_slice_sample()`, refers to the proportion of rows to sample in each sample, and is rounded down in the case that `prop * nrow(.data)` is not an integer. When using `rep_slice_sample()`, please only supply one of `n` or `prop`.
`replace`	Should samples be taken with replacement?
`reps`	Number of samples to take.
`prob`, `weight_by`	A vector of sampling weights, one for each row of the input data; it must have length equal to the number of rows (`nrow(.data)`). For `weight_by`, this may also be an unquoted column name in `.data`.
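
The difference between `n` and `prop` can be seen in a minimal sketch. It assumes the infer package is installed; `pop` is a hypothetical 10-row data frame invented for illustration, not a dataset shipped with the package.

```r
library(infer)

# Hypothetical 10-row population, used only for illustration
pop <- data.frame(id = 1:10)

# n = 3: each of the 4 replicates holds exactly 3 rows
by_n <- rep_slice_sample(pop, n = 3, reps = 4)

# prop = 0.25: 0.25 * 10 = 2.5 is not an integer, so, as described above,
# the per-replicate sample size is rounded down to 2 rows
by_prop <- rep_slice_sample(pop, prop = 0.25, reps = 4)

nrow(by_n)     # total rows across the 4 replicates when n = 3
nrow(by_prop)  # total rows across the 4 replicates when prop = 0.25
```

Supplying both `n` and `prop` in the same call is not shown here, since the text above asks for only one of them.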

Details

rep_sample_n() and rep_slice_sample() are designed to behave similarly to their dplyr counterparts. Even so, they differ from each other in at least the following way:

When replace = FALSE, asking rep_sample_n() for a size larger than the number of rows in the data gives an error. In rep_slice_sample(), supplying such an n, or prop > 1, gives a warning, and the output sample size is set to the number of rows in the data.

Note that the dplyr::sample_n() function has been superseded by dplyr::slice_sample().
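
The oversampling behavior described in this section can be checked with a short sketch. It assumes the infer package is installed; `pop` is a hypothetical 5-row data frame, and the string "error" is just a marker chosen for this example.

```r
library(infer)

# Hypothetical 5-row population, used only for illustration
pop <- data.frame(id = 1:5)

# rep_sample_n(): requesting 10 rows without replacement is documented to
# raise an error, which is caught here and replaced by a marker string
res_n <- tryCatch(
  rep_sample_n(pop, size = 10, reps = 2),
  error = function(e) "error"
)

# rep_slice_sample(): the same request is documented to warn and fall back
# to nrow(pop) rows per replicate; the warning is suppressed for brevity
res_slice <- suppressWarnings(
  rep_slice_sample(pop, n = 10, reps = 2)
)

res_n            # the marker string if the call errored
nrow(res_slice)  # 2 replicates, each capped at the 5 available rows
```

In real analyses it is usually safer to leave the warning visible, since a silently capped sample size changes what each replicate represents.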

Value

A tibble with reps * n rows, corresponding to reps samples of size n from .data, grouped by replicate.
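
As a sketch of what this return value looks like in practice, the snippet below inspects the row count and grouping of a small result. It assumes infer and dplyr are installed; `pop` is a hypothetical 20-row data frame made up for this illustration.

```r
library(infer)
library(dplyr)

# Hypothetical 20-row population, used only for illustration
pop <- data.frame(id = 1:20)

# 3 replicates of 4 rows each, sampled without replacement
samples <- rep_slice_sample(pop, n = 4, reps = 3)

# Total number of rows: reps * n
nrow(samples)

# Name of the grouping column on the returned tibble
group_vars(samples)

# Rows per replicate; count() on a grouped tibble tallies within each group
per_replicate <- count(samples)
per_replicate
```

Because the result is already grouped by replicate, a following summarize() call computes one statistic per sample without an explicit group_by().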

Examples

library(infer)   # provides rep_slice_sample() and the gss data set
library(dplyr)
library(ggplot2)
library(tibble)

# take 1000 samples of size n = 50, without replacement
slices <- gss %>%
  rep_slice_sample(n = 50, reps = 1000)

slices

# compute the proportion of respondents with a college
# degree in each replicate
p_hats <- slices %>%
  group_by(replicate) %>%
  summarize(prop_college = mean(college == "degree"))

# plot sampling distribution
ggplot(p_hats, aes(x = prop_college)) +
  geom_density() +
  labs(
    x = "p_hat", y = "Density",
    title = "Sampling distribution of p_hat"
  )

# sampling with probability weights. Note probabilities are automatically
# renormalized to sum to 1
df <- tibble(
  id = 1:5,
  letter = factor(c("a", "b", "c", "d", "e"))
)

rep_slice_sample(df, n = 2, reps = 5, weight_by = c(.5, .4, .3, .2, .1))

# alternatively, pass an unquoted column name in `.data` as `weight_by`
df <- df %>% mutate(wts = c(.5, .4, .3, .2, .1))

rep_slice_sample(df, n = 2, reps = 5, weight_by = wts)

[Package infer version 1.0.7 Index]