grts {spsurvey} R Documentation

Select a generalized random tessellation stratified (GRTS) sample

Description

Select a spatially balanced sample from a point (finite), linear / linestring (infinite), or areal / polygon (infinite) sampling frame using the Generalized Random Tessellation Stratified (GRTS) algorithm. The GRTS algorithm accommodates unstratified and stratified sampling designs and allows for equal inclusion probabilities, unequal inclusion probabilities according to a categorical variable, and inclusion probabilities proportional to a positive auxiliary variable. Several additional sampling options are included, such as including legacy (historical) sites, requiring a minimum distance between sites, and selecting replacement sites. For technical details, see Stevens and Olsen (2004).

Usage

grts(
  sframe,
  n_base,
  stratum_var = NULL,
  seltype = NULL,
  caty_var = NULL,
  caty_n = NULL,
  aux_var = NULL,
  legacy_var = NULL,
  legacy_sites = NULL,
  legacy_stratum_var = NULL,
  legacy_caty_var = NULL,
  legacy_aux_var = NULL,
  mindis = NULL,
  maxtry = 10,
  n_over = NULL,
  n_near = NULL,
  wgt_units = NULL,
  pt_density = NULL,
  DesignID = "Site",
  SiteBegin = 1,
  sep = "-",
  projcrs_check = TRUE
)

Arguments

sframe

A sampling frame as an sf object. The coordinate system for sframe must be projected (not geographic). If m or z values are in sframe's geometry, they are silently dropped (i.e., only x-coordinates and y-coordinates are preserved).

n_base

The base sample size required. If the sampling design is unstratified, this is a single numeric value. If the sampling design is stratified, this is a named vector or list whose names represent each stratum and whose values represent each stratum's sample size. These names must match the values of the stratification variable represented by stratum_var. Legacy sites are considered part of the base sample, so the value for n_base should be equal to the number of legacy sites plus the number of desired non-legacy sites.

stratum_var

A character string containing the name of the column from sframe that identifies stratum membership for each element in sframe. If stratum_var equals NULL, the sampling design is unstratified and all elements in sframe are eligible to be selected in the sample. The default is NULL.

seltype

A character string or vector indicating the inclusion probability type, which must be one of the following: "equal" for equal inclusion probabilities; "unequal" for unequal inclusion probabilities according to a categorical variable specified by caty_var; and "proportional" for inclusion probabilities proportional to a positive auxiliary variable specified by aux_var. If the sampling design is unstratified, seltype is a single character string. If the sampling design is stratified, seltype is a named vector whose names represent each stratum and whose values represent each stratum's inclusion probability type. seltype's default value tries to match the intended inclusion probability type: if caty_var and aux_var are not specified, seltype is "equal"; if caty_var is specified, seltype is "unequal"; and if aux_var is specified, seltype is "proportional".

caty_var

A character string containing the name of the column from sframe that represents the unequal probability variable.

caty_n

A named numeric vector (or list) indicating the expected sample size for each level of caty_var, the unequal probability variable. If the sampling design is unstratified, caty_n is a named vector whose names represent each level of caty_var and whose values represent each level's expected sample size. The sum of caty_n must equal n_base. If the sampling design is stratified and the expected sample sizes are the same among strata, caty_n is a named vector whose names represent each level of caty_var and whose values represent each level's expected sample size; these expected sample sizes are applied to all strata. The sum of caty_n must equal each stratum's value in n_base. If the sampling design is stratified and the expected sample sizes differ among strata, caty_n is a list where each element is named as a stratum in n_base. Each stratum's list element is a named vector whose names represent each level of caty_var and whose values represent each level's expected sample size (within the stratum). The sum of the values in each stratum's list element must equal that stratum's value in n_base.
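The two main shapes of caty_n can be sketched as follows. This is a minimal illustration, assuming the NE_Lakes example data shipped with spsurvey has an AREA_CAT column with levels "small" and "large" and an ELEV_CAT column with levels "low" and "high"; the sample sizes are arbitrary.

```r
library(spsurvey)

# Unstratified: a named vector whose values sum to n_base (10 + 40 = 50).
caty_n <- c(small = 10, large = 40)
samp_caty <- grts(
  NE_Lakes,
  n_base = 50,
  caty_var = "AREA_CAT",
  caty_n = caty_n
)

# Stratified with allocations that differ among strata: a list named by
# stratum, where each element sums to that stratum's value in n_base.
strata_n <- c(low = 20, high = 10)
caty_n_strat <- list(
  low = c(small = 10, large = 10),
  high = c(small = 5, large = 5)
)
samp_caty_strat <- grts(
  NE_Lakes,
  n_base = strata_n,
  stratum_var = "ELEV_CAT",
  caty_var = "AREA_CAT",
  caty_n = caty_n_strat
)
```

Because caty_var is supplied, seltype is left at its default and resolves to "unequal".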

aux_var

A character string containing the name of the column from sframe that represents the proportional (to size) inclusion probability variable (auxiliary variable). This auxiliary variable must be positive, and the resulting inclusion probabilities are proportional to the values of the auxiliary variable. Larger values of the auxiliary variable result in higher inclusion probabilities.
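As a brief sketch of proportional inclusion probabilities, the call below assumes that the NE_Lakes example data has a positive numeric AREA column to serve as the auxiliary variable.

```r
library(spsurvey)

# Inclusion probabilities proportional to lake area: larger lakes are
# more likely to be selected. seltype defaults to "proportional" here
# because aux_var is supplied.
samp_prop <- grts(NE_Lakes, n_base = 50, aux_var = "AREA")
```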

legacy_var

This argument can be used instead of legacy_sites when sframe has a POINT or MULTIPOINT geometry (i.e., a finite sampling frame). When legacy_var is used, it is a character string containing the name of the column from sframe that represents whether each site is a legacy site. For legacy sites, the values of the legacy_var column must be character strings that act as a legacy site identifier. For non-legacy sites, the values of the legacy_var column must be NA. Using this approach, legacy_stratum_var, legacy_caty_var, and legacy_aux_var are not required and should not be used (because legacy_var represents a column in sframe). spsurvey assumes that the legacy sites were selected from a previous sampling design that incorporated randomness into site selection and that the legacy sites are elements of the current sampling frame.

legacy_sites

An sf object with a POINT or MULTIPOINT geometry representing the legacy sites. spsurvey assumes that the legacy sites were selected from a previous sampling design that incorporated randomness into site selection and that the legacy sites are elements of the current sampling frame. If sframe has a POINT or MULTIPOINT geometry, the observations in legacy_sites should not also be in sframe (i.e., duplicates are not removed). Thus, sframe and legacy_sites together compose the current sampling frame. If m or z values are in legacy_sites' geometry, they are silently dropped (i.e., only x-coordinates and y-coordinates are preserved).
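A minimal sketch of supplying legacy sites as a separate sf object follows. It assumes the NE_Lakes_Legacy example data shipped with spsurvey, whose points are not duplicated in NE_Lakes.

```r
library(spsurvey)

# Legacy sites count toward n_base, so n_base = 50 requests 50 sites in
# total: the legacy sites plus enough new sites to reach 50.
samp_legacy <- grts(NE_Lakes, n_base = 50, legacy_sites = NE_Lakes_Legacy)

# Legacy and non-legacy base sites are returned as separate elements.
n_legacy <- nrow(samp_legacy$sites_legacy)
n_new <- nrow(samp_legacy$sites_base)
```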

legacy_stratum_var

A character string containing the name of the column from legacy_sites that identifies stratum membership for each element of legacy_sites. This argument is required when the sampling design is stratified and its levels must be contained in the levels of the stratum_var variable. The default value of legacy_stratum_var is stratum_var, so legacy_stratum_var need only be specified explicitly when the name of the stratification variable in legacy_sites differs from stratum_var.

legacy_caty_var

A character string containing the name of the column from legacy_sites that identifies the unequal probability variable for each element of legacy_sites. This argument is required when the sampling design uses unequal selection probabilities and its categories must be contained in the levels of the caty_var variable. The default value of legacy_caty_var is caty_var, so legacy_caty_var need only be specified explicitly when the name of the unequal probability variable in legacy_sites differs from caty_var.

legacy_aux_var

A character string containing the name of the column from legacy_sites that identifies the proportional probability variable for each element of legacy_sites. This argument is required when the sampling design uses proportional selection probabilities and the values of the legacy_aux_var variable must be positive. The default value of legacy_aux_var is aux_var, so legacy_aux_var need only be specified explicitly when the name of the proportional probability variable in legacy_sites differs from aux_var.

mindis

A numeric value indicating the desired minimum distance between sampled sites. If the sampling design is stratified and mindis is a numeric value, the minimum distance is applied to all strata. If the sampling design is stratified and different minimum distances are desired among strata, then mindis is a list whose names match the names of n_base and whose values are the minimum distance for the corresponding stratum. If a minimum distance is not desired for a particular stratum, then the corresponding value in mindis should be 0 or NULL (which is equivalent to 0). The units of mindis must match the units of sframe. A warning is returned if the minimum distance could not be reached after maxtry attempts. If legacy sites are used, the minimum distance requirement (and subsequent warning if maxtry attempts are reached) is enforced only for base sites that are not legacy sites; for these sites, distances are compared against all base sites, both legacy and non-legacy.
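The single-value and per-stratum forms of mindis might look like the sketch below. It assumes the NE_Lakes example data uses a projected coordinate system measured in meters; the 1600 m threshold is an arbitrary illustration.

```r
library(spsurvey)

# Unstratified: one minimum distance for the whole sample, in the units
# of sframe. A warning is possible if it cannot be met in maxtry attempts.
samp_mindis <- grts(NE_Lakes, n_base = 50, mindis = 1600, maxtry = 10)

# Stratified: a list named like n_base; 0 means no minimum distance is
# requested for that stratum.
strata_n <- c(low = 25, high = 30)
samp_mindis_strat <- grts(
  NE_Lakes,
  n_base = strata_n,
  stratum_var = "ELEV_CAT",
  mindis = list(low = 1600, high = 0)
)
```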

maxtry

The maximum number of attempts to apply the minimum distance algorithm to obtain the desired minimum distance between sites. Each iteration takes roughly as long as the standard GRTS algorithm. Each successive iteration contains at least as many sites satisfying the minimum distance requirement as the previous iteration. The algorithm stops when the minimum distance requirement is met or after maxtry iterations. The default maximum number of iterations is 10.

n_over

The number of reverse hierarchically ordered (rho) replacement sites. If the sampling design is unstratified, then n_over is an integer specifying the number of rho replacement sites desired. If the sampling design is stratified, then n_over is a vector (or list) whose names match the names of n_base and whose values indicate the number of rho replacement sites for each stratum. If replacement sites are not desired for a particular stratum, then the corresponding value in n_over should be 0 or NULL (which is equivalent to 0). If the sampling design is stratified and n_over is an unnamed, length-one vector, its value is recycled and used for each stratum. Note that if the sampling design has unequal selection probabilities (seltype = "unequal"), then n_over sites are given the same proportion of caty_n values as n_base.
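For a stratified design, per-stratum replacement counts can be passed as a named vector. The sketch below assumes the NE_Lakes example data and its ELEV_CAT levels "low" and "high"; the counts are arbitrary.

```r
library(spsurvey)

strata_n <- c(low = 25, high = 30)

# Named n_over: 5 rho replacement sites in "low" and 3 in "high".
samp_over_strat <- grts(
  NE_Lakes,
  n_base = strata_n,
  stratum_var = "ELEV_CAT",
  n_over = c(low = 5, high = 3)
)

# Replacement sites are returned separately from the base sites.
n_replacements <- nrow(samp_over_strat$sites_over)
```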

n_near

The number of nearest neighbor (nn) replacement sites. If the sampling design is unstratified, n_near is an integer from 1 to 10 specifying the number of nn replacement sites to be selected for each base site. If the sampling design is stratified but the same number of nn replacement sites is desired for each stratum, n_near is an integer from 1 to 10 specifying the number of nn replacement sites to be selected for each base site. If the sampling design is stratified and a different number of nn replacement sites is desired for each stratum, n_near is a vector (or list) whose names represent strata and whose values are integers from 1 to 10 specifying the number of nn replacement sites to be selected for each base site in the stratum. If replacement sites are not desired for a particular stratum, then the corresponding value in n_near should be 0 or NULL (which is equivalent to 0). For infinite sampling frames, the distance between a site and its nn depends on pt_density. The larger pt_density, the closer the nearest neighbors.
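Since n_near is specified per base site, the number of nn replacement sites scales with n_base. A minimal sketch on the finite NE_Lakes example frame, with arbitrary sample sizes:

```r
library(spsurvey)

# One nearest neighbor replacement site for each of the 50 base sites.
samp_near <- grts(NE_Lakes, n_base = 50, n_near = 1)

# The nn replacement sites are returned in their own element.
n_nn <- nrow(samp_near$sites_near)
```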

wgt_units

The units used to compute the design weights. These units must be standard units as defined by the set_units() function in the units package. The default units match the units of the sf object.

pt_density

A positive integer controlling the density of the GRTS approximation for infinite sampling frames. The GRTS approximation for infinite sampling frames vastly improves computational efficiency by generating many finite points and selecting a sample from the points. pt_density represents the density of finite points per unit to use in the approximation. More specifically, for each stratum, the number of points used in the approximation equals pt_density * (n_base + n_over). A larger value of pt_density means a closer approximation to the infinite sampling frame but less computational efficiency. The default value of pt_density is 10. Note that when used with caty_n, the unequal inclusion probabilities generated from this approach are also approximations.
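The point-count formula above can be worked through in base R. The helper n_approx_points() is hypothetical, written only to illustrate the arithmetic; it is not part of spsurvey.

```r
# Hypothetical helper: number of finite points generated per stratum when
# approximating an infinite (linear or areal) sampling frame.
n_approx_points <- function(n_base, n_over = 0, pt_density = 10) {
  pt_density * (n_base + n_over)
}

# Default density with 50 base sites and 10 rho replacement sites.
pts_default <- n_approx_points(n_base = 50, n_over = 10)

# Doubling pt_density doubles the number of points (and the work).
pts_dense <- n_approx_points(n_base = 50, n_over = 10, pt_density = 20)
```

Here pts_default is 10 * (50 + 10) = 600 and pts_dense is 20 * (50 + 10) = 1200.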

DesignID

A character string indicating the naming structure for each site's identifier selected in the sample, which is matched with SiteBegin and included as a variable in the sf object in the function's output. Default is "Site".

SiteBegin

The first number to pair with DesignID when creating each site's identifier. Successive sites are given successive integers. The default starting number is 1, and the number of digits is equal to the number of digits in n_base + n_over. For example, if n_base is 50 and n_over is 0, then the default site identifiers are Site-01 to Site-50.

sep

A character string that acts as a separator between DesignID and SiteBegin. The default is "-".
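The way DesignID, SiteBegin, and sep combine into site identifiers can be mimicked in base R. The function make_site_ids() below is a hypothetical illustration of the naming rule described above, not spsurvey's internal code.

```r
# Hypothetical helper reproducing the documented naming rule: numbers are
# zero-padded to the number of digits in n_base + n_over.
make_site_ids <- function(n_base, n_over = 0, DesignID = "Site",
                          SiteBegin = 1, sep = "-") {
  n_total <- n_base + n_over
  width <- nchar(as.character(n_total))
  numbers <- as.integer(seq(SiteBegin, length.out = n_total))
  paste(DesignID, formatC(numbers, width = width, flag = "0"), sep = sep)
}

ids <- make_site_ids(n_base = 50)
first_id <- ids[1]
last_id <- ids[length(ids)]
```

With the defaults, first_id is "Site-01" and last_id is "Site-50", matching the example given for SiteBegin.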

projcrs_check

A check for whether the coordinates are projected. If TRUE, an error is returned if coordinates are not projected (i.e., they are geographic or NA). If FALSE, the check is not performed, which means that the crs in sframe (and legacy_sites if provided) can be projected, geographic, or NA.

Details

n_base is the number of sites used to calculate the design weights, which is typically the number of sites used in an analysis. When a panel sampling design is implemented, n_base is typically the number of sites in all panels that will be sampled in the same temporal period – n_base is not the total number of sites in all panels. The sum of n_base and n_over is equal to the total number of sites to be visited for all panels plus any replacement sites that may be required.

Value

The sampling design sites and additional information about the sampling design. More specifically, it is a list with five elements.

When non-NULL, the sites_legacy, sites_base, sites_over, and sites_near objects contain the original columns in sframe and include a few additional columns.

If any columns in sframe share a name with one of these additional columns, those columns from sframe will be automatically prefixed with sframe_ in the sites object. When output is printed, a summary of site counts by the levels in stratum_var and caty_var is shown.
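A short sketch of inspecting the returned list follows. It assumes the NE_Lakes example data with its AREA column; the sample sizes are arbitrary.

```r
library(spsurvey)

samp <- grts(NE_Lakes, n_base = 30, n_over = 5)

# Site elements are accessed by name; elements for options that were not
# requested (legacy or nearest neighbor sites here) are NULL.
base_sites <- samp$sites_base
over_sites <- samp$sites_over
no_legacy <- is.null(samp$sites_legacy)
no_near <- is.null(samp$sites_near)

# Original sframe columns are carried through to the site objects.
has_area <- "AREA" %in% names(base_sites)
```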

Author(s)

Tony Olsen olsen.tony@epa.gov

References

Stevens Jr., Don L. and Olsen, Anthony R. (2004). Spatially balanced sampling of natural resources. Journal of the American Statistical Association, 99(465), 262-278.

See Also

irs

to select a sample that is not spatially balanced

Examples

## Not run: 
samp <- grts(NE_Lakes, n_base = 100)
print(samp)
strata_n <- c(low = 25, high = 30)
samp_strat <- grts(NE_Lakes, n_base = strata_n, stratum_var = "ELEV_CAT")
print(samp_strat)
samp_over <- grts(NE_Lakes, n_base = 30, n_over = 5)
print(samp_over)

## End(Not run)

[Package spsurvey version 5.5.1 Index]