preseqR.sample.cov {preseqR}    R Documentation

Predicting generalized sample coverage

Description

preseqR.sample.cov predicts the probability of observing a species represented at least r times in a random sample.

Usage

  preseqR.sample.cov(n, r=1, mt=20)

Arguments

n

A two-column matrix. The first column is the frequency j = 1, 2, ...; the second column is N_j, the number of species represented exactly j times in the initial sample. The first column must be sorted in ascending order.

r

A positive integer: the minimum number of times a species must be represented in the sample. Default is 1.

mt

A positive integer constraining possible rational function approximations. Default is 20.
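
As a minimal sketch of how a matrix in the format expected for n might be built, assume a hypothetical vector counts that holds the number of individuals observed for each species in the initial sample. The vector and its values are illustrative only and are not part of preseqR.

```r
# Hypothetical per-species abundances in an initial sample (illustrative data).
counts <- c(1, 1, 1, 2, 2, 3, 5, 5, 8)

# Tabulate how many species were observed exactly j times.
tab <- table(counts)

# Two-column matrix: frequency j in ascending order, then N_j.
n <- cbind(j = as.numeric(names(tab)), N_j = as.numeric(tab))
n
```

Because table() orders numeric values from smallest to largest, the first column comes out in ascending order as required.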

Details

Suppose a sample is given and one more individual is randomly drawn from the population. preseqR.sample.cov estimates the probability that the species to which this individual belongs has been observed at least r times in the sample. When r = 1, this probability is called the sample coverage.

Let N_j be the number of species represented exactly j times in a sample. The probability of observing a species represented at least r times in the sample is estimated as \sum_{j=r+1}^\infty jN_j / \sum_{j=1}^\infty jN_j. The theory is described by Mao and Lindsay (2002). For a random sample where N_j is unknown, a modified rational function approximation is first used to predict the value of N_j. Then the estimates are substituted to obtain an estimator for the probability of observing a species represented at least r times in the sample.
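
The ratio above can be evaluated directly when N_j is known, that is, on the observed histogram of the initial sample. The following base R sketch does only that; the helper name coverage_plugin and the small histogram are hypothetical, and the code does not reproduce the rational function approximation that preseqR.sample.cov uses for samples of other sizes.

```r
# Hypothetical helper (not part of preseqR): evaluates
# sum_{j >= r+1} j * N_j / sum_{j >= 1} j * N_j on an observed histogram.
coverage_plugin <- function(n, r = 1) {
  j   <- n[, 1]
  N_j <- n[, 2]
  keep <- j >= r + 1
  sum(j[keep] * N_j[keep]) / sum(j * N_j)
}

# Illustrative histogram: 3 species seen once, 2 seen twice,
# 1 seen three times, 2 seen five times, 1 seen eight times.
n <- cbind(c(1, 2, 3, 5, 8), c(3, 2, 1, 2, 1))

coverage_plugin(n, r = 1)  # 25/28, i.e. 1 - N_1 / (total individuals)
coverage_plugin(n, r = 2)  # 0.75
```

For r = 1 the ratio reduces to one minus the proportion of individuals that belong to species seen exactly once.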

This function is a fast version of preseqR.sample.cov.bootstrap that does not provide confidence intervals. To obtain confidence intervals along with the estimates, use preseqR.sample.cov.bootstrap.

Value

The estimator, returned as a function, for the probability of observing a species represented at least r times in a random sample. The input of the estimator is a vector of sampling efforts t, i.e., the sample sizes relative to the initial sample. For example, t = 2 means a random sample that is twice the size of the initial sample.

Author(s)

Chao Deng

References

Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4), 237-264.

Mao, C. X. and Lindsay, B. G. (2002). A Poisson model for the coverage problem with a genomic application. Biometrika, 89(3), 669-682.

Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804v3.

Examples

## load library
library(preseqR)

## import data
data(FisherButterfly)

## construct the estimator for the sample coverage
estimator1 <- preseqR.sample.cov(FisherButterfly, r=1)
## Suppose the sample is 10 or 20 times the size of the initial sample and
## one more individual is drawn at random from the population. The estimator
## returns the probability that the species of this individual has already
## been observed in the sample.
estimator1(c(10, 20))

## construct the estimator for r = 2
estimator2 <- preseqR.sample.cov(FisherButterfly, r=2)
## the probability that the species of a newly drawn individual has been
## observed at least twice when the sample is 50 or 100 times the size
## of the initial sample
estimator2(c(50, 100))

[Package preseqR version 4.0.0 Index]