R: Fraction of k-mers observed at least r times

kmer.frac.curve {preseqR}

R Documentation

Fraction of `k`-mers observed at least `r` times

Description

kmer.frac.curve predicts the expected fraction of k-mers observed at least r times in a high-throughput sequencing experiment, given the amount of sequencing.

Usage

  kmer.frac.curve(n, k, read.len, seq, r=2, mt=20)

Arguments

`n`	A two-column matrix. The first column is the frequency `j = 1, 2, ...`; the second column is `N_j`, the number of `k`-mers observed exactly `j` times in the initial experiment. The first column must be sorted in ascending order.
`k`	The number of nucleotides in a `k`-mer.
`read.len`	The average length of a read.
`seq`	The number of nucleotides sequenced. A vector of values may be supplied.
`r`	A positive integer; the minimum number of times a `k`-mer must be observed. Default is 2.
`mt`	A positive integer constraining possible rational function approximations. Default is 20.
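
The layout expected for `n` can be illustrated with a small sketch in base R. It assumes the starting point is a plain vector of per-k-mer counts; the vector `kmer_counts` and its values are hypothetical and are not part of preseqR.

```r
# Hypothetical toy data: the number of times each distinct k-mer was observed.
kmer_counts <- c(1, 1, 1, 2, 2, 3, 5, 5, 1, 2)

# Tabulate N_j, the number of k-mers observed exactly j times.
# table() orders the distinct frequencies in ascending order.
tab <- table(kmer_counts)

# Two-column matrix: frequency j in column 1, N_j in column 2.
n <- cbind(j = as.numeric(names(tab)), N_j = as.numeric(tab))

# Sanity check: the first column must be sorted in ascending order.
stopifnot(!is.unsorted(n[, 1]))
```

Frequencies that never occur (here `j = 4`) simply have no row in this sketch.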

Details

kmer.frac.curve is mainly designed for metagenomics, to evaluate how saturated a metagenomic data set is.

kmer.frac.curve is the fast version of kmer.frac.curve.bootstrap and does not provide a confidence interval. To obtain confidence intervals along with the estimates, use kmer.frac.curve.bootstrap.

Value

A two-column matrix. The first column is the amount of sequencing in an experiment. The second column is the estimate of the fraction of k-mers observed at least r times in the experiment.
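
One way to read such a matrix is to look up the smallest sequencing amount at which the estimated fraction reaches a target. The sketch below uses base R only; the matrix `curve`, its column names, and the threshold `target` are hypothetical, and the numbers are made up for illustration rather than produced by kmer.frac.curve.

```r
# Hypothetical matrix with the same layout as the return value:
# column 1 is the amount of sequencing, column 2 the estimated fraction.
# These values are invented for illustration only.
curve <- cbind(sequenced = 10^(6:12),
               fraction  = c(0.01, 0.05, 0.20, 0.55, 0.85, 0.97, 0.99))

# Hypothetical target fraction of k-mers observed at least r times.
target <- 0.8

# First row whose estimated fraction reaches the target, or NA if none does.
reached <- curve[, 2] >= target
min_seq <- if (any(reached)) unname(curve[which(reached)[1], 1]) else NA
```

Because the lookup only considers the rows supplied, its resolution is limited by how finely the sequencing amounts are spaced.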

Author(s)

Chao Deng

References

Deng, C and Smith, AD (2016). Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804.

Examples

## load library
library(preseqR)

## import data
data(SRR061157_k31)

## the fraction of 31-mers represented at least 10 times in an experiment when
## sequencing 1M, 10M, 100M, 1G, 10G, 100G, 1T nucleotides
kmer.frac.curve(n=SRR061157_k31, k=31, read.len=100, seq=10^(6:12), r=10, mt=20)

[Package preseqR version 4.0.0 Index]