dis.kstest {iZID} | R Documentation |
The Monte Carlo estimate for the p-value of a discrete KS Test
Description
Computes the Monte Carlo estimate for the p-value of a discrete one-sample Kolmogorov-Smirnov (KS) Test for Poisson, negative binomial, beta binomial, beta negative binomial distributions and their zero-inflated as well as hurdle versions.
Usage
dis.kstest(x, nsim = 100, bootstrap = TRUE, distri = "Poisson",
r = NULL, p = NULL, alpha1 = NULL, alpha2 = NULL, n = NULL,
lowerbound = 0.01, upperbound = 10000, parallel = FALSE)
Arguments
x |
A vector of count data. Should be non-negative integers. If elements of x are not integers, they will be automatically rounded up to the smallest integers that are no less than themselves. |
nsim |
The number of bootstrapped samples or simulated samples generated to compute p-value. If it is not an integer, nsim will be automatically rounded up to the smallest integer that is no less than nsim. Should be greater than 30. Default is 100. |
bootstrap |
Whether to generate bootstrapped samples or not. See Details. 'TRUE' or any numeric non-zero value indicates the generation of bootstrapped samples. The default is 'TRUE'. |
distri |
The distribution used as the null hypothesis. Can be one of {'poisson','nb','bb', 'bnb','zip','zinb','zibb', zibnb','ph','nbh','bbh','bnbh'}, which corresponds to Poisson, negative binomial, beta binomial and beta negative binomial distributions and their zero-inflated as well as hurdle versions, respectively. Default is 'Poisson'. |
r |
An initial value of the number of success before which m failures are observed, where m is the element of x. Must be a positive number, but not required to be an integer. |
p |
An initial value of the probability of success, should be a positive value within (0,1). |
alpha1 |
An initial value for the first shape parameter of beta distribution. Should be a positive number. |
alpha2 |
An initial value for the second shape parameter of beta distribution. Should be a positive number. |
n |
An initial value of the number of trials. Must be a positive number, but not required to be an integer. |
lowerbound |
A lower searching bound used in the optimization of likelihood function. Should be a small positive number. The default is 1e-2. |
upperbound |
An upper searching bound used in the optimization of likelihood function. Should be a large positive number. The default is 1e4. |
parallel |
whether to use multiple threads to parallelize computation. Default is FALSE. Please aware that it may take
longer time to execute the program with |
Details
For arguments nsim
, bootstrap
, distri
, if the length is larger than 1, only the first element will be used.
For other arguments except for x
, the first valid value will be used if the input is not NULL
, otherwise some
naive sample estimates will be fed into the algorithm. Note that only the initial values that are occurred in the null
distribution distri
are needed. For example, with distri=poisson
, user may provide a value for lambda
but
not for r
or p
, though it won't disturb the algorithm.
With an output p-value less than some user-specified significance level, x
is very likely from a distribution other
than the distri
, given the current data. If p-values of more than one distributions are greater than the pre-specified
significance level, user may consider a following likelihood ratio test to select a 'better' distribution.
The methodology of computing Monte Carlo p-value is taken from Aldirawi et al. (2019). With bootstrap=TRUE
, nsim
bootstrapped samples will be generated by resampling x
without replacement. Otherwise, nsim
samples are
simulated from the null distribution with the maximum likelihood estimate of original data x
. Then compute the maximum
likelihood estimates of nsim
bootstrapped or simulated samples, based on which nsim
new samples are generated
under the null distribution. nsim
KS statistics are calculated for the nsim
new samples, then the Monte Carlo
p-value is resulted from comparing the nsim
KS statistics and the statistic of original data x
.
During the process of computing maximum likelihood estimates, the negative log likelihood function is minimized via basic R
function optim
with the searching interval decided by lowerbound
and upperbound
, except that the optimization
of p
takes 1-lowerbound
as the upper searching bound.
To accelerate the whole process, the algorithm uses the parallel strategy via the packages foreach
and
doParallel
.
Value
An object of class 'dis.kstest' including the following elements:
x:
x
used in computation.nsim: nsim used in computation.
bootstrap: bootstrap used in computation.
distri: distri used in computation..
lowerbound: lowerbound used in computation.
upperbound: upperboound used in computation.
mle_new: A matrix of the maximum likelihood estimates of unknown parameters under the null distribution, using
nsim
bootstrapped or simulated samples.mle_ori: A row vector of the maximum likelihood estimates of unknown parameters under the null distribution, using the original data
x
.pvalue: Monte Carlo p-value of the one-sample KS test.
N: length of
x
.r: initial value of r used in computation.
p: initial value of p used in computation.
alpha1: initial value of alpha1 used in computation.
alpha2: initial value of alpha2 used in computation.
n: initial value of n used in computation.
Reference
H. Aldirawi, J. Yang, A. A. Metwally (2019). Identifying Appropriate Probabilistic Models for Sparse Discrete Omics Data, accepted for publication in 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI).
T. Wolodzko (2019). extraDistr: Additional Univariate and Multivariate Distributions, R package version 1.8.11, https://CRAN.R-project.org/package=extraDistr.
R. Calaway, Microsoft Corporation, S. Weston, D. Tenenbaum (2017). doParallel: Foreach Parallel Adaptor for the 'parallel' Package, R package version 1.0.11, https://CRAN.R-project.org/package=doParallel.
R. Calaway, Microsoft, S. Weston (2017). foreach: Provides Foreach Looping Construct for R, R package version 1.4.4, https://CRAN.R-project.org/package=foreach.
See Also
Examples
set.seed(2001)
temp1=sample.zi(N=300,phi=0.3,distri='poisson',lambda=5)
dis.kstest(temp1,nsim=100,bootstrap=TRUE,distri='Poisson')$pvalue
dis.kstest(temp1,nsim=100,bootstrap=TRUE,distri='nb')$pvalue
dis.kstest(temp1,nsim=100,bootstrap=TRUE,distri='zip')$pvalue
dis.kstest(temp1,nsim=100,bootstrap=TRUE,distri='zinb')$pvalue