## The Monte Carlo estimate for the p-value of a discrete KS test based on nested bootstrapped samples

### Description

Computes the Monte Carlo estimate for the p-value of a discrete one-sample Kolmogorov-Smirnov (KS) test based on nested bootstrapped samples for Poisson, geometric, negative binomial, beta binomial, beta negative binomial, normal, log normal, half normal, and exponential distributions, as well as their zero-inflated and hurdle versions.

### Usage

kstest.B(x,nsim=200,bootstrap=TRUE,dist="poisson",
r=NULL,p=NULL,alpha1=NULL,alpha2=NULL,n=NULL,lambda=NULL,mean=NULL,sigma=NULL,
lowerbound = 0.01, upperbound = 10000, parallel = FALSE)


### Arguments

• x: A vector of count data. Should be non-negative integers for discrete cases. Random generation for continuous cases.

• nsim: The number of bootstrapped or simulated samples generated to compute the p-value. If it is not an integer, nsim is automatically rounded up to the smallest integer that is no less than nsim. Should be greater than 30. Default is 200.

• bootstrap: Whether to generate bootstrapped samples or not. See Details. 'TRUE' or any non-zero numeric value indicates the generation of bootstrapped samples. Default is 'TRUE'.

• dist: The distribution used as the null hypothesis. Can be one of poisson, geometric, nb, bb, bnb, normal, halfnormal, lognormal, exponential, zip, zigeom, zinb, zibb, zibnb, zinormal, zilognorm, zohalfnorm, ziexp, ph, geomh, nbh, bbh, bnbh, which correspond to Poisson, geometric, negative binomial, negative binomial1, beta binomial, beta binomial1, beta negative binomial, beta negative binomial1, normal, half normal, log normal, and exponential distributions and their zero-inflated as well as hurdle versions, respectively. Default is 'poisson'.

• r: An initial value of the number of successes before which m failures are observed, where m is an element of x. Must be a positive number, but is not required to be an integer.

• p: An initial value of the probability of success. Should be a value within (0,1).

• alpha1: An initial value for the first shape parameter of the beta distribution. Should be a positive number.

• alpha2: An initial value for the second shape parameter of the beta distribution. Should be a positive number.

• n: An initial value of the number of trials. Must be a positive number, but is not required to be an integer.

• lambda: An initial value of the rate. Must be a positive real number.

• mean: An initial value of the mean or expectation.

• sigma: An initial value of the standard deviation. Must be a positive real number.

• lowerbound: A lower search bound used in the optimization of the likelihood function. Should be a small positive number. Default is 1e-2.

• upperbound: An upper search bound used in the optimization of the likelihood function. Should be a large positive number. Default is 1e4.

• parallel: Whether to use multiple threads for parallel computation. Default is FALSE. Be aware that the program may take longer to execute with parallel=FALSE.
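To illustrate how initial values such as r and p and the search bounds might enter a bounded likelihood optimization, here is a minimal sketch for the negative binomial case. The helper `nb_mle` is hypothetical and not part of the package. The optimizer method (`"L-BFGS-B"`), the moment-based fallback starting values, and the fixed bounds on p are all assumptions made for illustration.

```r
# Hypothetical helper (not the package's code): negative binomial MLE
# obtained by minimizing the negative log likelihood within box bounds.
nb_mle <- function(x, r = NULL, p = NULL, lowerbound = 1e-2, upperbound = 1e4) {
  # Naive moment-based starting values when none are supplied (an assumption).
  m <- mean(x)
  v <- var(x)
  if (is.null(p)) p <- if (v > m) m / v else 0.5
  if (is.null(r)) r <- if (v > m) m^2 / (v - m) else 1
  # Keep the starting point inside the search box.
  eps <- 1e-6
  r <- min(max(r, lowerbound), upperbound)
  p <- min(max(p, eps), 1 - eps)
  # x counts failures observed before r successes, matching dnbinom().
  negloglik <- function(par) {
    -sum(dnbinom(x, size = par[1], prob = par[2], log = TRUE))
  }
  fit <- optim(c(r, p), negloglik, method = "L-BFGS-B",
               lower = c(lowerbound, eps), upper = c(upperbound, 1 - eps))
  c(r = fit$par[1], p = fit$par[2])
}

set.seed(1)
x_nb <- rnbinom(500, size = 5, prob = 0.4)
est <- nb_mle(x_nb, r = 2, p = 0.5)
```

The estimate for r always stays within [lowerbound, upperbound], so a bound that is too tight would truncate the estimate rather than raise an error.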

### Details

For the arguments nsim, bootstrap, and dist, only the first element is used if the length is greater than 1. For the other arguments except x, the first valid value is used if the input is not NULL; otherwise, naive sample estimates are fed into the algorithm. Note that only the initial values of the parameters used by the null distribution dist are needed. For example, with dist='poisson', the user should provide a value for lambda and not for the other parameters.

An output p-value less than a user-specified significance level suggests that x probably comes from a distribution other than dist, given the current data. If the p-values of more than one distribution are greater than the pre-specified significance level, the user may consider a subsequent likelihood ratio test to select a 'better' distribution.

The Monte Carlo p-value is computed as follows. When bootstrap=TRUE, nsim bootstrapped samples are generated by re-sampling x without replacement. Otherwise, nsim samples are simulated from the null distribution with the maximum likelihood estimate of the original data x. The maximum likelihood estimates of the nsim bootstrapped or simulated samples are then computed, and based on them nsim new samples are generated under the null distribution. nsim KS statistics are calculated for the nsim new samples, and the Monte Carlo p-value results from comparing these nsim KS statistics with the statistic of the original data x.

During the computation of the maximum likelihood estimates, the negative log likelihood function is minimized via the base R function optim, with the search interval determined by lowerbound and upperbound. Next, i.i.d. samples are simulated from the estimated parameters, and a new MLE is calculated based on the bootstrapped samples. The KS statistic and the p-value are then calculated.

For large sample sizes kstest.A may be used; for small sample sizes (less than 50 or 100), kstest.B is preferred.
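The nested procedure described above can be sketched in base R for the simplest case, a Poisson null. This is an illustration under stated assumptions, not the package's implementation. The function names `ks_stat_pois` and `mc_pvalue_pois` are hypothetical, and the p-value is taken as the proportion of simulated KS statistics at least as large as the observed one. The sketch resamples with replacement, the usual bootstrap convention; this is an assumption and differs from the wording in the text.

```r
# Discrete one-sample KS statistic against a Poisson(lambda) null:
# the largest gap between the empirical and the null CDF on 0..max(x).
ks_stat_pois <- function(x, lambda) {
  k <- 0:max(x)
  max(abs(ecdf(x)(k) - ppois(k, lambda)))
}

# Hypothetical sketch of the nested Monte Carlo p-value for a Poisson null.
mc_pvalue_pois <- function(x, nsim = 200, bootstrap = TRUE) {
  nsim <- ceiling(nsim)                 # round up non-integer nsim
  N <- length(x)
  mle_ori <- mean(x)                    # Poisson MLE of lambda on the original data
  d_obs <- ks_stat_pois(x, mle_ori)     # KS statistic of the original data
  d_sim <- numeric(nsim)
  for (i in seq_len(nsim)) {
    # Outer layer: bootstrapped sample (with replacement, an assumption)
    # or a sample simulated from the null at mle_ori.
    x_b <- if (bootstrap) sample(x, N, replace = TRUE) else rpois(N, mle_ori)
    mle_new <- mean(x_b)
    # Inner layer: new sample under the null at mle_new, refitted before
    # its KS statistic is computed.
    x_c <- rpois(N, mle_new)
    mle_c <- mean(x_c)
    d_sim[i] <- ks_stat_pois(x_c, mle_c)
  }
  mean(d_sim >= d_obs)                  # Monte Carlo p-value
}

set.seed(1)
x_pois <- rpois(100, 3)                               # data from the null family
x_zi <- ifelse(runif(100) < 0.5, 0, rpois(100, 3))    # heavily zero-inflated data
p_null <- mc_pvalue_pois(x_pois, nsim = 100)
p_alt <- mc_pvalue_pois(x_zi, nsim = 100)
```

For the zero-inflated sample, the excess of zeros produces a large observed KS statistic, so the resulting p-value should be small and the Poisson null would be rejected at common significance levels.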

### Value

An object of class 'kstest.A' including the following elements:

• x: x used in computation.

• nsim: nsim used in computation.

• bootstrap: bootstrap used in computation.

• dist: dist used in computation.

• lowerbound: lowerbound used in computation.

• upperbound: upperbound used in computation.

• mle_new: A matrix of the maximum likelihood estimates of unknown parameters under the null distribution, using nsim bootstrapped or simulated samples.

• mle_ori: A row vector of the maximum likelihood estimates of unknown parameters under the null distribution, using the original data x.

• mle_c: A row vector of the maximum likelihood estimates of unknown parameters under the null distribution, using bootstrapped samples with parameters of mle_new.

• pvalue: Monte Carlo p-value of the one-sample KS test.

• N: length of x.

• r: initial value of r used in computation.

• p: initial value of p used in computation.

• alpha1: initial value of alpha1 used in computation.

• alpha2: initial value of alpha2 used in computation.

• lambda: initial value of lambda used in computation.

• n: initial value of n used in computation.

• mean: initial value of mean used in computation.

• sigma: initial value of sigma used in computation.

### References

• H. Aldirawi, J. Yang, A. A. Metwally (2019). Identifying Appropriate Probabilistic Models for Sparse Discrete Omics Data, accepted for publication in 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI).

### Examples

set.seed(008)
kstest.B(x,nsim=100,bootstrap = TRUE,dist= 'zinb')$pvalue #0.01
kstest.B(x,nsim=100,bootstrap = TRUE,dist= 'zibb')$pvalue   #0.02
kstest.B(x,nsim=100,bootstrap = TRUE,dist= 'zibnb')$pvalue #0.67
x2=sample.h1(2000,phi=0.3,dist="halfnormal",sigma=4)
kstest.B(x2,nsim=100,bootstrap = TRUE,dist= 'halfnormh')$pvalue   #0.73