R: Small Sample Exact Test for Counts in Bins

small_samptest {ptools}

R Documentation

Small Sample Exact Test for Counts in Bins

Description

Exact small-sample test of whether counts of N items across bins are consistent with specified bin probabilities.

Usage

small_samptest(d, p = rep(1/length(d), length(d)), type = "G", cdf = FALSE)

Arguments

`d`	vector of counts, e.g. c(0,2,1,3,1,4,0) for counts of crimes in days of the week
`p`	vector of baseline probabilities, defaults to equal probabilities in each bin
`type`	string specifying "G" for the likelihood ratio G statistic (the default), "V" for Kuiper's test (for circular data), "KS" for the Kolmogorov-Smirnov test, or "Chi" for the Chi-square test
`cdf`	if `FALSE` (the default), generates a new permutation table (using `exactProb`); otherwise pass the final probability dataset (the `CDF` element) created by a previous call

Details

This constructs a null distribution for small sample statistics for N counts in M bins. An example use case is testing whether a repeat offender has a proclivity to commit crimes on a particular day of the week (see the referenced paper). It can also be used for Benford's analysis of leading/trailing digits in small samples. The referenced paper shows that the G test tends to have the most power, although with circular data you may consider Kuiper's test.
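As a rough illustration of the statistics involved, the sketch below computes the G and Chi-square statistics and the exact multinomial probability of a single table of counts in base R. It uses the textbook definitions and does not call ptools, so the package's internal calculations may differ in detail; the variable names are illustrative only.

```r
# Observed counts: N = 7 events in M = 7 bins (days of the week)
d <- c(3, 1, 1, 0, 0, 1, 1)
p <- rep(1 / length(d), length(d))   # equal null probabilities
expected <- sum(d) * p               # expected count per bin under the null

# Likelihood ratio G statistic; empty bins contribute zero to the sum
pos <- d > 0
g_stat <- 2 * sum(d[pos] * log(d[pos] / expected[pos]))

# Pearson Chi-square statistic
chi_stat <- sum((d - expected)^2 / expected)

# Exact multinomial probability of this particular table under the null
prob_obs <- dmultinom(d, prob = p)
```

With so few observations the usual large-sample Chi-square approximation for these statistics is unreliable, which is the motivation for building the exact null distribution instead.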

Value

A small_sampletest object with slots for:

CDF, a dataframe that contains the exact probabilities and test statistic values for every possible permutation
probabilities, the null probabilities you specified
data, the observed counts you specified
test, the type of test conducted (e.g. G, KS, Chi, etc.)
test_stat, the test statistic for the observed data
p_value, the p-value for the observed stat based on the exact null distribution
AggregateStatistics, a reduced-form aggregate table used for the CDF/p-value calculation
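To make the idea of an exact p-value concrete, here is a minimal base R sketch that enumerates every possible table for a tiny case (4 items in 3 bins) and sums the null probabilities of tables at least as extreme as the observed one. This is an assumption about the general approach, not the package's code: the helper `g_fun`, the brute-force enumeration, and the "greater than or equal" tail rule are all illustrative choices.

```r
d <- c(4, 0, 0)            # observed counts: all 4 items in the first bin
p <- rep(1 / 3, 3)         # equal null probabilities
n <- sum(d)

# Enumerate every way to place n items in length(d) bins
grid <- expand.grid(rep(list(0:n), length(d)))
tabs <- as.matrix(grid[rowSums(grid) == n, ])

# Hypothetical helper: likelihood ratio G statistic for one table
g_fun <- function(x, p) {
  e <- sum(x) * p
  pos <- x > 0
  2 * sum(x[pos] * log(x[pos] / e[pos]))
}

probs <- apply(tabs, 1, dmultinom, prob = p)  # exact null probability per table
stats <- apply(tabs, 1, g_fun, p = p)         # test statistic per table
obs   <- g_fun(d, p)

# Exact p-value: total probability of tables at least as extreme as observed
# (small tolerance guards against floating point ties)
p_val <- sum(probs[stats >= obs - 1e-8])
```

Brute-force enumeration like this only scales to very small problems, since the number of candidate rows in `grid` grows as (n+1)^m before filtering.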

If you wish to save the object, you may want to drop the CDF part, as it can be quite large. It will have choose(n+m-1,m-1) rows in total, where m is the number of bins and n is the total count. So if you have 10 crimes in 7 days of the week, the result is a dataframe with choose(7 + 10 - 1,7-1) rows, which is 8008. Currently I keep the CDF part, though, to make it easier to calculate power for a particular test.
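The row count can be checked directly in base R. The helper name `n_tables` below is hypothetical and not part of ptools; the second half confirms the formula by brute-force enumeration on a case small enough to list in full.

```r
# Hypothetical helper: number of ways to place n items in m bins
n_tables <- function(n, m) choose(n + m - 1, m - 1)

# 10 crimes across 7 days of the week
n_rows <- n_tables(n = 10, m = 7)

# Cross-check on a tiny case: 3 items in 3 bins
grid  <- expand.grid(0:3, 0:3, 0:3)
small <- grid[rowSums(grid) == 3, ]
stopifnot(nrow(small) == n_tables(n = 3, m = 3))
```

Because this count grows quickly with both n and m, it is worth computing before running the test on larger inputs.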

References

Nigrini, M. J. (2012). Benford's Law: Applications for forensic accounting, auditing, and fraud detection. John Wiley & Sons.

Wheeler, A. P. (2016). Testing Serial Crime Events for Randomness in Day-of-Week Patterns with Small Samples. Journal of Investigative Psychology and Offender Profiling, 13(2), 148-165.

Examples

# Counts for different days of the week
d <- c(3,1,1,0,0,1,1) #format N observations in M bins
res <- small_samptest(d=d,type="G")
print(res)

# Example for Benford's analysis
f <- 1:9
p_fd <- log10(1 + (1/f)) #first digit probabilities
#check data from Nigrini page 84
checks <- c(1927.48,27902.31,86241.90,72117.46,81321.75,97473.96,
           93249.11,89658.17,87776.89,92105.83,79949.16,87602.93,
           96879.27,91806.47,84991.67,90831.83,93766.67,88338.72,
           94639.49,83709.28,96412.21,88432.86,71552.16)
# To make example run a bit faster
c1 <- checks[1:10]
#extracting the first digits
fd <- substr(format(c1,trim=TRUE),1,1)
tot <- table(factor(fd, levels=paste(f)))
resG <- small_samptest(d=tot,p=p_fd,type="Chi")
resG

#Can reuse the cdf table if you have the same number of observations
c2 <- checks[11:20]
fd2 <- substr(format(c2,trim=TRUE),1,1)
t2 <- table(factor(fd2, levels=paste(f)))
resG2 <- small_samptest(d=t2,p=p_fd,type="Chi",cdf=resG$CDF)

[Package ptools version 2.0.0 Index]