testDistribution {JWileymisc} | R Documentation
Test the distribution of a variable against a specific distribution
Description
Function designed to help examine distributions. It also includes an option for assessing multivariate normality using the (squared) Mahalanobis distance. A generic function, several methods, a constructor (as.testDistribution), and a function to check the class (is.testDistribution) are also provided.
Note that for the use argument, several options are possible. By default it is “complete.obs”, which uses only cases with complete data on all variables. Another option is “pairwise.complete.obs”, which uses all available data for each variable individually to estimate the means and variances, and all pairwise complete observation pairs for each covariance. Because the same cases are not used for all estimates, it is possible to obtain a covariance matrix that is not positive definite (e.g., correlations > +1 or < -1).
Finally, the last option is “fiml”, which uses full information maximum likelihood estimates of the means and covariance matrix. Depending on the number of cases, missing data patterns, and variables, this may be quite slow and computationally demanding.
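For concreteness, the following is a minimal illustrative sketch (not taken from the package's own examples) contrasting the three use options on a small simulated data frame with missing values; the variable names and the amount of missingness are arbitrary, and the “fiml” call may be noticeably slower.
library(JWileymisc)
set.seed(1234)
dmiss <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
dmiss$X2 <- dmiss$X2 + 0.5 * dmiss$X1  ## make the variables correlated
dmiss$X1[sample(100, 10)] <- NA        ## introduce some missing values
## complete cases only (the default)
testDistribution(dmiss, "mvnormal", use = "complete.obs")
## all pairwise complete observations
testDistribution(dmiss, "mvnormal", use = "pairwise.complete.obs")
## full information maximum likelihood (may be slow)
testDistribution(dmiss, "mvnormal", use = "fiml")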
The robust argument determines whether to use robust estimates or not when calculating densities, etc. By default it is FALSE, but if TRUE and a univariate or multivariate normal distribution is tested, then robust estimates of the means and covariance matrix (a variance if univariate) will be used, based on covMcd from the robustbase package.
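As a brief illustration (a sketch assuming robustbase is installed; this is not the package's internal code), the robust location and scatter estimates that robust = TRUE relies on can be computed directly with covMcd and compared against the classical estimates:
library(robustbase)
set.seed(1234)
X <- rbind(cbind(rnorm(195), rnorm(195)),
  matrix(8, nrow = 5, ncol = 2))  ## append a few multivariate outliers
colMeans(X)
cov(X)          ## classical estimates, pulled toward the outliers
mcd <- covMcd(X)
mcd$center
mcd$cov         ## robust MCD estimates, largely unaffected by the outliers
## compare with: testDistribution(as.data.frame(X), "mvnormal", robust = TRUE)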
Usage
testDistribution(x, ...)
as.testDistribution(x)
is.testDistribution(x)
## Default S3 method:
testDistribution(
  x,
  distr = c("normal", "beta", "chisq", "f", "gamma", "geometric", "nbinom", "poisson",
    "uniform", "mvnormal"),
  na.rm = TRUE,
  starts,
  extremevalues = c("no", "theoretical", "empirical"),
  ev.perc = 0.005,
  use = c("complete.obs", "pairwise.complete.obs", "fiml"),
  robust = FALSE,
  ...
)
Arguments
x
The data as a single variable or vector to check the distribution, unless the distribution is “mvnormal”, in which case it should be a data frame or data table.
...
Additional arguments. If these include mu and sigma and the distribution is multivariate normal, then it will use the passed values instead of calculating the mean and covariances of the data.
distr
A character string indicating the distribution to be tested. Currently one of: “normal”, “beta”, “chisq” (chi-squared), “f”, “gamma”, “geometric”, “nbinom” (negative binomial), “poisson”, “uniform”, or “mvnormal” for multivariate normal, where squared Mahalanobis distances are calculated and compared against a chi-squared distribution with degrees of freedom equal to the number of variables (see the sketch following this table).
na.rm
A logical value whether to omit missing values. Defaults to TRUE.
starts
A named list of the starting values. Not required for all distributions. Passed on to the maximum likelihood fitting routine used to estimate the distribution parameters.
extremevalues
A character vector whether to indicate extreme values. Should be “no” to do nothing, “empirical” to show extreme values based on the observed data percentiles, or “theoretical” to show extreme values based on percentiles of the theoretical distribution.
ev.perc
Percentile to use for extreme values. For example, if .01, then the lowest 1 percent and highest 1 percent will be labelled extreme values. Defaults to 0.005, i.e., the lowest and highest 0.5 percent.
use
A character vector indicating how the moments (means and covariance matrix) should be estimated in the presence of missing data when distr = “mvnormal”. One of “complete.obs”, “pairwise.complete.obs”, or “fiml”; see the Description for details on each option.
robust
A logical whether to use robust estimation or not. Currently only applies to normally distributed data (univariate or multivariate). When TRUE, robust estimates of the means and covariance matrix (a variance if univariate) are used, based on covMcd from the robustbase package.
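The sketch below (illustrative only, not package code) shows the squared Mahalanobis distance comparison described for distr = “mvnormal” above; the cutoff approximates what extremevalues = “theoretical” with ev.perc = .1 would flag on the right tail.
X <- as.matrix(mtcars)
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
k <- ncol(X)  ## degrees of freedom = number of variables
## approximate right-tail cutoff for ev.perc = .1
cutoff <- qchisq(1 - .1, df = k)
rownames(mtcars)[d2 > cutoff]  ## candidate multivariate extreme values
## quantile-quantile comparison against the chi-squared distribution
qqplot(qchisq(ppoints(nrow(X)), df = k), d2,
  xlab = "Theoretical chi-squared quantiles",
  ylab = "Squared Mahalanobis distance")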
Value
For is.testDistribution, a logical whether or not an object is of class testDistribution. For as.testDistribution, an object of class testDistribution. For testDistribution, a list with information about the distribution: the parameter estimates, the distribution name, the log likelihood (useful for comparing the fit of different distributions to the data), and a dataset with the sorted data and theoretical quantiles.
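For example (a brief sketch assuming JWileymisc is loaded):
td <- testDistribution(rnorm(100), "normal")
is.testDistribution(td)  ## TRUE
td$Distribution$LL       ## log likelihood, as also used in the Examples below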
Examples
## example data
set.seed(1234)
d <- data.table::data.table(
  Ynorm = rnorm(200),
  Ybeta = rbeta(200, 1, 4),
  Ychisq = rchisq(200, 8),
  Yf = rf(200, 5, 10),
  Ygamma = rgamma(200, 2, 2),
  Ynbinom = rnbinom(200, mu = 4, size = 9),
  Ypois = rpois(200, 4))
## testing and graphing
testDistribution(d$Ybeta, "beta", starts = list(shape1 = 1, shape2 = 4))
testDistribution(d$Ychisq, "chisq", starts = list(df = 8))
## for chi-square distribution, extreme values only on
## the right tail
testDistribution(d$Ychisq, "chisq", starts = list(df = 8),
  extremevalues = "empirical", ev.perc = .1)
testDistribution(d$Ychisq, "chisq", starts = list(df = 8),
  extremevalues = "theoretical", ev.perc = .1)
## Not run:
testDistribution(d$Yf, "uniform")
testDistribution(d$Ypois, "geometric")
testDistribution(d$Yf, "f", starts = list(df1 = 5, df2 = 10))
testDistribution(d$Ygamma, "gamma")
testDistribution(d$Ynbinom, "poisson")
testDistribution(d$Ynbinom, "nbinom")
testDistribution(d$Ypois, "poisson")
## compare log likelihood of two different distributions
testDistribution(d$Ygamma, "normal")$Distribution$LL
testDistribution(d$Ygamma, "gamma")$Distribution$LL
testDistribution(d$Ynorm, "normal")
testDistribution(c(d$Ynorm, 10, 1000), "normal",
  extremevalues = "theoretical")
testDistribution(c(d$Ynorm, 10, 1000), "normal",
  extremevalues = "theoretical", robust = TRUE)
testDistribution(mtcars, "mvnormal")
## for multivariate normal mahalanobis distance
## which follows a chi-square distribution, extreme values only on
## the right tail
testDistribution(mtcars, "mvnormal", extremevalues = "empirical",
ev.perc = .1)
testDistribution(mtcars, "mvnormal", extremevalues = "theoretical",
ev.perc = .1)
rm(d) ## cleanup
## End(Not run)