dfba_mann_whitney {DFBA} | R Documentation |
Independent Samples Test (Mann Whitney U)
Description
Given two independent vectors E and C, the function computes the sample
Mann-Whitney U statistics U_E and U_C and provides a Bayesian analysis for
the population parameter omega_E, which is the population ratio of
U_E/(U_E+U_C).
Usage
dfba_mann_whitney(
E,
C,
a0 = 1,
b0 = 1,
prob_interval = 0.95,
samples = 30000,
method = NULL,
hide_progress = FALSE
)
Arguments
E |
Data for independent sample 1 ("Experimental") |
C |
Data for independent sample 2 ("Control") |
a0 |
The first shape parameter for the prior beta distribution for omega_E |
b0 |
The second shape parameter for the prior beta distribution for omega_E |
prob_interval |
Desired probability value for the interval estimate for omega_E |
samples |
The number of Monte Carlo samples for each candidate value of omega_E |
method |
(Optional) The method option is either "small" or "large". The "small" algorithm is based on a discrete Monte Carlo solution for cases where n is typically less than 20. The "large" algorithm is based on a beta approximation model for the posterior distribution of the omega_E parameter. This approximation is reasonable when n > 19. Regardless of sample size, the user can stipulate which method to use |
hide_progress |
(Optional) If |
Details
The Mann-Whitney U test is the frequentist nonparametric counterpart
to the independent-groups t-test. The sample U_E statistic is
the number of times that the E variate is larger than the
C variate, whereas U_C is the converse number.
This test uses only rank information, so it is robust with respect to
outliers, and it does not depend on the assumption of a normal model for the
variates. The Bayesian version for the Mann-Whitney is focused on the
population parameter omega_E, which is the population ratio
U_E/(U_E+U_C).
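As a minimal sketch of the counting definition above (not the package's internal code), the sample statistics can be obtained in base R by comparing every E observation with every C observation. The helper name u_stats is hypothetical, and tied pairs are simply not counted here, which is an assumption rather than a documented rule of dfba_mann_whitney.

```r
# Hypothetical helper: count pairwise "wins" for each group.
u_stats <- function(E, C) {
  U_E <- sum(outer(E, C, ">"))  # pairs in which the E value is larger
  U_C <- sum(outer(E, C, "<"))  # pairs in which the C value is larger
  list(U_E = U_E, U_C = U_C, ratio = U_E / (U_E + U_C))
}

# Small illustrative vectors (not taken from the package examples)
res <- u_stats(E = c(5, 7, 9), C = c(4, 6, 8))
res$U_E    # 6
res$U_C    # 3
res$ratio  # sample analogue of omega_E, here 2/3
```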
While the frequentist test effectively assumes the sharp null hypothesis that
omega_E is 0.5, the Bayesian analysis has a prior and posterior
distribution for omega_E on the [0, 1] interval. The prior is a beta
distribution with shape parameters a0 and b0. The default is
the flat prior (a0 = b0 = 1), but this prior can be altered by the
user.
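To illustrate how the choice of a0 and b0 shapes the prior, the sketch below uses base R's pbeta to compute the prior probability that omega_E exceeds 0.5. The helper name prior_pr_H1 is hypothetical; it is meant only to show the arithmetic, not to reproduce the package's code.

```r
# Hypothetical helper: prior probability that omega_E > 0.5 under Beta(a0, b0).
prior_pr_H1 <- function(a0 = 1, b0 = 1) {
  1 - pbeta(0.5, shape1 = a0, shape2 = b0)
}

prior_pr_H1(1, 1)      # flat prior: 0.5
prior_pr_H1(0.5, 0.5)  # Jeffreys prior: also 0.5, since the prior is symmetric
prior_pr_H1(2, 1)      # a prior that favours larger omega_E: 0.75
```

Any prior with a0 = b0 is symmetric about 0.5, so it assigns equal prior probability to omega_E being above or below 0.5.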
The prob_interval input is the probability value for the interval estimates
of omega_E. There are two cases depending on the sample size for the E
and C variates. When the sample sizes are small, a discrete
approximation method is used. In this case, the Bayesian analysis considers 200
discrete values for omega_E from 0.0025 to 0.9975 in steps of 0.005. For
each discrete value, a prior and a posterior probability are obtained. The
posterior probabilities are based on Monte Carlo sampling to approximate the
likelihood of obtaining the observed U_E and U_C values for each candidate
value for omega_E. For each candidate value for omega_E, the likelihood for
the observed sample U statistics does not depend on the true distributions of
the E and C variates in the population. Consequently, for each candidate
omega_E, the software constructs two exponential variates that have
the same omega_E value. The argument samples specifies the number of
Monte Carlo samples used for each candidate value of omega_E.
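The small-sample procedure described above can be sketched in base R as follows. This is an illustrative approximation under stated assumptions, not the package's implementation: the helper name discrete_posterior_sketch is hypothetical, the exponential rates (1 - omega for E and omega for C) are one convenient choice that gives P(E > C) = omega, and the default number of Monte Carlo samples is kept far below the package default so the sketch runs quickly. With continuous variates there are no ties, so matching the observed U_E also matches U_C.

```r
# Hypothetical sketch of a discrete Monte Carlo posterior for omega_E.
discrete_posterior_sketch <- function(u_obs, n_E, n_C, a0 = 1, b0 = 1,
                                      samples = 200) {
  # 200 candidate values: 0.0025, 0.0075, ..., 0.9975
  omega_grid <- (seq_len(200) - 0.5) / 200
  # Prior mass of the beta(a0, b0) prior in the bin around each candidate
  prior <- pbeta(omega_grid + 0.0025, a0, b0) -
    pbeta(omega_grid - 0.0025, a0, b0)
  # Monte Carlo estimate of P(U_E = u_obs | omega) for each candidate
  likelihood <- vapply(omega_grid, function(omega) {
    hits <- 0
    for (i in seq_len(samples)) {
      e_sim <- rexp(n_E, rate = 1 - omega)  # assumed rate choice
      c_sim <- rexp(n_C, rate = omega)      # gives P(E > C) = omega
      if (sum(outer(e_sim, c_sim, ">")) == u_obs) hits <- hits + 1
    }
    hits / samples
  }, numeric(1))
  unnorm <- prior * likelihood
  list(omega_E = omega_grid, prior = prior, posterior = unnorm / sum(unnorm))
}

set.seed(1)
post <- discrete_posterior_sketch(u_obs = 12, n_E = 4, n_C = 4)
sum(post$omega_E * post$posterior)  # Monte Carlo posterior mean of omega_E
```

Because the likelihood is estimated by simulation, the posterior varies from run to run unless a seed is set, and larger values of samples reduce that variability at the cost of runtime.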
For large sample sizes of the E and C variates,
the Bayesian posterior distribution is closely approximated by a beta
distribution where the shape parameters are a function of the sample
U_E and U_C statistics. The large-sample beta approximation was
developed from extensive previous empirical studies designed to approximate
the quantiles of the discrete approach with the corresponding quantiles for a
particular beta distribution. The large-n solution also uses Lagrange
polynomials for interpolation. The large-n approximation is reasonably
accurate when n > 19 for each condition. When the method input
is omitted, the function selects the appropriate procedure (i.e.,
either the discrete case for a small sample size or the large-n
approach). Nonetheless, the user can stipulate which method they desire
regardless of sample size by inputting either method="small" or
method="large". The large-n solution is rapid compared
to the small-sample solution, so care should be exercised before choosing
method="small" for large sample sizes.
Technical details of the analysis are explained in the Chechile (2020) Communications in Statistics paper cited below.
Value
A list containing the following components:
Emean |
Mean of the independent sample 1 ("Experimental") data |
Cmean |
Mean of the independent sample 2 ("Control") data |
n_E |
Number of observations of the independent sample 1 ("Experimental") data |
n_C |
Number of observations of the independent sample 2 ("Control") data |
U_E |
Total number of comparisons for which observations from independent sample 1 ("Experimental") data exceed observations from independent sample 2 ("Control") data |
U_C |
Total number of comparisons for which observations from independent sample 2 ("Control") data exceed observations from independent sample 1 ("Experimental") data |
prob_interval |
User-defined width of the interval estimate for omega_E |
a0 |
First shape parameter for the prior beta distribution |
b0 |
Second shape parameter for the prior beta distribution |
a_post |
First shape parameter for the posterior beta distribution |
b_post |
Second shape parameter for the posterior beta distribution |
samples |
The number of desired Monte Carlo samples (default is 30000) |
method |
A character string indicating the calculation method used |
omega_E |
A vector of values representing candidate values for omega_E |
omegapost |
A vector of values representing discrete probabilities for candidate values of omega_E |
priorvector |
A vector of values representing prior discrete probabilities of candidate values of omega_E |
priorprH1 |
Prior probability of the alternative model that omega_E exceeds 0.5 |
prH1 |
Posterior probability of the alternative model that omega_E exceeds 0.5 |
BF10 |
Bayes Factor describing the relative increase in the posterior odds for the alternative model that omega_E exceeds 0.5 |
omegabar |
Posterior mean estimate for omega_E |
eti_lower |
Lower limit of the equal-tail probability interval for omega_E |
eti_upper |
Upper limit of the equal-tail probability interval for omega_E |
hdi_lower |
Lower limit of the highest-density probability interval for omega_E |
hdi_upper |
Upper limit of the highest-density probability interval for omega_E |
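For a beta posterior, several of the components listed above can be related to the shape parameters with base R's beta functions. The sketch below is illustrative only: the helper name summarise_beta_posterior and the shape values 15 and 5 are hypothetical, and the Bayes factor is written as posterior odds divided by prior odds, which is assumed to match the description of BF10 given here.

```r
# Hypothetical helper: summaries of a Beta(a_post, b_post) posterior.
summarise_beta_posterior <- function(a_post, b_post, a0 = 1, b0 = 1,
                                     prob_interval = 0.95) {
  tail_area <- (1 - prob_interval) / 2
  prior_h1 <- 1 - pbeta(0.5, a0, b0)          # prior P(omega_E > 0.5)
  post_h1  <- 1 - pbeta(0.5, a_post, b_post)  # posterior P(omega_E > 0.5)
  list(
    omegabar  = a_post / (a_post + b_post),   # posterior mean
    eti_lower = qbeta(tail_area, a_post, b_post),
    eti_upper = qbeta(1 - tail_area, a_post, b_post),
    prH1      = post_h1,
    BF10      = (post_h1 / (1 - post_h1)) / (prior_h1 / (1 - prior_h1))
  )
}

# Illustrative shape values, not output from dfba_mann_whitney
s <- summarise_beta_posterior(a_post = 15, b_post = 5)
s$omegabar  # 0.75
```

With a symmetric prior the prior odds equal 1, so BF10 reduces to the posterior odds in favour of omega_E > 0.5.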
References
Chechile, R.A. (2020). Bayesian Statistics for Experimental Scientists: A General Introduction Using Distribution-Free Methods. Cambridge: MIT Press.
Chechile, R.A. (2020). A Bayesian analysis for the Mann-Whitney statistic. Communications in Statistics – Theory and Methods 49(3): 670-696. https://doi.org/10.1080/03610926.2018.1549247.
Examples
# Note: examples with method = "small" have long runtimes due to Monte Carlo
# sampling; please feel free to run them in the console.
# Examples with large n per group
# The data for each condition are presorted only for the user's convenience
# when checking the U statistics by hand
groupA <- c(43, 45, 47, 50, 54, 58, 60, 63, 69, 84, 85, 91, 99, 127, 130,
147, 165, 175, 193, 228, 252, 276)
groupB <- c(0, 1, 2, 3, 5, 14, 15, 23, 23, 25, 27, 32, 57, 105, 115, 158,
161, 181, 203, 290)
dfba_mann_whitney(E = groupA,
C = groupB)
# The following uses a Jeffreys prior instead of a default flat prior:
dfba_mann_whitney(E = groupA,
C = groupB,
a0 = .5,
b0 = .5)
# The following also uses a Jeffreys prior but the analysis reverses the
# variates:
dfba_mann_whitney(E = groupB,
C = groupA,
a0 = .5,
b0 = .5)
# Note that BF10 from the above analysis is 1/BF10 from the original order
# of the variates.
# The next analysis constructs 99% interval estimates with the Jeffreys
# prior.
AB <- dfba_mann_whitney(E = groupA,
C = groupB,
a0 = .5,
b0 = .5,
prob_interval = .99)
AB
# Plot with prior and posterior curves
plot(AB)
# Plot with posterior curve only
plot(AB,
plot.prior = FALSE)
# Example with small n per group
groupC <- c(96.49, 96.78, 97.26, 98.85, 99.75, 100.14, 101.15, 101.39,
102.58, 107.22, 107.70, 113.26)
groupD <- c(101.16, 102.09, 103.14, 104.70, 105.27, 108.22, 108.32, 108.51,
109.88, 110.32, 110.55, 113.42)
dfba_mann_whitney(E = groupC,
C = groupD,
samples = 250,
hide_progress = TRUE)