dfba_bivariate_concordance {DFBA} | R Documentation |
Bayesian Distribution-Free Correlation and Concordance
Description
Given bivariate data, computes the sample number of concordant changes nc
between the two variates and the number of discordant changes nd. Provides
the frequentist tau_A correlation coefficient (nc-nd)/(nc+nd), and provides
a Bayesian analysis of the population concordance parameter phi: the limit
of the proportion of concordant changes between the variates.
For goodness-of-fit applications, provides a concordance measure that
corrects for the number of fitting parameters.
Usage
dfba_bivariate_concordance(
x,
y,
a0 = 1,
b0 = 1,
prob_interval = 0.95,
fitting.parameters = NULL
)
Arguments
x |
Vector of x variable values |
y |
Vector of y variable values |
a0 |
First shape parameter for the prior beta distribution (default is 1) |
b0 |
Second shape parameter for the prior beta distribution (default is 1) |
prob_interval |
Desired probability within interval estimates (default is 0.95) |
fitting.parameters |
(Optional) If either x or y values are generated by a predictive model, the number of free parameters in the model (default is NULL) |
Details
The product-moment correlation depends on Gaussian assumptions about the
residuals in a regression analysis. It is not robust because it is strongly
influenced by any extreme outlier scores for either of the two variates. A
rank-based analysis can avoid both of these limitations. The dfba_bivariate_concordance()
function is focused on a nonparametric concordance metric for characterizing
the association between the two bivariate measures.
To illustrate the nonparametric concepts of concordance and discordance,
consider a specific example where there are five paired scores with
x = {3.8, 4.7, 4.7, 4.7, 11.8} and y = {5.9, -4.1, 7.3, 7.3, 38.9}.
The ranks for the x variate are 1, 3, 3, 3, 5 and the corresponding ranks
for y are 2, 1, 3.5, 3.5, 5, so the five points in terms of their ranks are
P_1 = (1, 2), P_2 = (3, 1), P_3 = (3, 3.5), P_4 = (3, 3.5), and P_5 = (5, 5).
The relationship between any two of these points P_i and P_j is either
(1) concordant if the sign of R_{xi} - R_{xj} is the same as the sign of
R_{yi} - R_{yj}, (2) discordant if the signs of R_{xi} - R_{xj} and
R_{yi} - R_{yj} differ, or (3) null if either R_{xi} = R_{xj} or
R_{yi} = R_{yj}. For the above example, there are ten possible comparisons
among the five points; six are concordant, one is discordant, and three
comparisons are lost due to ties. In general, given n bivariate scores there
are n(n-1)/2 total possible comparisons. Ties in the x variate cause a loss
of T_x comparisons, and ties in the y variate cause a loss of T_y
comparisons. Comparisons that are tied in both x and y are denoted T_{xy}.
The total number of possible comparisons, accounting for ties, is therefore:
n(n-1)/2 - T_x - T_y + T_{xy},
where T_{xy} is added to avoid double-counting of lost comparisons.
In the above example, there are three lost comparisons due to ties in x,
one lost comparison due to a tie in y, and one comparison lost to a tie
in both the x and y variates. Thus, there are [(5*4)/2]-3-1+1 = 7 usable
comparisons for the above example (six concordant plus one discordant).
The \tau_A correlation is defined as (n_c-n_d)/(n_c+n_d), which is a value
on the [-1,1] interval. It is important to note, however, that the original
developer of the frequentist \tau correlation used a different coefficient
that has come to be called \tau_B, which is given as
(n_c-n_d)/([(n(n-1)/2)-T_x][(n(n-1)/2)-T_y])^{.5}.
\tau_B does not properly correct for tied scores, which is unfortunate
because \tau_B is the value returned by the stats function
cor(..., method = "kendall"). If there are no ties, then
T_x = T_y = T_{xy} = 0 and \tau_A = \tau_B. But if there are ties, then the
proper coefficient is given by \tau_A. The dfba_bivariate_concordance()
function provides the proper correction for tied scores and outputs a sample
estimate for \tau_A.
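The counts in the five-point example can be checked with a few lines of base R. This is only an illustrative sketch, not the package's internal code: it enumerates all pairs by brute force, and the variable names (pairs, dx, dy, usable) are chosen here for illustration. It relies on the fact that the sign of a difference in raw scores matches the sign of the corresponding rank difference.

```r
# Five paired scores from the example above
x <- c(3.8, 4.7, 4.7, 4.7, 11.8)
y <- c(5.9, -4.1, 7.3, 7.3, 38.9)

rank(x)  # 1.0 3.0 3.0 3.0 5.0 (tied scores share the average rank)
rank(y)  # 2.0 1.0 3.5 3.5 5.0

n <- length(x)
pairs <- combn(n, 2)  # all n(n-1)/2 = 10 index pairs with i < j

# Sign of the x difference and of the y difference for each pair
dx <- sign(x[pairs[1, ]] - x[pairs[2, ]])
dy <- sign(y[pairs[1, ]] - y[pairs[2, ]])

nc  <- sum(dx * dy > 0)        # concordant pairs: 6
nd  <- sum(dx * dy < 0)        # discordant pairs: 1
Tx  <- sum(dx == 0)            # pairs tied in x: 3
Ty  <- sum(dy == 0)            # pairs tied in y: 1
Txy <- sum(dx == 0 & dy == 0)  # pairs tied in both: 1

usable <- n * (n - 1) / 2 - Tx - Ty + Txy  # 7, which equals nc + nd

tau_A <- (nc - nd) / (nc + nd)  # 5/7
tau_B <- (nc - nd) / sqrt((n * (n - 1) / 2 - Tx) * (n * (n - 1) / 2 - Ty))

# tau_B computed from the formula should agree with the stats function
all.equal(tau_B, cor(x, y, method = "kendall"))
```

With ties present, tau_A (5/7, about 0.71) is noticeably larger than tau_B (about 0.63), because tau_B keeps the lost comparisons in its denominator.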
The focus for the Bayesian analysis is on the population proportion
of concordance, which is the limit of the ratio n_c/(n_c+n_d). This
proportion is a value on the [0,1] interval, and it is called \phi (Phi).
\phi is also connected to the population \tau_A because
\tau_A = 2\phi - 1. Moreover, Chechile (2020) showed that the
likelihood function for observing n_c concordant changes and n_d
discordant changes is that of a censored Bernoulli process, so the
likelihood is proportional to (\phi^{n_c})((1-\phi)^{n_d}). In Bayesian
statistics, the likelihood function is only specified as a proportional
function because, unlike in frequentist statistics, the likelihoods of
unobserved and more extreme events are not computed. This idea is the
likelihood principle, and its violation can lead to paradoxes
(Lindley & Phillips, 1976). Also, the likelihood need only be a proportional
function because the proportionality constant appears in both the numerator
and denominator of Bayes' theorem, so it cancels out. If the prior for \phi
is a beta distribution, then it follows that the posterior is also a beta
distribution (i.e., the beta is a natural Bayesian conjugate function for
Bernoulli processes). The default prior for the dfba_bivariate_concordance()
function is the flat prior (i.e., a0 = 1 and b0 = 1).
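As a rough sketch of the conjugate update described above, a beta prior with shape parameters a0 and b0 combined with a likelihood proportional to \phi^{n_c}(1-\phi)^{n_d} gives a beta posterior with shape parameters a0 + n_c and b0 + n_d. The base R code below applies this to the five-point example (n_c = 6, n_d = 1). The variable names mirror the components listed under Value, but whether the package computes them in exactly this way is an assumption.

```r
# Counts from the five-point example in the text
nc <- 6
nd <- 1

# Flat beta prior (the documented default)
a0 <- 1
b0 <- 1

# Conjugate beta update: add the concordant and discordant counts
a_post <- a0 + nc
b_post <- b0 + nd

# Posterior median and equal-tail interval for phi
prob_interval <- 0.95
tail_area   <- (1 - prob_interval) / 2
post_median <- qbeta(0.5, a_post, b_post)
eti_lower   <- qbeta(tail_area, a_post, b_post)
eti_upper   <- qbeta(1 - tail_area, a_post, b_post)

# Any value of phi maps to the tau_A scale through tau_A = 2 * phi - 1
tau_from_median <- 2 * post_median - 1

c(post_median = post_median, eti_lower = eti_lower, eti_upper = eti_upper)
```

With only seven usable comparisons the interval is wide, which reflects how little information five data points carry about \phi.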
In the special case where the user has a model for predicting a variate in
terms of known quantities and where there are free-fitting parameters, the
dfba_bivariate_concordance() function's concordance parameter is a
goodness-of-fit measure for the scientific model. Thus, the bivariate pair
consists of the observed value of a variate along with the corresponding
predicted score from the model. The concordance proportion must be adjusted
in these goodness-of-fit applications to take into account the number of free
parameters that were used in the prediction model. Chechile and Barch (2021)
argued that the fitting parameters increase the number of concordant changes.
Consequently, the value for n_c is downward-adjusted as a function of the
number of free parameters. The Chechile-Barch adjusted n_c value for a case
where there are m free fitting parameters is n_c-(n*m)+[m*(m+1)/2]. As an
example, suppose that there are n = 20 scores, and the prediction equation
has m = 2 free parameters that result in creating a prediction for each
observed score (i.e., there are 20 paired values of observed score x and
predicted score y), and further suppose that this model results in
n_c = 170 and n_d = 20. The value of n_d is kept at 20, but
the number of concordant changes is reduced to 170-(20*2)+(2*3/2) = 133.
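The adjustment arithmetic above can be reproduced in base R. In this sketch, adjust_nc() is a hypothetical helper written for illustration and is not part of the DFBA package. The last step, which recomputes tau from the adjusted count as (nc_star - nd)/(nc_star + nd), is an assumption about how a corrected tau would follow from the definition of \tau_A.

```r
# Hypothetical helper: Chechile-Barch downward adjustment of n_c
# for n scores and m free fitting parameters
adjust_nc <- function(nc, n, m) {
  nc - (n * m) + (m * (m + 1) / 2)
}

n  <- 20   # number of paired scores
m  <- 2    # free fitting parameters in the prediction model
nc <- 170  # concordant comparisons before adjustment
nd <- 20   # discordant comparisons (left unchanged)

nc_star <- adjust_nc(nc, n, m)  # 170 - 40 + 3 = 133

# Assumed form of the corrected tau: same definition, adjusted n_c
tau      <- (nc - nd) / (nc + nd)            # 150/190
tau_star <- (nc_star - nd) / (nc_star + nd)  # 113/153

c(nc_star = nc_star, tau = tau, tau_star = tau_star)
```

Under these assumptions the adjustment lowers the concordance measure from about 0.79 to about 0.74, so a model with more free parameters needs more concordant comparisons to reach the same adjusted value.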
Value
A list containing the following components:
tau |
Nonparametric Tau-A correlation |
sample_p |
Sample concordance proportion |
nc |
Number of concordant comparisons |
nd |
Number of discordant comparisons |
a_post |
The first shape parameter for the posterior beta distribution for the concordance proportion |
b_post |
The second shape parameter for the posterior beta distribution for the concordance proportion |
a0 |
The first shape parameter for the prior beta distribution for the concordance proportion |
b0 |
The second shape parameter for the prior beta distribution for the concordance proportion |
prob_interval |
The probability within the interval estimates for the phi parameter |
post_median |
Median of posterior distribution on phi |
eti_lower |
Lower limit of the equal-tail interval with width specified by prob_interval |
eti_upper |
Upper limit of the equal-tail interval with width specified by prob_interval |
tau_star |
Corrected tau_A to account for the number of free fitting parameters in goodness-of-fit applications |
nc_star |
The corrected number of concordant comparisons for a goodness-of-fit application when there is an integer value for fitting.parameters |
nd_star |
The number of discordant comparisons when there is an integer value for fitting.parameters |
sample_p_star |
Corrected proportion of concordant comparisons to account for free fitting parameters in goodness-of-fit applications |
a_post_star |
Corrected value for the first shape parameter of the posterior distribution for the concordance proportion when there are free fitting parameters in goodness-of-fit applications |
b_post_star |
The second shape parameter for the posterior distribution for the concordance proportion when there is a goodness-of-fit application |
post_median_star |
The posterior median for the concordance proportion when there is a goodness-of-fit application |
eti_lower_star |
Lower limit for the interval estimate when there is a goodness-of-fit application |
eti_upper_star |
Upper limit for the interval estimate when there is a goodness-of-fit application |
References
Chechile, R.A. (2020). Bayesian Statistics for Experimental Scientists: A General Introduction Using Distribution-Free Statistics. Cambridge: MIT Press.
Chechile, R.A., & Barch, D.H. (2021). A distribution-free, Bayesian goodness-of-fit method for assessing similar scientific prediction equations. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2021.102638
Lindley, D. V., & Phillips, L. D. (1976). Inference for a Bernoulli process (a Bayesian view). The American Statistician, 30, 112-119.
Examples
x <- c(47, 39, 47, 42, 44, 46, 39, 37, 29, 42, 54, 33, 44, 31, 28, 49, 32, 37, 46, 55, 31)
y <- c(36, 40, 49, 45, 30, 38, 39, 44, 27, 48, 49, 51, 27, 36, 30, 44, 42, 41, 35, 49, 33)
dfba_bivariate_concordance(x, y)
## A goodness-of-fit example for a hypothetical case of fitting data in a
## yobs vector with prediction model
p <- seq(0.05, 0.95, 0.05)
# Note: the coefficients in the ypred equation were found first via a
# polynomial regression
ypred <- 17.332 - (50.261 * p) + (48.308 * p^2)
yobs <- c(19.805, 10.105, 9.396, 8.219, 6.110, 4.543, 5.864, 4.861, 6.136,
          5.789, 5.443, 5.548, 4.746, 6.484, 6.185, 6.202, 9.804, 9.332,
          14.408)
dfba_bivariate_concordance(x = yobs,
                           y = ypred,
                           fitting.parameters = 3)