R: Bayesian Distribution-Free Correlation and Concordance

dfba_bivariate_concordance {DFBA}

R Documentation

Bayesian Distribution-Free Correlation and Concordance

Description

Given bivariate data, computes the sample number of concordant changes nc between the two variates and the number of discordant changes nd. Provides the frequentist tau_A correlation coefficient (nc-nd)/(nc+nd), and provides a Bayesian analysis of the population concordance parameter phi: the limit of the proportion of concordant changes between the variates. For goodness-of-fit applications, provides a concordance measure that corrects for the number of fitting parameters.

Usage

dfba_bivariate_concordance(
  x,
  y,
  a0 = 1,
  b0 = 1,
  prob_interval = 0.95,
  fitting.parameters = NULL
)

Arguments

`x`	Vector of x variable values
`y`	Vector of y variable values
`a0`	First shape parameter for the prior beta distribution (default is 1)
`b0`	Second shape parameter for the prior beta distribution (default is 1)
`prob_interval`	Desired probability within the interval estimates (default is .95)
`fitting.parameters`	(Optional) If either x or y values are generated by a predictive model, the number of free parameters in the model (default is NULL)

Details

The product-moment correlation depends on Gaussian assumptions about the residuals in a regression analysis. It is not robust because it is strongly influenced by any extreme outlier scores for either of the two variates. A rank-based analysis can avoid both of these limitations. The dfba_bivariate_concordance() function is focused on a nonparametric concordance metric for characterizing the association between the two bivariate measures.
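
As a small illustration of the robustness point, the sketch below uses hypothetical data (chosen for illustration, not taken from the package) in which nine points decrease steadily and a single extreme score is added. It relies only on base R's cor(). The product-moment coefficient is pulled to a positive value by the one outlier, while the rank-based coefficient still reflects the decreasing trend in the bulk of the data.

```r
# Hypothetical data: y falls as x rises, except for one extreme outlier.
x <- 1:10
y <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 100)

# Product-moment (Pearson) correlation: pulled positive by the outlier.
r_pearson <- cor(x, y, method = "pearson")

# Rank-based (Kendall) correlation: still negative, because 36 of the
# 45 pairwise comparisons are discordant and only 9 are concordant.
r_kendall <- cor(x, y, method = "kendall")

print(c(pearson = r_pearson, kendall = r_kendall))
```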

To illustrate the nonparametric concepts of concordance and discordance, consider a specific example where there are five paired scores with

x = {3.8, 4.7, 4.7, 4.7, 11.8}

and

y = {5.9, -4.1, 7.3, 7.3, 38.9}.

The ranks for the x variate are 1, 3, 3, 3, 5 and the corresponding ranks for y are 2, 1, 3.5, 3.5, 5, so the five points in terms of their ranks are P_1 = (1, 2), P_2 = (3, 1), P_3 = (3, 3.5), P_4 = (3, 3.5) and P_5 = (5, 5). The relationship between any two of these points P_i and P_j is either (1) concordant if the sign of R_{xi} - R_{xj} is the same as the sign of R_{yi} - R_{yj}, (2) discordant if the signs of R_{xi} - R_{xj} and R_{yi} - R_{yj} differ, or (3) null if either R_{xi} = R_{xj} or R_{yi} = R_{yj}. For the above example, there are ten possible comparisons among the five points: six are concordant, one is discordant, and three are lost due to ties.

In general, given n bivariate scores there are n(n-1)/2 total possible comparisons. Ties in the x variate cause a loss of T_x comparisons, and ties in the y variate cause a loss of T_y comparisons. Comparisons that are tied in both x and y are denoted T_{xy}. The total number of possible comparisons, accounting for ties, is therefore:

n(n-1)/2-T_x-T_y+T_{xy},

where T_{xy} is added to avoid double-counting of lost comparisons.
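
The counts for the five-point example can be checked with a few lines of base R. This is a minimal sketch rather than the package's internal code, and the helper name count_comparisons is hypothetical. It works on the raw scores, because the sign of a difference between two scores is the same as the sign of the difference between their ranks.

```r
# Hypothetical helper: classify every pair (i, j) with i < j as concordant,
# discordant, or lost to a tie in x, in y, or in both.
count_comparisons <- function(x, y) {
  pairs <- combn(length(x), 2)               # all n(n-1)/2 index pairs
  dx <- sign(x[pairs[1, ]] - x[pairs[2, ]])  # sign of each x difference
  dy <- sign(y[pairs[1, ]] - y[pairs[2, ]])  # sign of each y difference
  c(total = ncol(pairs),
    nc    = sum(dx * dy > 0),                # same sign: concordant
    nd    = sum(dx * dy < 0),                # opposite sign: discordant
    Tx    = sum(dx == 0),                    # lost to ties in x
    Ty    = sum(dy == 0),                    # lost to ties in y
    Txy   = sum(dx == 0 & dy == 0))          # tied in both variates
}

x <- c(3.8, 4.7, 4.7, 4.7, 11.8)
y <- c(5.9, -4.1, 7.3, 7.3, 38.9)

counts <- count_comparisons(x, y)
print(counts)

# Comparisons remaining after ties: n(n-1)/2 - T_x - T_y + T_xy
usable <- counts[["total"]] - counts[["Tx"]] - counts[["Ty"]] + counts[["Txy"]]
print(usable)
```

The remaining count equals nc + nd, since every comparison that is not lost to a tie is either concordant or discordant.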

In the above example, there are three lost comparisons due to ties in x, one lost comparison due to a tie in y, and one comparison lost to a tie in both the x and y variates. Thus, there are [(5*4)/2] - 3 - 1 + 1 = 7 comparisons for the above example, which matches n_c + n_d = 6 + 1. The \tau_A correlation is defined as (n_c-n_d)/(n_c+n_d), which is a value on the [-1,1] interval; for the example, \tau_A = (6-1)/(6+1) = 5/7. It is important to note that the original developer of the frequentist \tau correlation used a different coefficient that has come to be called \tau_B, which is given as (n_c-n_d)/([(n*(n-1)/2)-T_x][(n*(n-1)/2)-T_y])^{.5}. \tau_B does not properly correct for tied scores, which is unfortunate because \tau_B is the value returned by the stats function cor(..., method = "kendall"). If there are no ties, then T_x = T_y = T_{xy} = 0 and \tau_A = \tau_B. But if there are ties, then the proper coefficient is given by \tau_A. The dfba_bivariate_concordance() function provides the proper correction for tied scores and outputs a sample estimate for \tau_A.
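
The difference between the two coefficients can be seen on the five-point example. The sketch below plugs the counts stated above (n_c = 6, n_d = 1, T_x = 3, T_y = 1) into both formulas and compares the results with base R's cor(); the variable names are illustrative only.

```r
x <- c(3.8, 4.7, 4.7, 4.7, 11.8)
y <- c(5.9, -4.1, 7.3, 7.3, 38.9)

n  <- length(x)
nc <- 6; nd <- 1        # concordant and discordant counts from the text
Tx <- 3; Ty <- 1        # comparisons lost to ties in x and in y
n0 <- n * (n - 1) / 2   # total possible comparisons

tau_A <- (nc - nd) / (nc + nd)
tau_B <- (nc - nd) / sqrt((n0 - Tx) * (n0 - Ty))

# stats::cor() with method = "kendall" reports tau_B, not tau_A
tau_cor <- cor(x, y, method = "kendall")

print(c(tau_A = tau_A, tau_B = tau_B, cor_kendall = tau_cor))
```

With ties present, the value from cor() agrees with the \tau_B formula and is smaller in magnitude than \tau_A for these data.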

The focus for the Bayesian analysis is on the population proportion of concordance, which is the limit of the ratio n_c/(n_c+n_d). This proportion is a value on the [0,1] interval, and it is called \phi (Phi). \phi is also connected to the population \tau_A because \tau_A=(2\phi -1). Moreover, Chechile (2020) showed that the likelihood function for observing n_c concordant changes and n_d discordant changes is a censored Bernoulli process, so the likelihood is proportional to (\phi^{n_c})((1-\phi)^{n_d}). In Bayesian statistics, the likelihood function is only specified as a proportional function because, unlike in frequentist statistics, the likelihood of unobserved and more extreme events is not computed. This idea is the likelihood principle, and its violation can lead to paradoxes (Lindley & Phillips, 1976). Also, the likelihood need only be a proportional function because the proportionality constant appears in both the numerator and denominator of Bayes' theorem, so it cancels out. If the prior for \phi is a beta distribution with shape parameters a0 and b0, then it follows that the posterior is also a beta distribution, with shape parameters a0 + n_c and b0 + n_d (i.e., the beta is a natural Bayesian conjugate function for Bernoulli processes). The default prior for the dfba_bivariate_concordance() function is the flat prior (i.e., a0 = 1 and b0 = 1).
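
The conjugate analysis can be sketched in base R. The example assumes the standard beta-Bernoulli result, in which a beta prior with shape parameters a0 and b0 combined with n_c concordant and n_d discordant changes gives a beta posterior with shape parameters a0 + n_c and b0 + n_d. The counts are those of the five-point example; the variable names mirror components listed under Value but are computed here independently of the package, so agreement with the function's output is an assumption.

```r
a0 <- 1; b0 <- 1          # flat prior (the function's default)
nc <- 6; nd <- 1          # counts from the five-point example
prob_interval <- 0.95

# Conjugate beta update (assumed standard form)
a_post <- a0 + nc
b_post <- b0 + nd

# Posterior median and equal-tail interval for phi
tail_prob   <- (1 - prob_interval) / 2
post_median <- qbeta(0.5, a_post, b_post)
eti_lower   <- qbeta(tail_prob, a_post, b_post)
eti_upper   <- qbeta(1 - tail_prob, a_post, b_post)

print(c(median = post_median, lower = eti_lower, upper = eti_upper))

# A value of phi maps to the population tau_A through tau_A = 2 * phi - 1
tau_from_median <- 2 * post_median - 1
print(tau_from_median)
```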

In the special case where the user has a model for predicting a variate in terms of known quantities and where there are free-fitting parameters, the dfba_bivariate_concordance() function's concordance parameter is a goodness-of-fit measure for the scientific model. In that case, each bivariate pair consists of the observed value of a variate along with the corresponding predicted score from the model. The concordance proportion must be adjusted in these goodness-of-fit applications to take into account the number of free parameters that were used in the prediction model. Chechile and Barch (2021) argued that the fitting parameters increase the number of concordant changes. Consequently, the value for n_c is downward-adjusted as a function of the number of free parameters. The Chechile-Barch adjusted n_c value for a case where there are m free fitting parameters is n_c-(n*m)+[m*(m+1)/2]. As an example, suppose that there are n = 20 scores, and the prediction equation has m = 2 free parameters that result in creating a prediction for each observed score (i.e., there are 20 paired values of observed score x and predicted score y), and further suppose that this model results in n_c = 170 and n_d = 20. The value of n_d is kept at 20, but the number of concordant changes is reduced to 170-(20*2)+(2*3/2) = 133.
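
The adjustment is simple arithmetic and can be reproduced directly. The sketch below reimplements the formula from the text for the worked example; the helper name adjust_nc is hypothetical. The adjusted coefficient and proportion shown are obtained by applying (n_c-n_d)/(n_c+n_d) and n_c/(n_c+n_d) to the adjusted count, which is assumed (not confirmed here) to correspond to the tau_star and sample_p_star components.

```r
# Hypothetical helper implementing the adjustment from the text:
# nc_star = nc - n * m + m * (m + 1) / 2
adjust_nc <- function(nc, n, m) {
  nc - (n * m) + (m * (m + 1) / 2)
}

n  <- 20    # number of paired observed and predicted scores
m  <- 2     # free fitting parameters in the prediction model
nc <- 170   # unadjusted concordant changes
nd <- 20    # discordant changes (left unchanged)

nc_star <- adjust_nc(nc, n, m)
print(nc_star)

# Effect of the adjustment on the sample statistics
tau_unadjusted <- (nc - nd) / (nc + nd)
tau_adjusted   <- (nc_star - nd) / (nc_star + nd)
p_adjusted     <- nc_star / (nc_star + nd)

print(c(tau = tau_unadjusted, tau_adjusted = tau_adjusted,
        p_adjusted = p_adjusted))
```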

Value

A list containing the following components:

`tau`	Nonparametric Tau-A correlation
`sample_p`	Sample concordance proportion
`nc`	Number of concordant comparisons
`nd`	Number of discordant comparisons
`a_post`	The first shape parameter for the posterior beta distribution for the concordance proportion
`b_post`	The second shape parameter for the posterior beta distribution for the concordance proportion
`a0`	The first shape parameter for the prior beta distribution for the concordance proportion
`b0`	The second shape parameter for the prior beta distribution for the concordance proportion
`prob_interval`	The probability within the interval estimates for the phi parameter
`post_median`	Median of posterior distribution on phi
`eti_lower`	Lower limit of the equal-tail interval with width specified by prob_interval
`eti_upper`	Upper limit of the equal-tail interval with width specified by prob_interval
`tau_star`	Corrected tau_A to account for the number of free fitting parameters in goodness-of-fit applications
`nc_star`	The corrected number of concordant comparisons for a goodness-of-fit application when there is an integer value for `fitting.parameters`
`nd_star`	The number of discordant comparisons when there is an integer value for `fitting.parameters`
`sample_p_star`	Corrected proportion of concordant comparisons to account for free fitting parameters in goodness-of-fit applications
`a_post_star`	Corrected value for the first shape parameter for the posterior for the concordance proportion when there are free fitting parameters in goodness-of-fit applications
`b_post_star`	The second shape parameter for the posterior distribution for the concordance proportion when there is a goodness-of-fit application
`post_median_star`	The posterior median for the concordance proportion when there is a goodness-of-fit application
`eti_lower_star`	Lower limit for the interval estimate when there is a goodness-of-fit application
`eti_upper_star`	Upper limit for the interval estimate when there is a goodness-of-fit application

References

Chechile, R.A. (2020). Bayesian Statistics for Experimental Scientists: A General Introduction Using Distribution-Free Statistics. Cambridge: MIT Press.

Chechile, R.A., & Barch, D.H. (2021). A distribution-free, Bayesian goodness-of-fit method for assessing similar scientific prediction equations. Journal of Mathematical Psychology. https://doi.org/10.1016/j.jmp.2021.102638

Lindley, D. V., & Phillips, L. D. (1976). Inference for a Bernoulli process (a Bayesian view). The American Statistician, 30, 112-119.

Examples



x <- c(47, 39, 47, 42, 44, 46, 39, 37, 29, 42, 54, 33, 44, 31, 28, 49, 32, 37, 46, 55, 31)
y <- c(36, 40, 49, 45, 30, 38, 39, 44, 27, 48, 49, 51, 27, 36, 30, 44, 42, 41, 35, 49, 33)

dfba_bivariate_concordance(x, y)

## A goodness-of-fit example for a hypothetical case of fitting data in a
## yobs vector with prediction model

p <- seq(.05, .95, .05)
ypred <- 17.332 - (50.261 * p) + (48.308 * p^2)

# Note the coefficients in the ypred equation were found first via a
# polynomial regression

yobs <- c(19.805, 10.105, 9.396, 8.219, 6.110, 4.543, 5.864, 4.861, 6.136,
         5.789,  5.443, 5.548, 4.746, 6.484, 6.185, 6.202, 9.804, 9.332,
         14.408)

dfba_bivariate_concordance(x = yobs,
         y = ypred,
         fitting.parameters = 3)

[Package DFBA version 0.1.0 Index]