R: Computes alpha, probability, expected co-occurence, median...

affinity {CooccurrenceAffinity}

R Documentation

Computes alpha, probability, expected co-occurence, median interval, various confidence intervals, other indices of affinity, etc.

Description

This is the principal function of "CooccurrenceAffinity" package that analyzes occurrence or abundance data (e.g., species by site) using other functions of this package and returns several quantities in one main dataframe and (optionally) up to 11 square matrices. This function processes data using dataprep() function and then feeds the data to analytical pipeline which includes ML.Alpha() and AlphInts(). The outputs of the function in the $all dataframe include the following:

alpha_mle: maximum likelihood estimate of the log-odds parameter alpha in the Extended Hypergeometric distribution with fixed margins (mA,mB) and table-total N, which is the "log-affinity" index of co-occurrence championed in a paper by Mainali et al. (2022) as an index of co-occurrence-based similarity; computed in ML.Alpha()
exp_cooccur: expected co-occurrence count under the null (hypergeometric, corresponding to alpha=0) distribution; computed as ML.Alpha()$Null.Exp
p_value: the commonly reported P-value of the observed co-occurrences; computed by AlphInts()$pval
alpha_medianInt: the interval of alpha values compatible with x as median for the Extended Hypergeometric distribution (Harkness 1965)
with fixed margins and alpha; computed in AlphInts() as $MedianIntrvl
conf_level: confidence level for estimating the various types of confidence intervals
ci_: fout types of confidence intervals (see details below)
jaccard: Jaccard index
sorensen: Sørensen-Dice index
simpson: Simpson index

Usage

affinity(
  data,
  row.or.col,
  which.row.or.col = NULL,
  datatype = NULL,
  threshold = NULL,
  class0.rule = NULL,
  sigPval = NULL,
  sigdigit = NULL,
  squarematrix = NULL,
  ...
)

Arguments

`data`	occurrence matrix (binary or abundance) in matrix or dataframe format
`row.or.col`	specify if the pairs of rows or columns are analyzed for affinity. 'row' or 'column'.
`which.row.or.col`	a vector of name or the number of row/column if a subset of the data is intended to be analyzed; optional argument with default of all rows/columns.
`datatype`	specify if the datatype is 'abundance' or 'binary'; optional argument with default 'binary'.
`threshold`	cutoff for converting an abundance data to binary; needed if datatype is 'abundance'
`class0.rule`	'less.or.equal' or 'less'. 'less.or.equal' converts a threshold or lower values to zero and all the others to 1. 'less' converts a threshold and higher values to 1.
`sigPval`	acceptable rate of false positives or probability of rejecting the null hypothesis when it is true, commonly known as alpha in hypothesis testing
`sigdigit`	the number of decimals for rouding alpha mle, its intervals, expected cooccurence under the null, jaccard, sorensen and simpson indices
`squarematrix`	a vector of quantities so that a square matrix for each of them is generated on the top of the main long matrix of all outputs. "alpha_mle", "alpha_mle_sig", "p_value", "cooccur.null", "cooccur.obs", "jaccard", "jaccard_sig", "sorensen", "sorensen_sig", "simpson", "simpson_sig", "all".
`...`	Additional arguments to control behavior of the function.

Details

This function calculates "alpha_mle", which is the maximum likelihood estimate of the log-odds paramater alpha within the Extended Hypergeometric distribition (Harkness 1965) based on the count x and fixed table margins (mA,mB) and total N, which is the "affinity" index of co-occurrence championed in the paper of Mainali et al. (2022) as an index of cooccurrence-based similarity.

This function calculates five intervals, three of them using EHypQuInt, one using EHypMidP, and one using AcceptAffCI. First ("alpha_medianInt") is the interval of alpha values compatible with x as median for the Extended Hypergeometric distribution (Harkness 1965) with fixed margins and alpha. Computed as AlphInts()$MedianIntrvl. This interval quantifies the underlying discreteness of the Extended Hypergeometric and its impact on the estimation of alpha. MedianIntrvl is an interval that will contain the MLE alpha-hat, and the mid-point of that interval is another reasonable estimator of alpha from the data.

There are four confidence intervals computed in AlphInts(), called ci_cp, ci_blaker, ci_midQ, ci_midP, matching name of the outputs in standalone outputs of AlphInts(), except the differences in capital/small letters. The boolean "bound" parameter is an option to prevent the intervals containing alpha-estimates to extend to plus or minus infinity, based on a Bayesian argument. The bound substituted for the Infinite endpoints is provably larger than the largest value the MLE can take whenever x avoids the enpoints max(mA+mB-N,0) and min(mA,mB) of its logical range. The recommended confidence interval for alpha is CI.Blaker if a reliably conservative (over-large) coverage probability is desired, and CI.midP otherwise.

"ci_cp", computed as AlphInts()$CI.CP is an "exact" conservative test-based 2-sided confidence interval (analogous to the Clopper-Pearson (1934) confidence interval for unknown binomial proportion) for alpha based on data (x,mA,mB,N)

"ci_blaker", computed as AlphInts()$CI.Blaker is the Acceptability Confidence Interval of Blaker (2000, Theorem 1) which is a better confidence interval than the CP-type interval "CI.CP" in the sense of being contained within "CI.CP" but still probably conservative, i.e., with coverage probability always at least as large as the nominal level.

"ci_midQ", computed as AlphInts()$CI.midQ has the endpoints obtained as the midpoints of quantile intervals respectively to the (1+lev)/2 and (1-lev)/2 quantiles of the Extended Hypergeometric distribution.

"ci_midP", computed as AlphInts()$CI.midQ, behaves very similarly to "CI.midQ" and is defined by the midP approach analogous to the midP confidence interval for binomial proportions (Agresti 2013, p.605), and is calculated from EHypMidP().

The recommended (slightly conservative) confidence interval is CI.Blaker, while the very similar intervals CI.midQ and CI.midP have coverage generally closer than CI.CP or CI.Blaker to the nominal level of coverage, at the cost of occasionally under-covering by as much as 0.04 or 0.05 for confidence levels 0.90 or 0.95. The comparison among intervals, and different possible goals that CIs of conservative or close-to-nominal coverage can serve, are similar to those compared by Brown et al. (2001) for interval estimation of an unknown binomial proportion.

"p_value" is the two-sided p-value for the equal-tailed test of the null hypothesis alpha=0. This p-value is calculated when pval="Blaker" according to Blaker's (2000) "Acceptability" function; if the input parameter pvalType of AlphInts() is anything else, the p-value is calculated using the same idea as the midP confidence interval.

ADDITIONAL ARGUMENTS can be supplied from ML.Alpha() and AlphInts().

Value

This function returns one main long dataframe ($all) with various outputs in columns (a list given under "details") for each of the pairs of the entities in row. This function also outputs optionally upto 11 square matrices of NxN entities.

Author(s)

Kumar Mainali

References

Agresti, A. (2013) Categorical Data Analysis, 3rd edition, Wiley.

Blaker, H. (2000), “Confidence curves and improved exact confidence intervals for discrete distributions", Canadian Journal of Statistics 28, 783-798.

Brown, L., T. Cai, and A. DasGupta (2001), “Interval Estimation for a Binomial Proportion,” Statistical Science, 16, 101–117.

Clopper, C., and E. Pearson (1934), “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial,” Biometrika, 26, 404–413.

Fog, A. (2015), BiasedUrn: Biased Urn Model Distributions. R package version 1.07.

Harkness, W. (1965), “Properties of the extended hypergeometric distribution“, Annals of Mathematical Statistics, 36, 938-945.

Mainali, K., Slud, E., Singer, M. and Fagan, W. (2022), "A better index for analysis of co-occurrence and similarity", Science Advances, to appear.

Examples

# when you have a binary presence absence occurrence data
# -------------------------------------------------------

if(requireNamespace("cooccur", quietly = TRUE)) {
  data(finches, package = "cooccur")
} else {
  message("The cooccur package is not installed.
          You can install it with install.packages('cooccur').")
}

data(finches, package = "cooccur")



# the remainder of the script has been enclosed under \donttest{}
# to bypass the CRAN's 5 second limit on example files
# --------------------------------------------------------------




head(finches)
# this dataset carries the occurrence records of 13 species in row in 17 islands in columns
dim(finches)

# compute alpha and other quantities for island-pair affinity (beta diversity)
myout <- affinity(data = finches, row.or.col = "col")
myout

# you can simply flip the analysis to rows to compute species-pair affinity
myout <- affinity(data = finches, row.or.col = "row")
myout



# the rows of the outputs above include every single pair of the entities,
# producing many columns for various quantities.
# for several of these columns, the function allows outputing the square NxN matrix of the entities
# an example is here
myout <- affinity(data = finches, row.or.col = "col", squarematrix = c("alpha_mle", "jaccard"))
# it is a list of three elements: one main dataframe and two square matrices
length(myout)
myout
head(myout)

# you can also compute all the square matrices with an "all"
myout <- affinity(data = finches, row.or.col = "col", squarematrix = c("all"))
# this one has 12 elements
length(myout)
myout

# when you want to compute for only certain pairs
myout <- affinity(data = finches, row.or.col = "col", which.row.or.col = 4:6,
                  squarematrix = c("alpha_mle"))
myout

myout <- affinity(data = finches, row.or.col = "col",
                  which.row.or.col = c("Isabella", "Espanola"), squarematrix = c("alpha_mle"))
myout

 #end of \donttest{}



# if you select only one column, the computation stops
## Not run: 
  myout <- affinity(data = finches, row.or.col = "col",
                    which.row.or.col = c("Darwin"), squarematrix = c("alpha_mle"))

## End(Not run)





# you can also add additional arguments bringing them from ML.Alpha() or AlphInts()
myout1 <- affinity(data = finches, row.or.col = "col",
                   which.row.or.col = c("Isabella", "Espanola"), lev=0.95, pvalType="Blaker")
myout1
myout2 <- affinity(data = finches, row.or.col = "col",
                   which.row.or.col = c("Isabella", "Espanola"), lev=0.90, pvalType="Blaker")
myout2
identical(myout1, myout2)
# myout1 and myout2 were generated with identical arguments except a difference in "lev",
# which gave different confidence intervals

myout3 <- affinity(data = finches, row.or.col = "col",
                   which.row.or.col = 4:6, lev=0.95, pvalType="Blaker")
myout3
myout4 <- affinity(data = finches, row.or.col = "col",
                   which.row.or.col = 4:6, lev=0.95, pvalType="midP")
myout4
myout3$all$p_value
myout4$all$p_value
# the p values are (or, can be) different

# when you have abundance data requiring conversion to binary
# -----------------------------------------------------------
# abundance data is converted to binary based on a threshold supplied.
# it might be a good idea to explore dataprep() function and its examples
# first before workign on affinity() for abundance data.
matrix.data <- matrix(runif(400, 0, 10), nrow = 100, ncol = 4)
row.names(matrix.data) <- paste0("id_", 1:nrow(matrix.data))
colnames(matrix.data) <- paste0("variable_", 1:ncol(matrix.data))

# add some missing data and zero abundance
matrix.data[1,1] <- matrix.data[2,3] <- matrix.data[1,4] <- matrix.data[1,2] <- NA
matrix.data[10,4] <- 0
head(matrix.data)
# now this is an abundance data with some missing and some zero occurrences

# inspecting how the abundance is converted to binary first
dataprep(data = matrix.data, row.or.col = "col", datatype = "abundance",
         threshold = 5, class0.rule = "less")
myout10 <- affinity(data = matrix.data, row.or.col = "col",
                    datatype = "abundance", threshold = 5, class0.rule = "less")
myout10

# you can also feed the output of dataprep() to affinity()
myinput <- dataprep(data = matrix.data, row.or.col = "col",
                    datatype = "abundance", threshold = 5, class0.rule = "less")
myout11 <- affinity(data = myinput, row.or.col = "col", datatype = "binary")
myout11
# myout 10 and myout11 are identical
identical(myout10, myout11)

 # end of \donttest{}

[Package CooccurrenceAffinity version 1.0 Index]