R: A Non-parametric Test of Sample Exchangeability and Feature...

getPValue {flintyR}

R Documentation

A Non-parametric Test of Sample Exchangeability and Feature Independence

Description

The V test computes the p-value of a multivariate dataset \mathbf{X}, which informs the user about one of two decisions: (1) whether the sample is exchangeable at a given significance level, assuming that the feature dependencies are known; or (2) whether the features or groups of features are independent at a given significance level, assuming that the sample is exchangeable. See Aw, Spence and Song (2023) for details.

Usage

getPValue(
  X,
  block_boundaries = NULL,
  block_labels = NULL,
  largeP = FALSE,
  largeN = FALSE,
  nruns = 5000,
  type = "unbiased",
  p = 2
)

Arguments

`X`	The binary or real matrix on which to perform test of exchangeability.
`block_boundaries`	Vector denoting the positions where a new block of non-independent features starts. Default is NULL.
`block_labels`	Length `P` vector recording the block label of each feature. Default is NULL.
`largeP`	Boolean indicating whether to use large `P` asymptotics. Default is FALSE.
`largeN`	Boolean indicating whether to use large `N` asymptotics. Default is FALSE.
`nruns`	Resampling number for exact test. Default is 5000.
`type`	Either an unbiased estimate of ('unbiased', default), or valid, but biased estimate of, ('valid') p-value (see Hemerik and Goeman, 2018), or both ('both'). Default is 'unbiased'. Note that unbiased estimate can return `0`.
`p`	The power `p` of `l_p^p`, i.e., `\|\|x\|\|_p^p = (x_1^p+...x_n^p)`. Default is 2.

Details

Automatically detects if dataset is binary, and runs the Hamming distance version of test if so. Otherwise, computes the squared Euclidean distance between samples and evaluates whether the variance of Euclidean distances, V, is atypically large under the null hypothesis of exchangeability. Note the user may tweak the choice of power p if they prefer an l_p^p distance other than Euclidean.

Under the hood, the variance statistic, V, is computed efficiently. Moreover, the user can specify their choice of block permutations, large P asymptotics, or large P and large N asymptotics. The latter two return reasonably accurate p-values for moderately large dimensionalities.

User recommendations: When the number of independent blocks B or number of independent features P is at least 50, it is safe to use large P asymptotics. If P or B is small, however, stick with permutations.

Dependencies: All functions in auxiliary.R

Value

The p-value to be used to test the null hypothesis of exchangeability.

Examples

# Example 1 (get p-value of small matrix with independent features using exact test)
suppressWarnings(require(doParallel))
# registerDoParallel(cores = 2)

X1 <- matrix(nrow = 5, ncol = 10, rbinom(50, 1, 0.5)) # binary matrix, small
getPValue(X1) # perform exact test with 5000 permutations

# should be larger than 0.05

# Example 2 (get p-value of high-dim matrix with independent features using asymptotic test)
X2 <- matrix(nrow = 10, ncol = 1000, rnorm(1e4)) # real matrix, large enough
getPValue(X2, p = 2, largeP = TRUE) # very fast

# should be larger than 0.05
# getPValue(X2, p = 2) # slower, do not run (Output: 0.5764)

# Example 3 (get p-value of high-dim matrix with partitionable features using exact test)

X3 <- matrix(nrow = 10, ncol = 1000, rbinom(1e4, 1, 0.5))
getPValue(X3, block_labels = rep(c(1,2,3,4,5), 200))

# Warning message: # there are features that have zero variation (i.e., all 0s or 1s)
# In getPValue(X3, block_labels = rep(c(1, 2, 3, 4, 5), 200)) :
# There exist columns with all ones or all zeros for binary X.

# Example 4 (get p-value of high-dim matrix with partitionable features using asymptotic test)

## This elaborate example generates binarized versions of time series data.

# Helper function to binarize a marker
# by converting z-scores to {0,1} based on
# standard normal quantiles
binarizeMarker <- function(x, freq, ploidy) {
 if (ploidy == 1) {
   return((x > qnorm(1-freq)) + 0)
 } else if (ploidy == 2) {
   if (x <= qnorm((1-freq)^2)) {
     return(0)
   } else if (x <= qnorm(1-freq^2)) {
     return(1)
   } else return(2)
 } else {
   cat("Specify valid ploidy number, 1 or 2")
 }
}

getAutoRegArray <- function(B, N, maf_l = 0.38, maf_u = 0.5, rho = 0.5, ploid = 1) {
# get minor allele frequencies by sampling from uniform
mafs <- runif(B, min = maf_l, max = maf_u)
# get AR array
ar_array <- t(replicate(N, arima.sim(n = B, list(ar=rho))))
# theoretical column variance
column_var <- 1/(1-rho^2)
# rescale so that variance per marker is 1
ar_array <- ar_array / sqrt(column_var)
# rescale each column of AR array
for (b in 1:B) {
  ar_array[,b] <- sapply(ar_array[,b],
                         binarizeMarker,
                         freq = mafs[b],
                         ploidy = ploid)
}
return(ar_array)
}

## Function to generate the data array with desired number of samples
getExHaplotypes <- function(N) {
  array <- do.call("cbind",
                   lapply(1:50, function(x) {getAutoRegArray(N, B = 20)}))
  return(array)
}

##  Generate data and run test
X4 <- getExHaplotypes(10)
getPValue(X4, block_boundaries = seq(from = 1, to = 1000, by = 25), largeP = TRUE)

# stopImplicitCluster()

[Package flintyR version 0.1.0 Index]