getPValue {flintyR} | R Documentation |
A Non-parametric Test of Sample Exchangeability and Feature Independence
Description
The V test computes the p-value of a multivariate dataset \mathbf{X}
, which
informs the user about one of two decisions: (1) whether the sample is exchangeable
at a given significance level, assuming that the feature dependencies are known;
or (2) whether the features or groups of features are independent at a given significance
level, assuming that the sample is exchangeable.
See Aw, Spence and Song (2023) for details.
Usage
getPValue(
X,
block_boundaries = NULL,
block_labels = NULL,
largeP = FALSE,
largeN = FALSE,
nruns = 5000,
type = "unbiased",
p = 2
)
Arguments
X |
The binary or real matrix on which to perform test of exchangeability. |
block_boundaries |
Vector denoting the positions where a new block of non-independent features starts. Default is NULL. |
block_labels |
Length |
largeP |
Boolean indicating whether to use large |
largeN |
Boolean indicating whether to use large |
nruns |
Resampling number for exact test. Default is 5000. |
type |
Either an unbiased estimate of ('unbiased', default), or valid, but biased estimate of, ('valid') p-value
(see Hemerik and Goeman, 2018), or both ('both'). Default is 'unbiased'. Note that unbiased estimate can return |
p |
The power |
Details
Automatically detects if dataset is binary, and runs the Hamming
distance version of test if so. Otherwise, computes the squared
Euclidean distance between samples and evaluates whether the
variance of Euclidean distances, V
, is atypically large under the
null hypothesis of exchangeability. Note the user may tweak the
choice of power p
if they prefer an l_p^p
distance other than Euclidean.
Under the hood, the variance statistic, V
, is computed efficiently.
Moreover, the user can specify their choice of block permutations,
large P
asymptotics, or large P
and large N
asymptotics. The latter two
return reasonably accurate p-values for moderately large dimensionalities.
User recommendations: When the number of independent blocks B
or number of
independent features P
is at least 50, it is safe to use large P
asymptotics.
If P
or B
is small, however, stick with permutations.
Dependencies: All functions in auxiliary.R
Value
The p-value to be used to test the null hypothesis of exchangeability.
Examples
# Example 1 (get p-value of small matrix with independent features using exact test)
suppressWarnings(require(doParallel))
# registerDoParallel(cores = 2)
X1 <- matrix(nrow = 5, ncol = 10, rbinom(50, 1, 0.5)) # binary matrix, small
getPValue(X1) # perform exact test with 5000 permutations
# should be larger than 0.05
# Example 2 (get p-value of high-dim matrix with independent features using asymptotic test)
X2 <- matrix(nrow = 10, ncol = 1000, rnorm(1e4)) # real matrix, large enough
getPValue(X2, p = 2, largeP = TRUE) # very fast
# should be larger than 0.05
# getPValue(X2, p = 2) # slower, do not run (Output: 0.5764)
# Example 3 (get p-value of high-dim matrix with partitionable features using exact test)
X3 <- matrix(nrow = 10, ncol = 1000, rbinom(1e4, 1, 0.5))
getPValue(X3, block_labels = rep(c(1,2,3,4,5), 200))
# Warning message: # there are features that have zero variation (i.e., all 0s or 1s)
# In getPValue(X3, block_labels = rep(c(1, 2, 3, 4, 5), 200)) :
# There exist columns with all ones or all zeros for binary X.
# Example 4 (get p-value of high-dim matrix with partitionable features using asymptotic test)
## This elaborate example generates binarized versions of time series data.
# Helper function to binarize a marker
# by converting z-scores to {0,1} based on
# standard normal quantiles
binarizeMarker <- function(x, freq, ploidy) {
if (ploidy == 1) {
return((x > qnorm(1-freq)) + 0)
} else if (ploidy == 2) {
if (x <= qnorm((1-freq)^2)) {
return(0)
} else if (x <= qnorm(1-freq^2)) {
return(1)
} else return(2)
} else {
cat("Specify valid ploidy number, 1 or 2")
}
}
getAutoRegArray <- function(B, N, maf_l = 0.38, maf_u = 0.5, rho = 0.5, ploid = 1) {
# get minor allele frequencies by sampling from uniform
mafs <- runif(B, min = maf_l, max = maf_u)
# get AR array
ar_array <- t(replicate(N, arima.sim(n = B, list(ar=rho))))
# theoretical column variance
column_var <- 1/(1-rho^2)
# rescale so that variance per marker is 1
ar_array <- ar_array / sqrt(column_var)
# rescale each column of AR array
for (b in 1:B) {
ar_array[,b] <- sapply(ar_array[,b],
binarizeMarker,
freq = mafs[b],
ploidy = ploid)
}
return(ar_array)
}
## Function to generate the data array with desired number of samples
getExHaplotypes <- function(N) {
array <- do.call("cbind",
lapply(1:50, function(x) {getAutoRegArray(N, B = 20)}))
return(array)
}
## Generate data and run test
X4 <- getExHaplotypes(10)
getPValue(X4, block_boundaries = seq(from = 1, to = 1000, by = 25), largeP = TRUE)
# stopImplicitCluster()