TwoSampleTest.HD {TwoSampleTest.HD}    R Documentation
A two-sample test for the equality of distributions for high-dimensional data
Description
Performs the four tests of equality of the p marginal distributions for two groups proposed in Cousido-Rocha et al. (2018). The methods have been designed for the low sample size and high dimensional setting, and allow for the possibility that the p variables in each data set are weakly dependent. The function also reports a set of p permutation p-values, each of which is derived from testing the equality of distributions in the two groups for each of the variables separately. These p-values are useful when a proposed test rejects the global null hypothesis, since they make it possible to identify which variables have contributed to this significance.
Usage
TwoSampleTest.HD(
X,
Y,
method = c("spect", "spect_ind", "boot", "us", "us_ind", "perm"),
I.permutation.p.values = FALSE,
b_I.permutation.p.values = c("global", "individual")
)
Arguments
X
A matrix where each row is one of the p samples in the first group.
Y
A matrix where each row is one of the p samples in the second group.
method
The two-sample test to be performed. By default the “us” method is used. See Details.
I.permutation.p.values
Logical. Default is FALSE. Indicates whether the permutation p-values should also be computed when the selected method is not “perm”. See Details.
b_I.permutation.p.values
The method used to compute the individual statistics on which the permutation p-values are based. Default is “global”. See Details.
Details
The function implements the two-sample tests proposed by Cousido-Rocha et al. (2018). The methods “spect”, “boot” and “us” are based on a global statistic which is the average of p individual statistics corresponding to each of the p variables. Each of these individual statistics measures the difference between the empirical characteristic functions computed from the two samples. An alternative expression shows that each statistic is essentially the integrated squared difference between kernel density estimates.

The global statistic (average) is standardized using one of three different variance estimators. The method “spect” uses a variance estimator based on spectral analysis, the method “boot” implements the block bootstrap to estimate the variance, and the method “us” employs a variance estimator derived from U-statistic theory (more details in Cousido-Rocha et al., 2018). The methods “spect” and “boot” are suitable under some assumptions, including that the sequence of individual statistics defining the global statistic is strictly stationary, whereas the method “us” avoids this assumption. However, the methods “spect” and “boot” have been checked in simulations and they perform well even when the stationarity assumption is violated. The methods “spect” and “us” have corresponding versions for independent data (“spect_ind” and “us_ind”), for which the variance estimators are simplified taking into account the independence of the variables. The asymptotic normality (when p tends to infinity) of the standardized version of the statistic is used to compute the corresponding p-value.

On the other hand, Cousido-Rocha et al. (2018) also proposed the method “perm”, whose global statistic is the average of the permutation p-values corresponding to the individual statistics mentioned above. This method assumes that the sequence of p-values is strictly stationary; however, in simulations it seems to perform well even when this assumption does not hold. In addition to providing an alternative global test, these p-values can be used when the global null hypothesis is rejected and one wishes to identify which of the p variables have contributed to that rejection.

The global statistic depends on a parameter which plays a role similar to that of a smoothing parameter or bandwidth in kernel density estimation. For the four global tests this parameter is estimated using the information from all the variables or features. For the individual statistics from which the permutation p-values are computed, there are two possibilities: (i) use the value employed in the global test (b_I.permutation.p.values = “global”), or (ii) estimate this parameter for each variable separately using only its sample information (b_I.permutation.p.values = “individual”).
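A minimal sketch of a call combining these options (the toy matrices below are placeholders, not the data of the Examples section; the function and its arguments are those shown in the Usage section):

p <- 50; n <- 5; m <- 6
X <- matrix(rnorm(p * n), nrow = p)  # p variables in rows, n observations per variable
Y <- matrix(rnorm(p * m), nrow = p)  # second group with m observations per variable
# U-statistic variance estimator for the global test; individual permutation
# p-values computed with variable-specific bandwidths
res <- TwoSampleTest.HD(X, Y, method = "us",
                        I.permutation.p.values = TRUE,
                        b_I.permutation.p.values = "individual")
res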
Value
A list containing the following components:
standardized statistic: the value of the standardized statistic.
p.value: the p-value for the test.
statistic: the value of the statistic.
variance: the value of the variance estimator.
p: number of samples or populations.
n: sample size in the first group.
m: sample size in the second group.
method: a character string indicating which two-sample test is performed.
I.statistics: the p individual statistics.
I.permutation.p.values: the p individual permutation p-values.
data.name: a character string giving the name of the data.
Author(s)
Marta Cousido-Rocha
José Carlos Soage González
Jacobo de Uña-Álvarez
Jeffrey D. Hart
References
Cousido-Rocha, M., de Uña-Álvarez, J., and Hart, J. D. (2018). A two-sample test for the equality of marginal distributions for high-dimensional data. Preprint.
Examples
# We consider a simulated data example. We have simulated the following situation.
# We have two groups, for example, 7 patients with tumor 1 and 7 patients with tumor 2.
# For each patient 1000 variables are measured, for example, gene expression levels.
# Besides, the distributions of 100 of the variables are different in the two groups,
# and the differences are in terms of location. The variables are independent to
# simplify the generation of the data sets.
p <- 1000
n <- m <- 7
inds <- sample(1:4, p, replace = TRUE)
X <- matrix(rep(0, n * p), ncol = n)
for (j in 1:p){
  if (inds[j] == 1){
    X[j, ] <- rnorm(n)
  }
  if (inds[j] == 2){
    X[j, ] <- rnorm(n, sd = 2)
  }
  if (inds[j] == 3){
    X[j, ] <- rnorm(n, mean = 1)
  }
  if (inds[j] == 4){
    X[j, ] <- rnorm(n, mean = 1, sd = 2)
  }
}
rho <- 0.1
ind <- sample(1:p, rho * p)
li <- length(ind)
indsy <- inds
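# Swap the distribution labels of the selected variables (1 <-> 3, 2 <-> 4),
# shifting their mean by 1 in the second group while keeping the same variance.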
for (l in 1:li){
  if (indsy[ind[l]] == 1){
    indsy[ind[l]] <- 3
  } else {
    if (indsy[ind[l]] == 2){
      indsy[ind[l]] <- 4
    } else {
      if (indsy[ind[l]] == 3){
        indsy[ind[l]] <- 1
      } else {
        indsy[ind[l]] <- 2
      }
    }
  }
}
Y <- matrix(rep(0, m * p), ncol = m)
for (j in 1:p){
  if (indsy[j] == 1){
    Y[j, ] <- rnorm(m)
  }
  if (indsy[j] == 2){
    Y[j, ] <- rnorm(m, sd = 2)
  }
  if (indsy[j] == 3){
    Y[j, ] <- rnorm(m, mean = 1)
  }
  if (indsy[j] == 4){
    Y[j, ] <- rnorm(m, mean = 1, sd = 2)
  }
}
# Our interest is to test the null hypothesis that the distribution of each of the 1000 variables
# is the same in the two groups.
# We use for this purpose the four methods proposed in Cousido-Rocha et al. (2018).
res1 <- TwoSampleTest.HD(X, Y, method = "spect")
res1
res2 <- TwoSampleTest.HD(X, Y, method = "boot")
res2
res3 <- TwoSampleTest.HD(X, Y, method = "us")
res3
res4 <- TwoSampleTest.HD(X, Y, method = "perm")
res4
# The four methods reject the global null hypothesis.
# Hence, we use the individual permutation p-values
# to identify which variables are not equally distributed in the two groups.
pv <- res4$I.permutation.p.values
# Applying a multiple testing procedure to these p-values
# we can detect the variables with different distributions for the two groups.
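# For instance, the Benjamini-Hochberg adjustment available in base R is one
# possible (illustrative) choice of multiple testing procedure and level:
pv_BH <- p.adjust(pv, method = "BH")
sum(pv_BH < 0.05)    # number of variables declared significant at level 0.05
which(pv_BH < 0.05)  # indices of those variables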
# The following plot of the individual permutation p-values is also informative.
# We mark the 100 smallest p-values in red.
pv_sort <- sort(pv)
cri <- pv_sort[100]
ind <- which(pv <= cri)
plot(1:p, pv, main = "Individual permutation p-values",
xlab = "Variables", ylab = "p-values")
points(ind, pv[ind], col = "red")