R: General multiple tests for univariate and multivariate...

gmtFD {gmtFD}

R Documentation

General multiple tests for univariate and multivariate functional data

Description

The function gmtFD() calculates the statistical tests based on globalizing and supremum pointwise Hotelling's T^2-test statistics (GPH and SPH) for the global null hypothesis and multiple local null hypotheses. Respective p-values are obtained by a parametric bootstrap strategy.

Usage

gmtFD(
  x,
  h,
  blocks_contrasts,
  n_boot = 1000,
  alpha = 0.05,
  parallel = FALSE,
  n_cores = NULL,
  multi_gen = FALSE
)

Arguments

`x`	a list of `k` elements corresponding to groups. Each element representing a group is a list of `p` elements corresponding to functional variables, and each such element (representing a functional variable) is a matrix of size `n_i\times ntp` of descrete observations in design time points. `ntp` denotes a number of design time points.
`h`	contrast matrix. For contrast matrices based on Dunnett’s and Tukey’s contrasts, it can be created by the `contr_mat()` function from the package `GFDmcv` (see examples).
`blocks_contrasts`	a vector with blocks of contrasts labels. The integer labels (from 1 to a number of blocks of contrasts) should be used in non-decreasing order. For univariate case (`p=1`), this is the vector representing group labels, e.g., for `k=3` and Tukey's contrasts, we have `c(1, 2, 3)`. For multivariate case (`p>1`), we have more possibilities. Mainly, we compare samples (groups) rather than particular functional variables, and then we use the vector grouping all variables in samples taking into account contrasts, e.g., for `k=4`, `p=2` and Tukey's contrasts, we have `c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)`, where pairs `(i,i)` for `i=1,...,6` represent the two variables of observations in the `i`th contrast. See examples.
`n_boot`	number of bootstrap samples.
`alpha`	significance level.
`parallel`	a logical indicating whether to use parallelization.
`n_cores`	if `parallel = TRUE`, a number of processes used in parallel computation. Its default value means that it will be equal to the number of cores of a computer used.
`multi_gen`	a logical indicating of whether to use separate multiple generations of Gaussian processes for the parametric bootstrap tests. The default is FALSE, which means that the processes will be generated once in a big matrix. This method is much faster, but for larger `n=n_1+\dots+n_k` and `p` the generated data can be too large for RAM. In such a case, we suggest using separate generation (`multi_gen = TRUE`), which is slower, but possible to calculate.

Details

The function gmtFD() concerns the tests for the heteroscedastic contrast testing problem for univariate and multivariate functional data. The details are presented in Munko et al. (2023, 2024), but here we present some summary of the problem and its solutions implemented in the package.

Suppose we have k independent functional samples \mathbf{x}_{i1},\dots,\mathbf{x}_{in_i}, which consist of independent and identically distributed p-dimensional stochastic processes defined on interval \mathcal{T} with mean function vector \boldsymbol{\eta}_i and covariance function \boldsymbol{\Gamma}_i for each i\in\{1,\dots,k\}. Note that the covariance functions of the different groups may differ from each other, i.e., heteroscedasticity is explicitly allowed.

We consider the null and alternative hypothesis

\mathcal H_0: \mathbf{H}\boldsymbol{\eta}(t) =\mathbf{0}_r \text{ for all } t\in \mathcal{T} \quad \text{vs.} \quad \mathcal H_1: \mathbf{H}\boldsymbol{\eta}(t) \neq \mathbf{0}_r \text{ for some } t\in \mathcal{T},

where \mathbf{H} \in \mathbb{R}^{r \times pk} denotes a known contrast matrix, i.e., \mathbf{H}\mathbf{1}_{pk} = \mathbf{0}_r, and \boldsymbol{\eta} := (\boldsymbol{\eta}_1^{\top},\dots,\boldsymbol{\eta}_k^{\top})^{\top} is the vector of the mean functions. The formulation of this testing framework is very general and contains many special cases like the analysis of variance for univariate and multivariate functional data (FANOVA and FMANOVA) problems.

For univariate functional data (p=1), we may choose \mathbf{H} = \mathbf{P}_k for the one-way FANOVA problem to test the null hypothesis of no main effect, where \mathbf{P}_k:=\mathbf{I}_k-\mathbf{J}_k/k with \mathbf{I}_k \in\mathbb{R}^{k\times k} denoting the unit matrix and \mathbf{J}_k := \mathbf{1}_k\mathbf{1}_k^{\top} \in\mathbb{R}^{k\times k} denoting the matrix of ones. However, there are different possible choices of the contrast matrix \mathbf{H} which lead to this global null hypothesis. Many-to-one comparisons can be considered by choosing Dunnett's contrast matrix \mathbf{H} = [-\mathbf{1}_{k-1}, \mathbf{I}_{k-1}], where the mean functions \eta_2,\dots,\eta_k are compared to the mean function \eta_1 of the first group regarding the different contrasts. To compare all pairs of mean functions \eta_{i_1},\eta_{i_2}, i_1,i_2 \in\{1,\dots,k\} with i_1 \neq i_2, the Tukey's contrast matrix:

\mathbf{H} = \begin{bmatrix} -1 & 1 & 0 & 0 & \cdots & \cdots & 0 \\ -1 & 0 & 1 & 0 &\cdots & \cdots & 0 \\ \vdots & \vdots &\vdots & \vdots & \ddots & \vdots & \vdots \\ -1 & 0 & 0 & 0& \cdots & \cdots & 1\\ 0 & -1 & 1 & 0& \cdots & \cdots & 0 \\ 0 & -1 & 0 & 1& \cdots & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -1 & 1 \end{bmatrix} \in \mathbb{R}^{k(k-1)/2 \times k}

can be used.

For multivariate functional data (p>1), we extend the matrices for p=1 given above as follows: \mathbf{H}=\mathbf{H}_1\otimes \mathbf{I}_p, where \mathbf{H}_1 is the contrast matrix for the univariate case. For more examples of contrast matrices in different settings, see Merle et al. (2023, 2024).

For this testing problem, we consider the pointwise Hotelling's T^2-test statistic

\mathrm{PH}_{n,\mathbf{H}}(t):= n(\mathbf{H}\boldsymbol{\widehat\eta}(t))^{\top} (\mathbf{H}\mathbf{\widehat \Gamma}(t,t) \mathbf{H}^{\top})^+ \mathbf{H} \boldsymbol{\widehat\eta}(t)

for all t \in\mathcal{T}, where \boldsymbol{\widehat\eta} := (\boldsymbol{\widehat\eta}_1^{\top},\dots,\boldsymbol{\widehat\eta}_k^{\top})^{\top} denotes the vector of all mean function estimators, \mathbf{A}^+ denotes the Moore-Penrose inverse of the matrix \mathbf{A}, and

\boldsymbol{\widehat{\Gamma}}(t,s):= \mathrm{diag}\left( \frac{n}{n_1}\widehat{\boldsymbol{\Gamma}}_1(t,s), \ldots,\frac{n}{n_k}\widehat{\boldsymbol{\Gamma}}_k(t,s)\right),

n=n_1+\dots+n_k, \widehat{\boldsymbol{\Gamma}}_i(t,s) is the sample covariance function for the i-th group, i\in\{1,\dots,k\}. Based on this pointwise Hotelling's T^2-test statistic, we construct the globalizing and supremum of pointwise Hotelling's T^2-test (GPH and SPH) statistics by integrating and supremum over the pointwise Hotelling's T^2-test statistic, that is

I_{n}(\mathbf{H}) := \int_{\mathcal{T}} \mathrm{PH}_{n,\mathbf{H}}(t) \,\mathrm{ d }t,\ \ T_{n}(\mathbf{H}) := \mathrm{sup}_{t\in\mathcal{T}} \mathrm{PH}_{n,\mathbf{H}}(t).

We consider the parametric bootstrap test based on these test statistics. However, for better post hoc analysis, we also consider the multiple contrast testing procedures. The main idea of multiple contrast tests is to split up the global null hypothesis with matrix \mathbf{H}= [\mathbf{H}_1^{\top}, \dots, \mathbf{H}_R^{\top}]^{\top} into R matrices \mathbf{H}_{\ell}\in\mathbb{R}^{r_{\ell}\times k} with \mathrm{rank}(\mathbf{H}_{\ell})\geq 1, \ell\in\{1,\dots,R\}, where R,r_{\ell}\in\{1,\dots,r\}, and \sum_{\ell=1}^Rr_{\ell}=r. This leads to the multiple testing problem with null hypotheses

\mathcal H_{0,{\ell}} : \; \mathbf{H}_{\ell} \boldsymbol{\eta}(t) = \mathbf{0}_{r_{\ell}} \;\text{ for all }t\in\mathcal{T}, \text{for }\ell\in \{1,\ldots,R\}.

To verify this family of null hypotheses, we adopt two approaches. First, we simply apply the above test to each hypothesis \mathcal H_{0,{\ell}}, and the resulting p-values are then corrected by the Bonferroni's method. However, this approach, denoted in the package as GPH and SPH, may give conservative test and loss of power. Thus, we also consider the test adopting the idea for the construction of simultaneous confidence bands proposed by Buhlmann (1998). This test is denoted by mGPH and mSPH in the package and is a more powerful solution than the GPH and SPH procedures, which was shown in Munko et al. (2023, 2024).

Note that the value of the test statistics for the mGPH and mSPH tests for global hypotheses are equal to

\max_{\ell\in\{1,\ldots,R\}}\frac{I_{n}(\mathbf{H}_{\ell})}{q_{GPH,\ell,\widetilde{\beta}}^{\mathcal{P}}}\text{ and } \max_{\ell\in\{1,\ldots,R\}}\frac{T_{n}(\mathbf{H}_{\ell})}{q_{SPH,\ell,\widetilde{\beta}}^{\mathcal{P}}},

respectively, where q_{GPH,\ell,\widetilde{\beta}}^{\mathcal{P}} and q_{SPH,\ell,\widetilde{\beta}}^{\mathcal{P}} are the quantiles calculated using the adaptation of the method by Buhlmann (1998). The critical value for it is always 1.

Please have a look at a summary function designed for this package. It can be used to simplify the output of gmtFD() function.

Value

A list of class multifmanova containing the following components:

`res_global`	a data frame containing the results for testing the global null hypothesis, i.e., test statistics and `p`-values.
`res_multi`	all results of multiple contrasts tests for particular hypothesis in a contrast matrix `h`, i.e., test statistics, critical values and `p`-values.
`k`	a number of groups.
`p`	a number of variables.
`ntp`	a number of design time points.
`n`	a vector of sample sizes.
`h`	an argument `h`.
`n_boot`	an argument `n_boot`.
`alpha`	an argument `alpha`.
`multi_gen`	an argument `multi_gen`.

References

Buhlmann P. (1998) Sieve bootstrap for smoothing in nonstationary time series. Annals of Statistics 26, 48-83.

Dunnett C. (1955) A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association 50, 1096-1121.

Munko M., Ditzhaus M., Pauly M., Smaga L., Zhang J.T. (2023) General multiple tests for functional data. Preprint https://arxiv.org/abs/2306.15259

Munko M., Ditzhaus M., Pauly M., Smaga L. (2024) Multiple comparison procedures for simultaneous inference in functional MANOVA. Preprint https://arxiv.org/abs/2406.01242

Tukey J.W. (1953) The problem of multiple comparisons. Princeton University.

Examples

# Some of the examples may run some time.

# Canadian weather data set
# There are three samples of mean temperature and precipitations for
# fifteen weather stations in Western Canada, another fifteen in Eastern Canada, 
# and the remaining five in Northern Canada.

# one functional variable - temperature
library(fda)
data_set_t <- t(CanadianWeather$dailyAv[,, "Temperature.C"])
# number of samples
k <- 3
# number of variables
p <- 1
# preparing data set
gr_label <- rep(c(1, 2, 3), c(15, 15, 5))
data_set <- list(list(data_set_t[gr_label == 1, ]),
                 list(data_set_t[gr_label == 2, ]),
                 list(data_set_t[gr_label == 3, ]))
# trajectories of mean temperature and precipitation
oldpar <- par(mar = c(4, 4, 2, 0.1))
matplot(t(data_set_t), type = "l", col = gr_label, lty = 1,
        xlab = "Day", ylab = "Temperature (C)",
        main = "Canadian weather data set")
legend("bottom", legend = c("Western Canada", "Eastern Canada", "Northern Canada"),
       col = 1:3, lty = 1)
par(oldpar)

# Tukey's contrast matrix
h_tukey <- GFDmcv::contr_mat(k, type = "Tukey")
h_tukey_m <- kronecker(h_tukey, diag(p))
# vector of blocks of contrasts labels
blocks_contrasts <- rep(1:(nrow(h_tukey_m) / p), each = p)

# testing without parallel computing
res <- gmtFD(data_set, h_tukey_m, blocks_contrasts)
summary(res, digits = 3)
# plots for pointwise Hotelling's T^2-test statistics
oldpar <- par(mfrow = c(2, 2), mar = c(4, 2, 2, 0.1))
plot(ph_test_statistic(data_set, h_tukey_m), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Global hypothesis", xlab = "Day")
plot(ph_test_statistic(data_set, matrix(h_tukey_m[1, ], 1)), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Contrast 1", xlab = "Day")
plot(ph_test_statistic(data_set, matrix(h_tukey_m[2, ], 1)), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Contrast 2", xlab = "Day")
plot(ph_test_statistic(data_set, matrix(h_tukey_m[3, ], 1)), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Contrast 3", xlab = "Day")
par(oldpar)

# Dunnett's contrast matrix
h_dunnett <- GFDmcv::contr_mat(k, type = "Dunnett")
h_dunnett_m <- kronecker(h_dunnett, diag(p))
# vector of blocks of contrasts labels
blocks_contrasts <- rep(1:(nrow(h_dunnett_m) / p), each = p)

# testing without parallel computing
res <- gmtFD(data_set, h_dunnett_m, blocks_contrasts)
summary(res, digits = 3)
# plots for pointwise Hotelling's T^2-test statistics
oldpar <- par(mfrow = c(3, 1), mar = c(4, 2, 2, 0.1))
plot(ph_test_statistic(data_set, h_dunnett_m), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_dunnett_m))),
     main = "Global hypothesis", xlab = "Day")
plot(ph_test_statistic(data_set, matrix(h_dunnett_m[1, ], 1)), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_dunnett_m))),
     main = "Contrast 1", xlab = "Day")
plot(ph_test_statistic(data_set, matrix(h_dunnett_m[2, ], 1)), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_dunnett_m))),
     main = "Contrast 2", xlab = "Day")
par(oldpar)

# two functional variables - temperature and precipitation
library(fda)
data_set_t <- t(CanadianWeather$dailyAv[,, "Temperature.C"])
data_set_p <- t(CanadianWeather$dailyAv[,, "Precipitation.mm"])
# number of samples
k <- 3
# number of variables
p <- 2
# preparing data set
gr_label <- rep(c(1, 2, 3), c(15, 15, 5))
data_set <- list(list(data_set_t[gr_label == 1, ], data_set_p[gr_label == 1, ]),
                 list(data_set_t[gr_label == 2, ], data_set_p[gr_label == 2, ]),
                 list(data_set_t[gr_label == 3, ], data_set_p[gr_label == 3, ]))
# trajectories of mean temperature and precipitation
oldpar <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 0.1))
matplot(t(data_set_t), type = "l", col = gr_label, lty = 1,
        xlab = "Day", ylab = "Temperature (C)",
        main = "Canadian weather data set")
legend("bottom", legend = c("Western Canada", "Eastern Canada", "Northern Canada"),
       col = 1:3, lty = 1)
matplot(t(data_set_p), type = "l", col = gr_label, lty = 1,
        xlab = "Day", ylab = "Precipitation (mm)",
        main = "Canadian weather data set")
legend("topleft", legend = c("Western Canada", "Eastern Canada", "Northern Canada"),
       col = 1:3, lty = 1)
par(oldpar)

# Tukey's contrast matrix
h_tukey <- GFDmcv::contr_mat(k, type = "Tukey")
h_tukey_m <- kronecker(h_tukey, diag(p))
# vector of blocks of contrasts labels
blocks_contrasts <- rep(1:(nrow(h_tukey_m) / p), each = p)

# testing without parallel computing
res <- gmtFD(data_set, h_tukey_m, blocks_contrasts)
summary(res, digits = 3)
# plots for pointwise Hotelling's T^2-test statistics
oldpar <- par(mfrow = c(2, 2), mar = c(4, 2, 2, 0.1))
plot(ph_test_statistic(data_set, h_tukey_m), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Global hypothesis", xlab = "Day")
plot(ph_test_statistic(data_set, h_tukey_m[1:2, ]), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Contrast 1", xlab = "Day")
plot(ph_test_statistic(data_set, h_tukey_m[3:4, ]), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Contrast 2", xlab = "Day")
plot(ph_test_statistic(data_set, h_tukey_m[5:6, ]), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_tukey_m))),
     main = "Contrast 3", xlab = "Day")
par(oldpar)

# Dunnett's contrast matrix
h_dunnett <- GFDmcv::contr_mat(k, type = "Dunnett")
h_dunnett_m <- kronecker(h_dunnett, diag(p))
# vector of blocks of contrasts labels
blocks_contrasts <- rep(1:(nrow(h_dunnett_m) / p), each = p)

# testing without parallel computing
res <- gmtFD(data_set, h_dunnett_m, blocks_contrasts)
summary(res, digits = 3)
# plots for pointwise Hotelling's T^2-test statistics
oldpar <- par(mfrow = c(3, 1), mar = c(4, 2, 2, 0.1))
plot(ph_test_statistic(data_set, h_dunnett_m), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_dunnett_m))),
     main = "Global hypothesis", xlab = "Day")
plot(ph_test_statistic(data_set, matrix(h_dunnett_m[1, ], 1)), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_dunnett_m))),
     main = "Contrast 1", xlab = "Day")
plot(ph_test_statistic(data_set, matrix(h_dunnett_m[2, ], 1)), type = "l",
     ylim = c(0, max(ph_test_statistic(data_set, h_dunnett_m))),
     main = "Contrast 2", xlab = "Day")
par(oldpar)

# testing with parallel computing
library(doParallel)
blocks_contrasts <- rep(1:(nrow(h_tukey_m) / p), each = p)
res <- gmtFD(data_set, h_tukey_m, blocks_contrasts, parallel = TRUE, n_cores = 2)
summary(res, digits = 3)

[Package gmtFD version 0.1.0 Index]