multi.split {hdi}

R Documentation

Calculate P-values Based on Multi-Splitting Approach

Description

Calculate p-values and confidence intervals based on the multi-splitting approach.

Usage

multi.split(x, y, B = 100, fraction = 0.5, ci = TRUE, ci.level = 0.95,
            model.selector = lasso.cv,
            classical.fit = lm.pval, classical.ci = lm.ci,
            parallel = FALSE, ncores = getOption("mc.cores", 2L),
            gamma = seq(ceiling(0.05 * B) / B, 1 - 1 / B, by = 1 / B),
            args.model.selector = NULL, args.classical.fit = NULL,
            args.classical.ci = NULL,
            return.nonaggr = FALSE, return.selmodels = FALSE,
            repeat.max = 20,
            verbose = FALSE)
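
The default for `gamma` in the signature above depends on `B`. As a quick check in base R (no `hdi` functions involved), the expression can be evaluated for the default `B = 100`; it should give a grid of quantile levels from 0.05 to 0.99 in steps of 0.01.

```r
## Evaluate the default 'gamma' expression from the Usage section for B = 100.
B <- 100
gamma <- seq(ceiling(0.05 * B) / B, 1 - 1 / B, by = 1 / B)

length(gamma)  # number of candidate gamma values
range(gamma)   # smallest and largest gamma value
```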

Arguments

`x`	numeric design matrix (without intercept).
`y`	numeric response vector.
`B`	the number of sample-splits, a positive integer.
`fraction`	a number in `(0,1)`, the fraction of data used at each sample split for the model selection process. The remaining data is used for calculating the p-values.
`ci`	logical indicating if a confidence interval should be calculated for each parameter.
`ci.level`	(if `ci` is true:) a number in `(0,1)`, typically close to 1, the desired coverage level of the confidence intervals.
`model.selector`	a `function` to perform model selection, with default `lasso.cv`. The function must have at least two arguments, `x` (the design matrix) and `y` (the response vector), and must return the index vector of the selected columns. See `lasso.cv` and `lasso.firstq` for examples. Additional arguments can be passed via `args.model.selector`.
`classical.fit`	a `function` to calculate (classical) p-values. Default is `lm.pval`. The function must have at least two arguments, `x` (the design matrix) and `y` (the response vector), and return the vector of p-values. See `lm.pval` for an example. Additional arguments can be passed through `args.classical.fit`.
`classical.ci`	a `function` to calculate (classical) confidence intervals. Default is `lm.ci`. The function must have at least 3 arguments, `x` (the design matrix), `y` (the response vector) and `level` (the coverage level), and return the matrix of confidence intervals. See `lm.ci` for an example. Additional arguments can be passed through `args.classical.ci`.
`parallel`	logical indicating if parallelization via `mclapply` should be used.
`ncores`	number of cores used for parallelization as `mc.cores` in `mclapply()`.
`gamma`	vector of gamma-values. If `gamma` is a scalar, the value `Q_j` is calculated instead of `P_j` (see Meinshausen et al., 2009, in the References).
`args.model.selector`	named `list` of further arguments for function `model.selector`.
`args.classical.fit`	named `list` of further arguments for function `classical.fit`.
`args.classical.ci`	named `list` of further arguments for function `classical.ci`.
`return.nonaggr`	`logical` indicating if the unadjusted p-values should be returned.
`return.selmodels`	`logical` indicating if the selected models (at each split) should be returned. Necessary for the `clusterGroupTest()` part of the result.
`repeat.max`	positive integer indicating the maximal number of split trials. Should not matter in regular cases, but necessary to prevent infinite loops in borderline cases.
`verbose`	logical indicating if information should be printed while computing.
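
The interfaces required of `model.selector` and `classical.fit`, and the role of `fraction`, can be illustrated with a minimal base R sketch of a single sample split. The functions `topcor.selector` and `ols.pvalues` and the argument `k` are hypothetical names made up for this illustration; they are not part of `hdi`. The adjustment step (multiply by the number of selected variables, cap at 1, set unselected variables to 1) follows the single-split procedure in Meinshausen et al. (2009) and is not taken from the package source.

```r
## Hypothetical selector: keep the k columns with the largest absolute
## marginal correlation with y. Returns an index vector of selected columns.
topcor.selector <- function(x, y, k = 5, ...) {
  score <- abs(drop(cor(x, y)))
  sort(order(score, decreasing = TRUE)[seq_len(min(k, ncol(x)))])
}

## Hypothetical p-value function: ordinary least squares t-tests,
## one p-value per column of x (the intercept row is dropped).
ols.pvalues <- function(x, y, ...) {
  fit <- lm(y ~ x)
  unname(coef(summary(fit))[-1, 4])
}

set.seed(1)
n <- 40; p <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- 2 * x[, 1] + 2.5 * x[, 2] + rnorm(n)

## One sample split: 'fraction' of the rows is used for selection,
## the remaining rows are used for the classical p-values.
fraction <- 0.5
in.sel <- sample(n, floor(fraction * n))
sel    <- topcor.selector(x[in.sel, , drop = FALSE], y[in.sel], k = 5)
p.sel  <- ols.pvalues(x[-in.sel, sel, drop = FALSE], y[-in.sel])

## Adjust for the size of the selected model; unselected variables get 1.
pval <- rep(1, p)
pval[sel] <- pmin(1, p.sel * length(sel))
round(pval, 4)
```

Assuming the interface described above, such functions would presumably be supplied as `multi.split(x, y, model.selector = topcor.selector, args.model.selector = list(k = 5), classical.fit = ols.pvalues)`.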

Value

`pval.corr`	Vector of multiple testing corrected p-values.
`gamma.min`	Value of gamma at which the minimal p-value was attained.
`clusterGroupTest`	Function to perform groupwise tests based on hierarchical clustering. You can either provide a distance matrix and clustering method or the output of hierarchical clustering from the function `hclust` as for `clusterGroupBound`. P-values are adjusted for multiple testing.
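
How per-split p-values are combined into `pval.corr` and `gamma.min` can be sketched from the formulas in Meinshausen et al. (2009): for each variable, `Q_j(gamma)` is the minimum of 1 and the empirical `gamma`-quantile of the per-split p-values divided by `gamma`, and `P_j` is the minimum of 1 and `(1 - log(gamma.min)) * inf Q_j(gamma)`, where the `gamma.min` in this formula is the smallest candidate gamma. The function `aggr.sketch` below is a hypothetical base R illustration of those formulas, not the package's internal code; the use of `quantile(type = 1)` for the empirical quantile is an assumption.

```r
## Hypothetical aggregation sketch. 'pval.mat' is a B x p matrix of
## adjusted per-split p-values (rows = splits, columns = variables).
aggr.sketch <- function(pval.mat, gamma) {
  stopifnot(is.matrix(pval.mat), all(gamma > 0 & gamma <= 1))
  one.variable <- function(pj) {
    ## Q_j(gamma) for every candidate gamma
    q <- vapply(gamma, function(g)
      min(1, quantile(pj / g, probs = g, type = 1, names = FALSE)),
      numeric(1))
    ## Scalar gamma: report Q_j itself, without the search penalty
    if (length(gamma) == 1L) return(c(pval = q, gamma.min = gamma))
    k <- which.min(q)
    c(pval = min(1, (1 - log(min(gamma))) * q[k]), gamma.min = gamma[k])
  }
  res <- apply(pval.mat, 2, one.variable)
  list(pval.corr = res["pval", ], gamma.min = res["gamma.min", ])
}

## Toy input: one variable with tiny p-values in every split, one with 1.
set.seed(1)
B <- 100
gamma <- seq(ceiling(0.05 * B) / B, 1 - 1 / B, by = 1 / B)
pval.mat <- cbind(strong = runif(B, 0, 1e-6), null = rep(1, B))
out <- aggr.sketch(pval.mat, gamma)
out$pval.corr
out$gamma.min
```

In this sketch the factor `1 - log(min(gamma))` is the price paid for searching over the whole gamma grid, which is why it is dropped when a single gamma is supplied.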

Author(s)

Lukas Meier, Ruben Dezeure, Jacopo Mandozzi

References

Meinshausen, N., Meier, L. and Bühlmann, P. (2009) P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671–1681.

Mandozzi, J. and Bühlmann, P. (2015) A sequential rejection testing method for high-dimensional regression with correlated variables. To appear in the International Journal of Biostatistics. Preprint arXiv:1502.03300

Examples

library(hdi)

n <- 40 # a bit small, to keep example "fast"
p <- 256
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- x[,1] * 2 + x[,2] * 2.5 + rnorm(n)

## Multi-splitting with lasso.firstq as model selector function
## 'q' must be specified
fit.multi <- multi.split(x, y, model.selector = lasso.firstq,
                         args.model.selector = list(q = 10))
fit.multi
head(fit.multi$pval.corr, 10) ## the first 10 p-values
ci. <- confint(fit.multi)
head(ci.) # the first 6
stopifnot(all.equal(ci.,
     with(fit.multi, cbind(lci, uci)), check.attributes=FALSE))


## Use default 'lasso.cv' (slower!!) -- incl cluster group testing:
system.time(fit.m2 <- multi.split(x, y, return.selmodels = TRUE)) # 9 sec (on "i7")
head(fit.m2$pval.corr) ## the first 6 p-values
head(confint(fit.m2))  ## the first 6 95% confidence intervals

## Now do clustergroup testing
clGTst <- fit.m2$clusterGroupTest
names(envGT <- environment(clGTst))# about 14
if(!interactive()) # if you are curious (and advanced):
  print(ls.str(envGT), max = 0)
stopifnot(identical(clGTst, envGT$clusterGroupTest))
ccc <- clGTst()
str(ccc)
ccc$hh   # the clustering
has.1.or.2 <- sapply(ccc$clusters,
                function(j.set) any(c(1,2) %in% j.set))
ccc$pval[ has.1.or.2] ## all very small
ccc$pval[!has.1.or.2] ## all 1

[Package hdi version 0.1-9 Index]