R: Goodness of fit tests for potentially high-dimensional linear...

RPtest {RPtests}

R Documentation

Goodness of fit tests for potentially high-dimensional linear models

Description

Can test for the significance of (potentially large) groups of predictors and the presence of nonlinearity or heteroscedasticity in the context of both low and high-dimensional linear models. Outputs a p-value. Also allows for the calibration of arbitrary goodness of fit tests via specification of RPfunction.

Usage

RPtest(
  x,
  y,
  resid_type = c("Lasso", "OLS"),
  test = c("nonlin", "group", "hetero"),
  x_alt,
  RPfunction = NULL,
  B = 49L,
  rand_gen = rnorm,
  noise_matrix = NULL,
  mc.cores = 1L,
  nfolds = 5L,
  nperms = 2L,
  beta_est = NULL,
  resid_only = FALSE,
  output_all = FALSE,
  verbose = FALSE
)

Arguments

`x`	Input matrix with `nobs` rows, each an observation vector.
`y`	Response vector.
`resid_type`	Type of residuals used for the test (see details below). Use `Lasso` when the null model is high-dimensional; otherwise use `OLS`.
`test`	Type of departure from the linear model to test for (see details below). Ignored if `RPfunction` is given.
`x_alt`	If `test` is `group`, this gives the set of variables whose significance we wish to ascertain, after controlling for those in `x`. If `RPfunction` is given, it is the input matrix passed to the function `RPfunction`.
`RPfunction`	A residual prediction (RP) function that must permit calling as `RPfunction(x_alt, resid)` where `resid` is a numeric vector with `nobs` components. The output must be either a single number or a numeric vector (in the latter case `RPfunction` would encode a number of RP functions).
`B`	The number of bootstrap samples to use - note the p-value produced will always be at least 1/B.
`rand_gen`	A function to generate the simulated errors up to an unknown scale factor. It must permit calling as `rand_gen(nobs*B)`. Determines the form of errors in the null model. The default `rnorm` equates to a null of a (sparse) Gaussian linear model. Setting `rand_gen=NULL` resamples residuals to generate simulated errors and approximates a null of i.i.d. errors with unknown distribution.
`noise_matrix`	An optional matrix whose columns are the simulated errors to use. Note that `B` and `rand_gen` will be ignored if this is supplied.
`mc.cores`	The number of cores to use. Will always be 1 in Windows.
`nfolds`	Number of folds to use when performing cross-validation to obtain `beta_est`, the initial estimate of the vector of regression coefficients, via Lasso estimation.
`nperms`	Number of permutations of the data for which `nfolds` cross-validation is to be performed. Thus in total prediction errors on `nfolds*nperms` folds are averaged over.
`beta_est`	An optional user-supplied estimate.
`resid_only`	If `TRUE` only outputs the residuals without applying an RP function.
`output_all`	In addition to the p-value, gives further output (see Value below).
`verbose`	Whether to print addition information.

Details

The function works by first computing residuals from a regression of y on x. Next B sets of errors generated through rand_gen are added to a signal derived from beta_est and aritificial residuals are computed. The option resid_only=TRUE then outputs these residuals along with the original residuals, scaled to have l_2-norm squared equal to nobs. The residuals in question are OLS residuals when resid_type=OLS (case a - for use when the null hypothesis is low-dimensional so the number of columns of x is smaller than nobs-1), and square-root / scaled Lasso residuals otherwise (case b). The options for test then apply different functions to the residuals as described below.

nonlin: In case (a), the test statistic is the RSS (residual sum of squares) of a randomForest fit from regressing the residuals on to x; case (b) is similar but the OOB error is used and the regression is carried out on the equicorrelation set rather than all of x.
group: x_alt is first residualised with respect to x by (a) OLS or (b) sparse_proj. Then the RSS from Lasso fits from regressions of the residuals on to x_alt are used.
hetero: Uses the RSS from Lasso fits from regressions of the squared residuals to the equicorrelation set (b) or all of x (a).

Value

When resid_only=FALSE and output_all=FALSE, the output is a single p-value. Otherwise, a list with some of the following components is returned (resid_only=FALSE causes the last two components to be omitted):

p-value: p-value
beta_est: estimated vector of regression coefficients beta_est
sigma_est: set to 1 when resid_type=OLS; otherwise the normalised root-RSS derived from beta_est used in generated the simulated errors
resid: scaled residuals
resid_sim: simulated scaled residuals
test: the test statistic(s) - may be a vector if multiple RP functions are being used such as when test=group
test_sim: a list of simulated test statistics

References

Shah, R. D., Buhlmann, P. (2017) Goodness-of-fit tests for high dimensional linear models https://rss.onlinelibrary.wiley.com/doi/10.1111/rssb.12234

Examples

# Testing for nonlinearity
set.seed(1)
x <- scale(matrix(runif(100*200), 100, 200))
y <- x[, 1] + x[, 1]^4 + rnorm(nrow(x))
out <- RPtest(x, y, test="nonlin", B=9L, nperms=2, resid_type = "Lasso")

# Testing significance of a group
y <- x[, 1:5] %*% rep(1, 5) + x[, 151] + rnorm(nrow(x))
(out <- RPtest(x[, 1:150], y, test="group", x_alt = x[, 151:200], B=9L, nperms=1))

# Testing for heteroscedasticity
x <- scale(matrix(runif(250*100), 250, 100))
hetero_sig <- x[, 1] + x[, 2]
var_vec <- hetero_sig - min(hetero_sig) + 0.01
var_vec <- var_vec / mean(var_vec)
sd_vec <- sqrt(var_vec)
y <- x[, 1:5] %*% rep(1, 5) + sd_vec*rnorm(nrow(x))
(out <- RPtest(x, y, test="hetero", B=9L, nperms=1))

[Package RPtests version 0.1.5 Index]