R: The Knockoff Filter

knockoff.filter {knockoff}

R Documentation

The Knockoff Filter

Description

This function runs the Knockoffs procedure from start to finish, selecting variables relevant for predicting the outcome of interest.

Usage

knockoff.filter(
  X,
  y,
  knockoffs = create.second_order,
  statistic = stat.glmnet_coefdiff,
  fdr = 0.1,
  offset = 1
)

Arguments

`X`	n-by-p matrix or data frame of predictors.
`y`	response vector of length n.
`knockoffs`	method used to construct knockoffs for the `X` variables. It must be a function taking an n-by-p matrix as input and returning an n-by-p matrix of knockoff variables. By default, approximate model-X Gaussian knockoffs are used.
`statistic`	statistics used to assess variable importance. By default, a lasso statistic with cross-validation is used. See the Details section for more information.
`fdr`	target false discovery rate (default: 0.1).
`offset`	either 0 or 1 (default: 1). This is the offset used to compute the rejection threshold on the statistics. The value 1 yields a slightly more conservative procedure ("knockoffs+") that controls the false discovery rate (FDR) according to the usual definition, while an offset of 0 controls a modified FDR.
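
The roles of `fdr` and `offset` can be illustrated with a minimal base-R sketch of the thresholding step. It follows the data-dependent threshold described in Barber and Candes (2015); the helper name `threshold_sketch` and the toy statistics `W` are hypothetical, and the package's own implementation may differ in details.

```r
# Hypothetical helper (not part of the knockoff package API): a sketch of the
# knockoff threshold for a vector W of importance statistics.
threshold_sketch <- function(W, fdr = 0.1, offset = 1) {
  stopifnot(offset %in% c(0, 1))
  # Candidate thresholds: the distinct nonzero magnitudes of W, in ascending order
  ts <- sort(unique(abs(W[W != 0])))
  # Estimated proportion of false discoveries at each candidate threshold
  ratio <- vapply(ts, function(t) {
    (offset + sum(W <= -t)) / max(1, sum(W >= t))
  }, numeric(1))
  ok <- which(ratio <= fdr)
  # Smallest threshold meeting the target; Inf means nothing is selected
  if (length(ok) > 0) ts[ok[1]] else Inf
}

# Toy statistics: large positive values suggest relevant variables
W <- c(5, 4, 3, 2, -1.5, 1, 0.5)

t_plus <- threshold_sketch(W, fdr = 0.25, offset = 1)  # "knockoffs+"
t_mod  <- threshold_sketch(W, fdr = 0.25, offset = 0)  # modified FDR

selected_plus <- which(W >= t_plus)
selected_mod  <- which(W >= t_mod)
```

On this toy input the offset of 1 gives the larger threshold and therefore the smaller selected set, which is the sense in which "knockoffs+" is more conservative.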

Details

This function creates the knockoffs, computes the importance statistics, and selects variables. It is the main entry point for the knockoff package.

The parameter knockoffs controls how knockoff variables are created. By default, the model-X scenario is assumed and a multivariate normal distribution is fitted to the original variables X. The estimated mean vector and covariance matrix are then used to generate second-order approximate Gaussian knockoffs. In general, the function knockoffs should take an n-by-p matrix of observed variables X as input and return an n-by-p matrix of knockoffs. Two default functions for creating knockoffs are provided with this package.

In the model-X scenario, if the rows of X are assumed to follow a multivariate Gaussian distribution with known parameters, the function create.gaussian can be used to generate Gaussian knockoffs, as shown in the examples below.

In the fixed-X scenario, one can create the knockoffs using the function create.fixed. This requires n ≥ p and it assumes that the response Y follows a homoscedastic linear regression model.
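
A fixed-X call might look like the following sketch. The simulated data and the choice of stat.glmnet_lambdasmax as the importance statistic are assumptions made for illustration, and the call is guarded so that the snippet still runs when the knockoff package is not installed.

```r
# Simulated data for illustration only; fixed-X knockoffs need n >= p
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(3, 5), rep(0, p - 5))     # first five variables are relevant
y <- as.numeric(X %*% beta + rnorm(n))  # homoscedastic linear model

if (requireNamespace("knockoff", quietly = TRUE)) {
  # Pass create.fixed as the knockoff constructor; the statistic used here
  # is one possible choice, not a requirement
  result <- knockoff::knockoff.filter(
    X, y,
    knockoffs = knockoff::create.fixed,
    statistic = knockoff::stat.glmnet_lambdasmax
  )
  print(result$selected)
}
```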

For more information about creating knockoffs, type ??create.

The default importance statistic is stat.glmnet_coefdiff. For a complete list of the statistics provided with this package, type ??stat.

It is possible to provide custom functions for the knockoff constructions or the importance statistics. Some examples can be found in the vignette.
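
As a rough illustration of a custom importance statistic, the sketch below compares the absolute inner product of each original variable with the response against that of its knockoff. The function name `stat_marginal_sketch` is hypothetical, and the (X, Xk, y) signature is assumed from the custom statistic in the Examples section; swapping a variable with its knockoff flips the sign of the corresponding statistic, which is the property the filter relies on.

```r
# Hypothetical custom statistic (not provided by the package):
# W_j = |X_j' y| - |Xk_j' y|
stat_marginal_sketch <- function(X, Xk, y) {
  y <- as.numeric(y)
  abs(as.numeric(crossprod(as.matrix(X), y))) -
    abs(as.numeric(crossprod(as.matrix(Xk), y)))
}

# Toy data: independent standard Gaussian predictors, so an independent
# standard Gaussian copy serves as the knockoff matrix in this toy setting
set.seed(42)
n <- 50; p <- 10
X  <- matrix(rnorm(n * p), n, p)
Xk <- matrix(rnorm(n * p), n, p)
y  <- as.numeric(X[, 1:3] %*% rep(3, 3) + rnorm(n))

W <- stat_marginal_sketch(X, Xk, y)

# Passing the custom statistic to the filter, if the package is installed
if (requireNamespace("knockoff", quietly = TRUE)) {
  result <- knockoff::knockoff.filter(X, y, statistic = stat_marginal_sketch)
  print(result$selected)
}
```

A custom knockoff constructor can be passed through the knockoffs argument in the same way, as long as it maps an n-by-p matrix to an n-by-p matrix.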

Value

An object of class "knockoff.result". This object is a list containing at least the following components:

`X`	matrix of original variables
`Xk`	matrix of knockoff variables
`statistic`	computed test statistics
`threshold`	computed selection threshold
`selected`	named vector of selected variables

References

Candes et al., Panning for Gold: Model-free Knockoffs for High-dimensional Controlled Variable Selection, arXiv:1610.02351 (2016). https://web.stanford.edu/group/candes/knockoffs/index.html

Barber and Candes, Controlling the false discovery rate via knockoffs. Ann. Statist. 43 (2015), no. 5, 2055–2085.

Examples

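library(knockoff)

# Simulate n observations of p independent standard Gaussian predictors,
# of which k have a nonzero effect on the response
set.seed(2022)
p=100; n=80; k=15
mu = rep(0,p); Sigma = diag(p)   # true mean and covariance of the rows of X
X = matrix(rnorm(n*p),n)
nonzero = sample(p, k)
beta = 3.5 * (1:p %in% nonzero)
y = X %*% beta + rnorm(n)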

# Basic usage with default arguments
result = knockoff.filter(X, y)
print(result$selected)

# Advanced usage with custom arguments:
# Gaussian knockoffs with known parameters and a 5-fold cross-validated lasso statistic
knockoffs = function(X) create.gaussian(X, mu, Sigma)
k_stat = function(X, Xk, y) stat.glmnet_coefdiff(X, Xk, y, nfolds=5)
result = knockoff.filter(X, y, knockoffs=knockoffs, statistic=k_stat)
print(result$selected)

[Package knockoff version 0.3.6 Index]