sim_bound {bandsfdp} | R Documentation |
Simultaneous Band
Description
This function computes upper prediction bounds on the FDP among the target
wins in the top k hypotheses of target-decoy competition (TDC), for each
k = 1,\ldots,n, where n is the total number of hypotheses.
Usage
sim_bound(
  labels,
  gamma,
  type,
  d_max = NULL,
  max_fdp = 0.5,
  c = 0.5,
  lambda = 0.5
)

simband(
  labels,
  gamma,
  type,
  d_max = NULL,
  max_fdp = 0.5,
  c = 0.5,
  lambda = 0.5
)
Arguments
labels |
A vector of (ordered) labels. See details below. |
gamma |
The confidence parameter of the band. Typical values include
|
type |
A character string specifying which band to use. Must be one of
|
d_max |
An optional positive integer specifying the maximum number of decoy wins considered in calculating the bands. |
max_fdp |
A number specifying the maximum FDP considered by the user in
calculating the bands. Used to compute |
c |
Determines the ranks of the target score that are considered
winning. Defaults to |
lambda |
Determines the ranks of the target score that are
considered losing. Defaults to |
Details
In (single-decoy) TDC, each hypothesis is associated with a
winning score and a label (1 for a target win, -1 for a decoy win). This
function assumes that the hypotheses are sorted in decreasing order of
winning score (with ties broken at random). The argument labels must
therefore be ordered according to this rule.
This function also supports the extension of TDC that uses multiple
decoys. In that setup, each target score competes with multiple decoy
scores, and the rank of the target score after competition determines whether
the hypothesis is a target win (label = 1), a decoy win (-1) or uncounted (0).
The top c proportion of ranks are considered winning, the bottom 1-lambda
proportion losing, and all the rest uncounted.
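The rank-to-label rule above can be sketched as follows. This is only an illustration, not the package's implementation: label_from_rank is a hypothetical helper, and the conventions that rank 1 is the best score and that ties are ignored are assumptions made for the sketch.

```r
# Hypothetical helper (not part of the package): assign a label to one
# hypothesis from its target score and its decoy scores.
label_from_rank <- function(target, decoys, c = 0.5, lambda = 0.5) {
  n_scores <- length(decoys) + 1
  # Assumed convention: rank 1 is the largest score; ties are ignored.
  rank_target <- sum(decoys > target) + 1
  # Proportion of rank positions at or above the target's rank.
  frac <- rank_target / n_scores
  if (frac <= c) {
    1L    # target win: rank falls in the top c proportion
  } else if (frac > lambda) {
    -1L   # decoy win: rank falls in the bottom 1 - lambda proportion
  } else {
    0L    # uncounted
  }
}

label_from_rank(5, c(6, 1, 2), c = 0.25, lambda = 0.5)
```

With a single decoy and c = lambda = 0.5, this reduces to the single-decoy rule: the label is 1 when the target score beats the decoy score and -1 otherwise.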
The rejection threshold of TDC at level \alpha is given by the formula
(assuming the hypotheses are ordered):
\max\left\{k : \frac{D_k + 1}{T_k \vee 1} \cdot \frac{c}{1-\lambda} \leq \alpha\right\}
where T_k is the number of target wins among the top k hypotheses, D_k is
the number of decoy wins among the same top k hypotheses, and \alpha is the
target FDR level.
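As a minimal sketch of the threshold formula above, the following computes it directly from an ordered label vector. The function name tdc_threshold and its alpha argument are hypothetical and are not part of this package; returning 0 when no index satisfies the inequality is also an assumption of the sketch.

```r
# Hypothetical helper (not part of the package): largest k such that
# (D_k + 1) / max(T_k, 1) * c / (1 - lambda) <= alpha.
tdc_threshold <- function(labels, alpha, c = 0.5, lambda = 0.5) {
  T_k <- cumsum(labels == 1)    # target wins among the top k
  D_k <- cumsum(labels == -1)   # decoy wins among the top k
  est <- (D_k + 1) / pmax(T_k, 1) * c / (1 - lambda)
  ok <- which(est <= alpha)
  # Assumption: report 0 when no k satisfies the inequality.
  if (length(ok) == 0) 0L else max(ok)
}

toy_labels <- c(1, 1, 1, 1, -1, 1, 1, 1, 1, 1)
tdc_threshold(toy_labels, alpha = 0.3)
```

Because the formula takes the maximum over k, the threshold can lie beyond an index where the estimate temporarily rises above alpha.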
The argument gamma sets a confidence level of 1-gamma. Both the uniform
and standardized bands require pre-computed Monte Carlo statistics, so only
certain values of gamma are available. Commonly used confidence levels,
such as 0.95 and 0.99, are available. See the README of this package for
more details.
The argument d_max controls the rate at which the returned bounds
increase: a larger d_max results in a more conservative bound.
If, however, D_k + 1 exceeds d_max for some index k, each target
win thereafter is considered a false discovery when computing the bound.
It is therefore important that d_max, chosen a priori, is large enough.
Provided it is sufficiently large, the precise value of d_max does not have a
significant effect on the resulting bounds (see https://arxiv.org/abs/2302.11837 for more details).
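To see where a given d_max would start to bite, one can locate the first index at which D_k + 1 exceeds it. The sketch below uses toy labels and an illustrative d_max of 3; it only inspects the labels and does not reproduce the band computation itself.

```r
# Toy labels (1 = target win, -1 = decoy win), ordered by decreasing score.
toy_labels <- c(1, 1, -1, 1, -1, 1, -1, 1, 1)
d_max <- 3  # illustrative value only

D_k <- cumsum(toy_labels == -1)  # decoy wins among the top k

# First index k at which D_k + 1 exceeds d_max (NA if it never does).
first_exceed <- which(D_k + 1 > d_max)[1]

# Number of target wins after that index; per the text above, these are
# the ones treated as false discoveries when computing the bound.
n_after <- if (is.na(first_exceed)) {
  0L
} else {
  sum(toy_labels[-seq_len(first_exceed)] == 1)
}
```

If first_exceed falls within the range of k that matters for the analysis, that suggests d_max was chosen too small.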
We recommend setting d_max = NULL so that it is computed automatically
using max_fdp. This choice ensures that D_k + 1 never exceeds d_max
when the (non-interpolated) FDP bound on the top k hypotheses is less
than max_fdp.
Value
A vector of upper prediction bounds on the FDP of target wins among
the top k hypotheses, for each k = 1,\ldots,n, where n is the total
number of hypotheses.
References
Ebadi et al. (2022). Bounding the FDP in competition-based control of the FDR. https://arxiv.org/abs/2302.11837
Examples
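if (requireNamespace("fdpbandsdata", quietly = TRUE)) {
  set.seed(123)
  # Simulated labels, ordered by decreasing winning score: the probability
  # of a decoy win (-1) rises from 0 to 0.9 across four blocks of 250.
  labels <- c(
    rep(1, 250),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.9, 0.1)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.5, 0.5)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.1, 0.9))
  )
  gamma <- 0.05  # confidence level 1 - gamma = 0.95
  # Show the bounds for the first six values of k.
  head(sim_bound(labels, gamma, "stband"))
}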