tpr.dag {HEMDAG}    R Documentation
TPR-DAG ensemble variants
Description
Collection of the true-path-rule-based hierarchical learning ensemble algorithms and their variants.
TPR-DAG is a family of algorithms whose variants are defined by the choice of the bottom-up step adopted for the selection of the positive children (or descendants) and of the top-down step adopted to assure ontology-aware predictions. Indeed, in their more general form the TPR-DAG algorithms adopt a two-step learning strategy: in the first step they compute a per-level bottom-up visit from the leaves to the root to propagate the positive predictions across the hierarchy; in the second step they compute a per-level top-down visit from the root to the leaves to assure the consistency of the predictions. It is worth noting that the levels (in both steps) are defined in terms of the maximum distance from the root node (see graph.levels).
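For instance, the per-level structure used by both steps can be inspected with graph.levels; a minimal sketch, assuming the example graphNEL object g shipped with HEMDAG:

library(HEMDAG);
data(graph);
root <- root.node(g);
lev <- graph.levels(g, root); ## nodes grouped by maximum distance from the root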
Usage
tpr.dag(
S,
g,
root = "00",
positive = "children",
bottomup = "threshold.free",
topdown = "gpav",
t = 0,
w = 0,
W = NULL,
parallel = FALSE,
ncores = 1
)
Arguments
S: a named flat scores matrix with examples on rows and classes on columns.

g: a graph of class graphNEL representing the hierarchy of the classes.

root: name of the class that is on the top-level of the hierarchy (def. root="00").

positive: choice of the positive nodes to be considered in the bottom-up strategy. Can be one of the following values:
- children (def.): for each node its positive children are considered;
- descendants: for each node its positive descendants are considered.

bottomup: strategy to enhance the flat predictions by propagating the positive predictions from leaves to root. It can be one of the following values:
- threshold.free (def.): positive nodes are selected on the basis of the threshold-free strategy;
- threshold: positive nodes are selected on the basis of the threshold strategy;
- weighted.threshold.free: positive nodes are selected on the basis of the weighted threshold-free strategy;
- weighted.threshold: positive nodes are selected on the basis of the weighted threshold strategy;
- tau: positive nodes are selected on the basis of the tau strategy; this variant is specific to DESCENS, so set positive="descendants" when bottomup="tau".

topdown: strategy to make scores “hierarchy-aware”. It can be one of the following values:
- htd: the HTD-DAG strategy is applied;
- gpav (def.): the GPAV strategy is applied.

t: threshold for the choice of the positive nodes (def. t=0). Set t only for the variants requiring a threshold for the selection of the positive nodes; otherwise leave t=0.

w: weight to balance between the contribution of the node and that of its positive nodes (def. w=0). Set w only for the weighted variants; otherwise leave w=0.

W: vector of weights relative to a single example (def. W=NULL). If W=NULL, a unitary weight vector of length equal to the number of columns of S is assumed. Used by the GPAV top-down step.

parallel: a boolean value:
- TRUE: run the parallel implementation of the GPAV top-down step;
- FALSE (def.): run the sequential implementation.
Use parallel only if topdown="gpav"; otherwise set parallel=FALSE.

ncores: number of cores to use for the parallel execution (def. ncores=1). Set ncores=1 if parallel=FALSE; otherwise set it to the desired number of cores.
Details
The vanilla TPR-DAG adopts a per-level bottom-up traversal of the DAG to correct the flat predictions \hat{y}_i according to the following formula:

\bar{y}_i := \frac{1}{1 + |\phi_i|} \left( \hat{y}_i + \sum_{j \in \phi_i} \bar{y}_j \right)

where \phi_i are the positive children of node i.
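As an illustration, here is a minimal sketch of this per-node correction; the function and argument names are hypothetical and not part of the HEMDAG API:

## corrected score of node i from its flat score and the already-corrected
## scores of its positive children (empty vector if i has no positive children)
tpr.correct <- function(flat.score, pos.children) {
    (flat.score + sum(pos.children)) / (1 + length(pos.children));
}
tpr.correct(0.4, c(0.7, 0.6)); ## 0.5666667: the positive children pull the score up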
Different strategies to select the positive children \phi_i can be applied:

- threshold-free strategy: the positive nodes are those children that can increment the score of the node i, that is those nodes that achieve a score higher than that of their parents (see the sketch after this list):

  \phi_i := \{ j \in child(i) \mid \bar{y}_j > \hat{y}_i \}

- threshold strategy: the positive children are selected on the basis of a threshold that can be selected in two different ways:

  1. for each node a constant threshold \bar{t} is a priori selected:

     \phi_i := \{ j \in child(i) \mid \bar{y}_j > \bar{t} \}

     For instance, if the predictions represent probabilities, it could be meaningful to a priori select \bar{t} = 0.5.

  2. the threshold is selected to maximize some performance metric \mathcal{M} estimated on the training data, as for instance the Fmax or the AUPRC. In other words, the threshold is selected to maximize some measure of accuracy \mathcal{M}(j, t) of the predictions on the training data for the class j with respect to the threshold t. The corresponding set of positives for each i \in V is:

     \phi_i := \{ j \in child(i) \mid \bar{y}_j > t_j^*, \; t_j^* = \arg\max_t \mathcal{M}(j, t) \}

     For instance, t_j^* can be selected from a set of values t \in (0, 1) through internal cross-validation techniques.
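A small sketch of the two selection rules above; the vector names are hypothetical, while in tpr.dag the rules correspond to bottomup="threshold.free" and bottomup="threshold" with threshold t:

children.scores <- c(a = 0.55, b = 0.30, c = 0.72); ## bottom-up scores of child(i)
parent.score <- 0.50;                               ## flat score of node i

## threshold-free: keep the children scoring higher than their parent
phi.free <- children.scores[children.scores > parent.score]; ## a, c

## constant threshold: keep the children scoring higher than an a priori t, e.g. 0.5
t.bar <- 0.5;
phi.thr <- children.scores[children.scores > t.bar];          ## a, c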
The weighted TPR-DAG version can be designed by adding a weight w \in [0, 1] to balance between the contribution of the node i and that of its positive children \phi_i, through their convex combination:

\bar{y}_i := w \hat{y}_i + \frac{1 - w}{|\phi_i|} \sum_{j \in \phi_i} \bar{y}_j

If w = 1, no weight is attributed to the children and the TPR-DAG reduces to the HTD-DAG algorithm, since in this way only the prediction for node i is used in the bottom-up step of the algorithm. If w = 0, only the predictors associated to the children nodes vote to predict node i. In the intermediate cases we attribute more importance either to the predictor for the node i or to its children depending on the value of w.
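For example, a call sketch using the weighted threshold-free strategy with an even balance between a node and its positive children; it assumes the example data g, S and root from the Examples section below are loaded:

S.wtf <- tpr.dag(S, g, root, positive="children", bottomup="weighted.threshold.free",
                 topdown="gpav", t=0, w=0.5, W=NULL, parallel=FALSE, ncores=1);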
By combining the weighted and the threshold variant, we design the weighted-threshold variant.
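Correspondingly, a sketch of a weighted-threshold call, where both the threshold t and the weight w are set (illustrative values):

S.wt <- tpr.dag(S, g, root, positive="children", bottomup="weighted.threshold",
                topdown="gpav", t=0.5, w=0.5, W=NULL, parallel=FALSE, ncores=1);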
Since the contribution of the descendants of a given node decays exponentially with their distance from the node itself, to enhance the contribution of the most specific nodes to the overall decision of the ensemble we design the ensemble variant DESCENS. The novelty of DESCENS consists in strongly considering the contribution of all the descendants of each node instead of only that of its children. Therefore DESCENS predictions are more influenced by the information embedded in the leaf nodes, which are the classes containing the most informative and meaningful information from a biological and medical standpoint.
For the choice of the “positive” descendants we use the same strategies adopted for the selection of the “positive” children shown above. Furthermore, we designed a variant specific only to DESCENS, which we named DESCENS-\tau. The DESCENS-\tau variant balances the contribution between the “positive” children of a node i and that of its “positive” descendants, excluding its children, by adding a weight \tau \in [0, 1]:

\bar{y}_i := \frac{\tau}{1 + |\phi_i|} \left( \hat{y}_i + \sum_{j \in \phi_i} \bar{y}_j \right) + \frac{1 - \tau}{1 + |\delta_i|} \left( \hat{y}_i + \sum_{j \in \delta_i} \bar{y}_j \right)

where \phi_i are the “positive” children of i and \delta_i = \Delta(i) \setminus child(i) the “positive” descendants of i without its children. If \tau = 1 we consider only the contribution of the “positive” children of i; if \tau = 0 only the descendants that are not children contribute to the score, while for intermediate values of \tau we can balance the contribution of \phi_i and \delta_i positive nodes.
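A call sketch for DESCENS-\tau; as noted in the argument list, the tau strategy requires positive="descendants", and here the parameter t is assumed to carry the \tau weight:

S.tau <- tpr.dag(S, g, root, positive="descendants", bottomup="tau",
                 topdown="gpav", t=0.5, w=0, W=NULL, parallel=FALSE, ncores=1);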
Simply by replacing the HTD-DAG top-down step (htd) with the GPAV approach (gpav) we design the ISO-TPR variant. The most important feature of ISO-TPR is that it maintains the hierarchical constraints by construction and selects the solution closest (in the least-squares sense) to the bottom-up predictions that obeys the True Path Rule.
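For comparison, a sketch running the same bottom-up strategy with either top-down step (assumes the example data are loaded):

S.iso <- tpr.dag(S, g, root, positive="children", bottomup="threshold.free",
                 topdown="gpav", t=0, w=0, W=NULL, parallel=FALSE, ncores=1); ## ISO-TPR
S.htd <- tpr.dag(S, g, root, positive="children", bottomup="threshold.free",
                 topdown="htd", t=0, w=0, W=NULL, parallel=FALSE, ncores=1);  ## TPR-DAG + HTD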
Value
A named matrix with the scores of the classes corrected according to the chosen TPR-DAG
ensemble algorithm.
See Also
graph.levels, htd, gpav
Examples
data(graph);
data(scores);
data(labels);
root <- root.node(g);
S.tpr <- tpr.dag(S, g, root, positive="children", bottomup="threshold.free",
topdown="gpav", t=0, w=0, W=NULL, parallel=FALSE, ncores=1);