verif_OT {OTrecod} | R Documentation |
verif_OT()
Description
This function provides post-processing checks after data fusion by optimal transportation algorithms.
Usage
verif_OT(
ot_out,
group.class = FALSE,
ordinal = TRUE,
stab.prob = FALSE,
min.neigb = 1
)
Arguments
ot_out |
an otres object from |
group.class |
a boolean indicating if the results related to the proximity between outcomes by grouping levels are requested in output (FALSE by default). |
ordinal |
a boolean that indicates if |
stab.prob |
a boolean indicating if the results related to the stability of the algorithm are requested in output (FALSE by default). |
min.neigb |
a value indicating the minimal required number of neighbors to consider in the estimation of stability (1 by default). |
Details
In a context of data fusion, where information from the same target population is summarized via two specific variables Y and Z (two ordinal or nominal factors with different numbers of levels n_Y and n_Z), never jointly observed and respectively stored in two distinct databases A and B, Optimal Transportation (OT) algorithms (see the models OUTCOME, R_OUTCOME, JOINT, and R_JOINT of reference (2) for more details) propose methods for recoding Y in B and/or Z in A.
Outputs from the functions OT_outcome and OT_joint thus provide the related predictions of Y in B and/or Z in A, and from these results, the function verif_OT provides a set of tools (optional or not, depending on the user's input choices) to estimate:
the association between Y and Z after recoding
the similarities between observed and predicted distributions
the stability of the predictions proposed by the algorithm
A. PAIRWISE ASSOCIATION BETWEEN Y AND Z
The first step uses standard criteria (Cramér's V and Spearman's rank correlation coefficient) to evaluate associations between two ordinal variables in both databases or in only one database.
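As an illustration of the two criteria named above, the following base R sketch computes Cramér's V and Spearman's rank correlation on a small toy sample. The vectors `y` and `z` and the helper `cramer_v` are hypothetical names introduced here; this is not the code used internally by verif_OT.

```r
# Toy data (hypothetical values): two ordinal factors observed on the same rows
y <- factor(c(1, 1, 2, 2, 3, 3, 1, 2, 3, 3), levels = 1:3, ordered = TRUE)
z <- factor(c(1, 2, 2, 3, 4, 5, 1, 3, 5, 4), levels = 1:5, ordered = TRUE)

# Cramer's V derived from the chi-squared statistic of the contingency table
# (hypothetical helper, written for this example only)
cramer_v <- function(a, b) {
  tab  <- table(a, b)
  # Small expected counts trigger a warning that is irrelevant here
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k    <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (sum(tab) * k)))
}

v   <- cramer_v(y, z)                                           # between 0 and 1
rho <- cor(as.integer(y), as.integer(z), method = "spearman")   # between -1 and 1
c(cramer_v = v, spearman = rho)
```

Cramér's V ignores the ordering of the levels, while Spearman's coefficient uses it, so reading both together gives a slightly fuller picture for ordinal factors.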
When the argument group.class = TRUE, this information can be supplemented by the results of the function error_group, which is directly integrated in the function verif_OT.
Assuming that n_Y > n_Z, and that one of the two scales of Y or Z is unknown, this function gives additional information about the potential link between the levels of the unknown scale.
The function obtains this result in two steps. First, error_group groups combinations of levels of Y to build all possible variables Y' verifying n_Y' = n_Z.
Second, the function studies how the association of Z with each new variable Y' fluctuates, using suitable comparison criteria (see the documentation of error_group for more details).
If grouping successive classes of Y improves the initial association between Y and Z, then it is possible to conclude in favor of an ordinal coding for Y (rather than nominal), and also to emphasize the consistency of the predictions proposed by the fusion algorithm.
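The grouping idea can be sketched in base R as follows. This is a simplified illustration, not the implementation of error_group: it only enumerates groupings of consecutive levels, and the vectors `fine` and `coarse` as well as the helper `cramer_v` are hypothetical.

```r
# Hypothetical toy data: `fine` has 5 ordered levels, `coarse` has 3
fine   <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 3, 5, 2, 4)
coarse <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 3)

# Cramer's V computed directly from observed and expected counts
cramer_v <- function(a, b) {
  tab      <- table(a, b)
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  chi2     <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Cutting 5 ordered levels into 3 blocks of consecutive levels amounts to
# choosing 2 cut positions among the 4 gaps between levels
cuts <- combn(4, 2)

res <- data.frame(
  cut1 = cuts[1, ],
  cut2 = cuts[2, ],
  # findInterval() maps each original level to its block (0, 1 or 2)
  V = apply(cuts, 2, function(cp) cramer_v(findInterval(fine, cp + 1), coarse))
)

# Groupings sorted by decreasing association with the 3-level variable
res[order(-res$V), ]
```

Under these assumptions, a grouping of neighboring levels that clearly raises the association is the kind of pattern the text interprets as evidence for an ordinal coding.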
B. SIMILARITIES BETWEEN OBSERVED AND PREDICTED DISTRIBUTIONS
When the predictions of Y in B and/or Z in A are available in the datab argument, the similarities between the observed and predicted probabilistic distributions of Y and/or Z are quantified using the Hellinger distance (see (1)).
This measure varies between 0 and 1: a value of 0 corresponds to a perfect similarity while a value close to 1 (the maximum) indicates a great dissimilarity.
Using this distance, two distributions are considered close when the observed value is below 0.05.
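For reference, a minimal base R sketch of the Hellinger distance between two discrete distributions is given below, using the standard definition H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)). The function name `hellinger` and the two example distributions are hypothetical and only serve to illustrate the 0.05 rule of thumb.

```r
# Hellinger distance between two discrete distributions given as
# vectors of counts or probabilities over the same levels
hellinger <- function(p, q) {
  stopifnot(length(p) == length(q))
  p <- p / sum(p)   # normalise to probabilities
  q <- q / sum(q)
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Hypothetical observed and predicted distributions over three levels
obs  <- c(0.30, 0.45, 0.25)
pred <- c(0.32, 0.43, 0.25)

h <- hellinger(obs, pred)
h < 0.05                     # TRUE: the two distributions are considered close

hellinger(obs, obs)          # identical distributions give 0
hellinger(c(1, 0), c(0, 1))  # disjoint supports give the maximum, 1
```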
C. STABILITY OF THE PREDICTIONS
These results are based on the decision rule which defines the stability of an algorithm in A (or B) as its average ability to assign the same prediction of Z (or Y) to individuals that have the same given profile of covariates X and the same given level of Y (or Z respectively).
Assuming that the missing information of Z in base A was predicted by an OT algorithm (the reasoning is identical for the prediction of Y in B, see (2) and (3) for more details), the function verif_OT uses the conditional probabilities stored in the object estimatorZA (see outputs of the functions OT_outcome and OT_joint), which contains the estimates of all the conditional probabilities of Z in A, given a profile of covariates x and given a level of Y = y.
Indeed, each individual (or row) from A is associated with a conditional probability P(Z = z | Y = y, X = x), and averaging all the corresponding estimates provides an indicator of the stability of the predictions.
The function OT_joint provides the individual prediction of Z for each subject i of A according to the maximum a posteriori rule.
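The maximum a posteriori rule can be illustrated with a short base R sketch: for each subject, the predicted level is the one with the highest estimated conditional probability. The matrix `prob` below is a hypothetical toy example; it does not reproduce the actual structure of the estimatorZA object.

```r
# Hypothetical conditional probabilities of each level of Z (columns)
# for four subjects (rows), given their covariate profile and level of Y
prob <- matrix(
  c(0.70, 0.20, 0.10,
    0.10, 0.60, 0.30,
    0.25, 0.25, 0.50,
    0.40, 0.35, 0.25),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("subject", 1:4), c("z1", "z2", "z3"))
)

# Maximum a posteriori rule: keep the most probable level in each row
map_index <- apply(prob, 1, which.max)
z_hat     <- colnames(prob)[map_index]

# Probability attached to each prediction, and its average over subjects
p_hat <- prob[cbind(seq_len(nrow(prob)), map_index)]
data.frame(subject = rownames(prob), z_hat = z_hat, p_hat = p_hat)
mean(p_hat)
```

Averaging the probabilities attached to the retained predictions, as in the last line, mirrors the stability indicator described in the text.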
The function OT_outcome directly deduces the individual predictions from the probabilities computed in the second part of the algorithm (see (3)).
It is nevertheless common for conditional probabilities to be estimated from covariate profiles that are too rare to give reliable estimates.
In this context, the use of trimmed means and standard deviations is suggested, obtained by removing the corresponding probabilities from the final computation.
To this end, the function outputs a table (eff.neig object) giving the frequency of these critical probabilities, to help the user choose a threshold.
Based on this table, a minimal number of profiles can be imposed for a conditional probability to be part of the final computation by setting the min.neigb argument.
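The trimming step can be sketched as follows in base R. The data frame `stab`, its column names, and the use of a "greater than or equal to" comparison are assumptions made for this illustration; they are not taken from the internals of verif_OT.

```r
# Hypothetical per-subject results: probability attached to the prediction
# and number of neighbors (subjects sharing the profile) used to estimate it
stab <- data.frame(
  prob     = c(1.00, 0.50, 0.80, 0.75, 0.90, 1.00, 0.60),
  n_neighb = c(1,    2,    12,   8,    20,   1,    5)
)

# Count of conditional probabilities by number of neighbors
# (similar in spirit to the eff.neig table)
table(stab$n_neighb)

# Hypothetical threshold, playing the role of the min.neigb argument
min_neigb <- 5
kept <- stab[stab$n_neighb >= min_neigb, ]

# Trimmed summary: probabilities estimated from too few neighbors are excluded
trimmed <- c(n = nrow(kept), mean = mean(kept$prob), sd = sd(kept$prob))
trimmed
```

In this toy example, the probabilities equal to 1 that rest on a single neighbor are dropped, which prevents them from inflating the average.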
Note that these results are optional and available only if the argument stab.prob = TRUE.
When the predictions of Z in A and Y in B are available, the function verif_OT outputs both global results and results by database.
The res.stab table can contain NA values with OT_outcome output when shared variables are incomplete: this problem appears when the prox.dist argument is set to 0 and can be solved simply by increasing this value.
Value
A list of 7 objects is returned:
nb.profil |
the number of profiles of covariates |
conf.mat |
the global confusion matrix between Y and Z |
res.prox |
a summary table related to the association measures between Y and Z |
res.grp |
a summary table related to the study of the proximity of Y and Z |
hell |
Hellinger distances between observed and predicted distributions |
eff.neig |
a table which corresponds to a count of conditional probabilities according to the number of neighbors used in their computation (only the first ten values) |
res.stab |
a summary table related to the stability of the algorithm |
Author(s)
Gregory Guernec
References
(1) Liese F, Miescke K-J (2008). Statistical Decision Theory: Estimation, Testing, and Selection. Springer.
(2) Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
(3) Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
See Also
OT_outcome, OT_joint, proxim_dist, error_group
Examples
### Example 1
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criteria
# - When Y and Z are predicted in B and A respectively
# - Using an outcome model (individual assignment with knn)
#-----
data(simu_data)
outc1 <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "G", percent.knn = 0.90, maxrelax = 0,
  convert.num = 8, convert.class = 3,
  indiv.method = "sequential", which.DB = "BOTH", prox.dist = 0.30
)
verif_outc1 <- verif_OT(outc1)
verif_outc1
### Example 2
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criteria and studying
#   associations by grouping levels of Z
# - When only Y is predicted in B
# - Tolerated distance between a subject and a profile: 0.30 * distance max
# - Using an outcome model (individual assignment with knn)
#-----
data(simu_data)
outc2 <- OT_outcome(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  dist.choice = "G", percent.knn = 0.90, maxrelax = 0, prox.dist = 0.3,
  convert.num = 8, convert.class = 3,
  indiv.method = "sequential", which.DB = "B"
)
verif_outc2 <- verif_OT(outc2, group.class = TRUE, ordinal = TRUE)
verif_outc2
### Example 3
#-----
# - Using the data simu_data
# - Studying the proximity between Y and Z using standard criteria and studying
#   associations by grouping levels of Z
# - Studying the stability of the conditional probabilities
# - When only Y is predicted in B (reusing outc2 from Example 2)
# - Using an outcome model (individual assignment with knn)
#-----
verif_outc2b <- verif_OT(outc2,
  group.class = TRUE, ordinal = TRUE,
  stab.prob = TRUE, min.neigb = 5
)
verif_outc2b