select_pred {OTrecod} | R Documentation |
select_pred()
Description
Selection of a subset of non-collinear predictors having relevant relationships with a given target outcome, using a random forest procedure.
Usage
select_pred(
databa,
Y = NULL,
Z = NULL,
ID = 1,
OUT = "Y",
quanti = NULL,
nominal = NULL,
ordinal = NULL,
logic = NULL,
convert_num = NULL,
convert_class = NULL,
thresh_cat = 0.3,
thresh_num = 0.7,
thresh_Y = 0.2,
RF = TRUE,
RF_ntree = 500,
RF_condi = FALSE,
RF_condi_thr = 0.2,
RF_SEED = sample(1:1e+06, 1)
)
Arguments
databa |
a data.frame with a column of identifiers (row identifiers, or database identifiers in the case of two concatenated databases), an outcome, and a set of predictors. The number of columns can exceed the number of rows. |
Y |
the label of a first target variable with quotes |
Z |
the label of a second target variable, with quotes, when databa is the result of two overlaid databases |
ID |
the column index of the database identifier (the first column by default) in the case of two concatenated databases, or of the row identifier otherwise |
OUT |
a character string that indicates the outcome to predict in the context of overlaid databases. By default, the outcome declared in the argument Y |
quanti |
a vector of integers corresponding to the column indexes of all the numeric predictors. |
nominal |
a vector of integers which corresponds to the column indexes of all the categorical nominal predictors. |
ordinal |
a vector of integers which corresponds to the column indexes of all the categorical ordinal predictors. |
logic |
a vector of integers indicating the column indexes of the logical predictors. No index by default (NULL). |
convert_num |
a vector of integers indicating the indexes of the quantitative variables to convert into ordered factors. No index by default (NULL). Each selected index must also be declared as quantitative in the argument quanti |
convert_class |
a vector of integers indicating the number of classes used for each conversion of a quantitative variable into an ordered factor. The length of this vector cannot exceed the length of the argument convert_num |
thresh_cat |
a threshold associated with Cramer's V coefficient (= 0.30 by default) |
thresh_num |
a threshold associated with Spearman's correlation coefficient (= 0.70 by default) |
thresh_Y |
a threshold linked to the RF approach, corresponding to the minimal cumulative percent of importance measure required for a predictor to be kept in the final list (= 0.20 by default). |
RF |
a boolean set to TRUE (default) if a random forest procedure must be applied to select the best subset of predictors of the outcome. Otherwise, only pairwise associations between predictors are used for the selection. |
RF_ntree |
the number of bootstrap samples drawn from the raw data source during the random forest procedure |
RF_condi |
a boolean specifying whether the conditional importance measures must be assessed from the random forest procedure (TRUE) or not (FALSE, by default) |
RF_condi_thr |
a threshold linked to (1 - pvalue) of an association test between each predictor |
RF_SEED |
an integer passed to set.seed() to initialize the random number generator (random integer by default). This value is only used by the RF method. |
Details
The select_pred function provides several tools to identify, on the one hand, the relationships between predictors, detecting in particular potential problems of collinearity, and, on the other hand, a parsimonious subset of relevant predictors (of the outcome) using appropriate random forest procedures.
The function, which can be used as a preliminary step of prediction in regression settings, is particularly suited to the context of data fusion: it provides relevant subsets of predictors (the matching variables) to algorithms dedicated to solving recoding problems.
A. REQUIRED STRUCTURE FOR THE DATABASE
The expected input database is a data.frame that requires a specific column of row identifiers and a target variable (or outcome) having a finite number of values or classes (ordinal, nominal or discrete type). Note that if the chosen outcome is numeric, it is automatically converted to ordinal type.
The number of predictors is not a constraint for select_pred (even if, with fewer than three variables, a variable selection process makes little sense), and can exceed the number of rows (high dimensionality is not a problem here).
The predictors can be continuous (quantitative), boolean, nominal or ordinal with or without missing values.
In the presence of numeric variables, users can discretize all or some of them beforehand by themselves. They can also use the internal process integrated in the function: two arguments, convert_num and convert_class, are dedicated to these transformations.
These options make the function select_pred particularly suited to the function OT_joint, which only accepts a data.frame with categorical covariates.
With the argument convert_num, users choose the continuous variables to convert, and the related argument convert_class specifies the number of classes chosen for each discretization.
This is why these two arguments must be two vectors of the same length. There is one exception: when convert_class is a scalar S, all the continuous predictors selected for conversion are discretized with the same number of classes S.
For example, if convert_class = 4, all the continuous variables specified in the convert_num argument are discretized by quartiles. Moreover, note that missing values of incomplete predictors are ignored during the conversion, and that each predictor specified in the argument convert_num must also be specified in the argument quanti.
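The discretization described above can be sketched in base R. This is only an illustration under the assumption that classes are cut at empirical quantiles (as the quartile example suggests); discretize_quantiles() is a hypothetical helper, not a function of the package, and the internal conversion of select_pred may differ in its details.

```r
# Hypothetical helper: cut a numeric vector into n_classes ordered classes
# at its empirical quantiles. Missing values stay missing.
discretize_quantiles <- function(x, n_classes) {
  probs  <- seq(0, 1, length.out = n_classes + 1)
  breaks <- unique(stats::quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE, ordered_result = TRUE)
}

# Illustrative data (not from the package)
age   <- c(23, 35, 41, 52, 60, 67, 74, NA, 29, 48)
age_q <- discretize_quantiles(age, 4)   # quartiles, as with convert_class = 4
table(age_q, useNA = "ifany")
```

Taking unique() of the breaks avoids an error from cut() when several quantiles coincide, at the cost of fewer classes than requested.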
When the input is a single database, the label of the outcome must be entered in the argument Y, and the arguments Z and OUT must keep their default values.
Finally, the positions of the identifier and outcome columns in the database do not matter.
For greater flexibility, the input database can also be the result of two overlaid databases.
In this case, the structure of the database must be similar to that of the datasets simu_data and tab_test available in the package, with a column of database identifiers, one target outcome per database (2 columns), and a subset of shared predictors.
Note that overlaying two separate databases can also be done easily beforehand using the function merge_dbs.
The labels of the two outcomes must be specified in the argument Y for the top database, and in Z for the bottom one.
Note also that the function select_pred deals with only one outcome at a time, which must be specified in the argument OUT: "Y" for the study of the top database or "Z" for the study of the bottom one.
Finally, whatever the structure of the input database, each column index of the database must be entered once (and only once) in one of the following four arguments: quanti, nominal, ordinal, logic.
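The "once and only once" rule can be checked before calling the function. The sketch below is a minimal illustration; check_type_indexes() is a hypothetical helper that is not part of the package, and the indexes are those used for simu_data (8 columns) in the Examples section.

```r
# Hypothetical helper: report column indexes declared twice, never declared,
# or falling outside the range of the database columns.
check_type_indexes <- function(n_col, quanti = NULL, nominal = NULL,
                               ordinal = NULL, logic = NULL) {
  declared <- c(quanti, nominal, ordinal, logic)
  list(
    duplicated   = unique(declared[duplicated(declared)]),
    missing      = setdiff(seq_len(n_col), declared),
    out_of_range = setdiff(declared, seq_len(n_col))
  )
}

# A valid declaration: the three components are empty
chk_ok <- check_type_indexes(8, quanti = c(3, 8), nominal = c(1, 4:5, 7),
                             ordinal = c(2, 6))

# An invalid one: column 4 is declared twice and column 5 is forgotten
chk_bad <- check_type_indexes(8, quanti = c(3, 8), nominal = c(1, 4, 7),
                              ordinal = c(2, 4, 6))
```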
B. PAIRWISE ASSOCIATIONS BETWEEN PREDICTORS
As a first step, select_pred calculates standard pairwise associations between predictors according to their types.
Between categorical predictors (ordinal, nominal and logical): Cramer's V (and the bias-corrected Cramer's V, see (1) for more details) is calculated between categorical predictors, and the argument thresh_cat fixes the associated threshold beyond which two predictors can be considered as redundant. A similar process is applied between the target variable and the subset of categorical variables, which provides in output a first table ranking the top scoring predictors. This table summarizes the ability of each variable to predict the target outcome.
Between continuous predictors: if the ordinal and logic arguments differ from NULL, all the corresponding predictors are first converted to rank values. For numeric (quantitative), logical and ordinal predictors, pairwise correlations between ranks (Spearman) are calculated, and the argument thresh_num fixes the related threshold beyond which two predictors can be considered as redundant. A similar process is applied between the outcome and the subset of discrete variables, which provides in output a table ranking the top scoring predictors and summarizing their ability to predict the target. In addition, the result of a Farrar and Glauber test is provided. This test is based on the determinant of the correlation matrix of the covariates, and its null hypothesis corresponds to an absence of collinearity between them (see (2) for more details about the method). In the presence of a large number of numeric covariates and/or ordered factors, the approximate Farrar-Glauber test, based on the normal approximation of the null distribution, is better suited, and its result is also provided in output. These two tests are highly sensitive; consequently, it is suggested to consider their results as simple indicators of collinearity between predictors rather than as an essential condition of acceptability.
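The association measures of this step can be sketched in base R. The code below is an illustration under stated assumptions, not the package's implementation: cramers_v() and farrar_glauber() are hypothetical helpers, the bias correction follows the formula of Bergsma (2013), the Farrar-Glauber statistic is the usual chi-square form computed here on a Spearman correlation matrix, and the data are simulated.

```r
# Hypothetical helper: Cramer's V between two categorical vectors
# (incomplete pairs are dropped), with optional bias correction.
cramers_v <- function(x, y, bias_corrected = FALSE) {
  ok   <- !is.na(x) & !is.na(y)
  tab  <- table(factor(x[ok]), factor(y[ok]))
  n    <- sum(tab)
  r    <- nrow(tab)
  k    <- ncol(tab)
  chi2 <- unname(suppressWarnings(
    stats::chisq.test(tab, correct = FALSE)$statistic))
  if (!bias_corrected) return(sqrt(chi2 / (n * (min(r, k) - 1))))
  phi2 <- max(0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
  r_c  <- r - (r - 1)^2 / (n - 1)
  k_c  <- k - (k - 1)^2 / (n - 1)
  sqrt(phi2 / min(r_c - 1, k_c - 1))
}

# Hypothetical helper: chi-square form of the Farrar-Glauber test,
# based on the determinant of the rank correlation matrix.
farrar_glauber <- function(num_df) {
  R    <- stats::cor(num_df, method = "spearman",
                     use = "pairwise.complete.obs")
  n    <- nrow(num_df)
  p    <- ncol(num_df)
  stat <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df   <- p * (p - 1) / 2
  c(statistic = stat, df = df,
    p.value = stats::pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data: x2 is nearly a copy of x1, x3 is independent
set.seed(1)
x1     <- rnorm(100)
num_df <- data.frame(x1 = x1, x2 = x1 + rnorm(100, sd = 0.1), x3 = rnorm(100))
rho    <- stats::cor(num_df, method = "spearman")
# Pairs of predictors exceeding a threshold of 0.70 (as with thresh_num)
redundant <- which(abs(rho) > 0.70 & upper.tri(rho), arr.ind = TRUE)
fg        <- farrar_glauber(num_df)

v_full <- cramers_v(c("a", "a", "b", "b"), c("u", "u", "v", "v"))
v_none <- cramers_v(c("a", "a", "b", "b"), c("u", "v", "u", "v"))
```

In this simulated case the pair (x1, x2) is flagged as redundant and the test rejects the absence of collinearity, which is the kind of signal the output tables are meant to expose.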
If the initial number of predictors is not too large, this information can be sufficient for the user to visualize potential problems of collinearity and to select a subset of predictors (RF = FALSE).
It is nevertheless often necessary to complete this visualization with an automatic selection process like the random forest approach (see Breiman 2001, for a better understanding of the method) available in the function select_pred (RF = TRUE).
C. RANDOM FOREST PROCEDURE
As a final step of the process, a random forest approach (RF, see (3)) is preferred here to regression models for two main reasons: RF methods allow the number of variables to exceed the number of rows, and remain applicable whatever the types of covariates considered.
The function select_pred integrates in its algorithm the functions cforest and varimp of the package party (Hothorn, 2006) and so gives access to their main arguments.
An RF approach generally provides two types of measures for estimating the mean variable importance of each covariate in the prediction of an outcome: the Gini importance and the permutation importance. These measurements must be used with caution, taking into account the following constraints:
The Gini importance criterion can produce bias in favor of continuous variables and variables with many categories. To avoid this problem, only the permutation criterion is available in the function.
The permutation importance criterion can overestimate the importance of highly correlated predictors.
The function select_pred proposes three different scenarios according to the types of predictors:
The first one consists in boiling down to a set of categorical variables (ordered or not) by discretizing all the continuous predictors beforehand, using the internal convert_num argument or another method, and then working with the conditional importance measures (RF_condi = TRUE), which give unbiased estimations. In the spirit of a partial correlation, the conditional importance measure of a variable X for the prediction of an outcome Y only uses the subset of variables most correlated with X for its computation. The argument RF_condi_thr, which corresponds exactly to the argument threshold of the function varimp, fixes the value that (1 - p-value) of the association test must exceed for a variable Z to be considered sufficiently correlated with X and used as an adjustment variable in the computation of the importance measure of X (in other words, Z is included in the conditioning for the computation, see (4) and (5) for more details). A threshold value of zero includes all variables in the computation of the conditional importance measure of each predictor X, while a larger threshold (< 1) only includes a subset of variables. Two remarks about this method: first, taking into account only subsets of predictors in the computation of the variable importance measures can save a substantial amount of execution time. Second, because this approach does not handle incomplete information, the method is only applied to complete data (incomplete rows are temporarily removed for the study).
The second possibility, still in the presence of mixed types of predictors, consists in running two successive RF procedures. The first one is used to select a unique candidate in each subset of correlated predictors (detected in section B), while the second one extracts the permutation measures from the remaining subset of uncorrelated predictors (RF_condi = FALSE, by default). This second possibility has the advantage of working in the presence of incomplete predictors.
The third scenario consists in running the function a first time without the RF process (RF = FALSE); then, depending on whether highly correlated predictors are present, users can remove redundant predictors manually and re-run the function with the remaining subset of non-collinear predictors to avoid potential biases introduced by the standard permutation measures.
The three scenarios finally lead to a list of uncorrelated predictors of the outcome, sorted in order of importance. The argument thresh_Y corresponds to the minimal percent of importance required (and fixed by the user) for a variable to be considered as a reliable predictor of the outcome.
Finally, because all random forest results are subject to random variation, users can check whether the same importance ranking is achieved by varying the random seed parameter (RF_SEED) or by increasing the number of trees (RF_ntree).
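The stability check suggested above can be sketched directly with the functions cforest and varimp of the package party. This is only an illustration: the iris dataset is a stand-in, rank_for_seed() is a hypothetical helper, the values of ntree and mtry are arbitrary, and the control settings used internally by select_pred are not assumed to match cforest_unbiased().

```r
# Hypothetical helper: ranking of predictors by (unconditional) permutation
# importance for a given seed. Passing conditional = TRUE and threshold = ...
# to varimp() would give the conditional measures instead.
rank_for_seed <- function(seed, ntree = 50) {
  set.seed(seed)
  fit <- party::cforest(Species ~ ., data = iris,
                        controls = party::cforest_unbiased(ntree = ntree,
                                                           mtry = 2))
  imp <- party::varimp(fit, conditional = FALSE)
  names(sort(imp, decreasing = TRUE))
}

# Run only when the package party is installed
if (requireNamespace("party", quietly = TRUE)) {
  rankings <- sapply(c(3023, 1, 42), rank_for_seed)
  print(rankings)   # one column per seed, best predictor in the first row
}
```

If the columns differ noticeably, increasing ntree is the usual first remedy before interpreting the ranking.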
Value
A list of 14 objects (if RF = TRUE) or 11 objects (only the first eleven objects if RF = FALSE) is returned:
seed |
the random seed used for the study |
outc |
the identifier of the outcome to predict |
thresh |
a summary of the different thresholds fixed for the study |
convert_num |
the labels of the continuous predictors transformed in categorical form |
DB_USED |
the final database used after potential transformations of predictors |
vcrm_OUTC_cat |
a table of pairwise associations between the outcome and the categorical predictors (Cramer's V) |
cor_OUTC_num |
a table of pairwise associations between the outcome and the continuous predictors (Rank correlation) |
vcrm_X_cat |
a table of pairwise associations between the categorical predictors (Cramer's V) |
cor_X_num |
a table of pairwise associations between the continuous predictors (Rank correlation) |
FG_test |
the results of the Farrar and Glauber tests, with and without approximation form |
collinear_PB |
a table of predictors with problems of collinearity according to the fixed thresholds |
drop_var |
the labels of predictors to drop after RF process (optional output: only if RF = TRUE) |
RF_PRED |
the table of variable importance measurements, conditional or not, according to the argument RF_condi |
RF_best |
the labels of the best predictors selected (optional output: only if RF = TRUE) |
Author(s)
Gregory Guernec
References
(1) Bergsma W. (2013). A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, 323–328.
(2) Farrar D, and Glauber R. (1968). Multicollinearity in regression analysis. Review of Economics and Statistics, 49, 92–107.
(3) Breiman L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
(4) Hothorn T, Buehlmann P, Dudoit S, Molinaro A, Van Der Laan M (2006). Survival Ensembles. Biostatistics, 7(3), 355–373.
(5) Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307
See Also
Examples
### Example 1
#-----
# - From two overlaid databases: using the table simu_data
# - Searching for the best predictors of "Yb1"
# - Using the raw database
# - The RF approaches are not required
#-----
data(simu_data)
sel_ex1 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = FALSE
)
### Example 2
#-----
# - With the same conditions as example 1
# - Searching for the best predictors of "Yb2"
#-----
sel_ex2 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Z",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = FALSE
)
### Example 3
#-----
# - With the same conditions as example 1
# - Using an RF approach to estimate the standard variable importance measures
#   and determine the best subset of predictors
# - Here a seed is required
#-----
sel_ex3 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = TRUE, RF_condi = FALSE, RF_SEED = 3023
)
### Example 4
#-----
# - With the same conditions as example 1
# - Using an RF approach to estimate the conditional variable importance measures
#   and determine the best subset of predictors
# - This approach requires converting the numeric variables: only "Age" here,
#   discretized in 3 levels
#-----
sel_ex4 <- select_pred(simu_data,
  Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Z",
  quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
  convert_num = 8, convert_class = 3,
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = TRUE, RF_condi = TRUE, RF_condi_thr = 0.60, RF_SEED = 3023
)
### Example 5
#-----
# - Starting with a single database
# - Same conditions as example 1
#-----
simu_A <- simu_data[simu_data$DB == "A", -3] # Base A
sel_ex5 <- select_pred(simu_A,
  Y = "Yb1",
  quanti = 7, nominal = c(1, 3:4, 6), ordinal = c(2, 5),
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = FALSE
)
### Example 6
#-----
# - Starting with a single database
# - Using an RF approach to estimate the conditional variable importance measures
#   and determine the best subset of predictors
# - This approach requires converting the numeric variables: only "Age" here,
#   discretized in 3 levels
#-----
simu_B <- simu_data[simu_data$DB == "B", -2] # Base B
sel_ex6 <- select_pred(simu_B,
  Y = "Yb2",
  quanti = 7, nominal = c(1, 3:4, 6), ordinal = c(2, 5),
  convert_num = 7, convert_class = 3,
  thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
  RF = TRUE, RF_condi = TRUE, RF_condi_thr = 0.60, RF_SEED = 3023
)