linComb {dtComb} | R Documentation |
Combine two diagnostic tests with several linear combination methods.
Description
The linComb
function calculates the combination
scores of two diagnostic tests selected among several linear combination
methods and standardization options.
Usage
linComb(
markers = NULL,
status = NULL,
event = NULL,
method = c("scoring", "SL", "logistic", "minmax", "PT", "PCL", "minimax", "TS"),
resample = c("none", "cv", "repeatedcv", "boot"),
nfolds = 5,
nrepeats = 3,
niters = 10,
standardize = c("none", "range", "zScore", "tScore", "mean", "deviance"),
ndigits = 0,
show.plot = TRUE,
direction = c("auto", "<", ">"),
conf.level = 0.95,
cutoff.method = c("CB", "MCT", "MinValueSp", "MinValueSe", "ValueSp", "ValueSe",
"MinValueSpSe", "MaxSp", "MaxSe", "MaxSpSe", "MaxProdSpSe", "ROC01", "SpEqualSe",
"Youden", "MaxEfficiency", "Minimax", "MaxDOR", "MaxKappa", "MinValueNPV",
"MinValuePPV", "ValueNPV", "ValuePPV", "MinValueNPVPPV", "PROC01", "NPVEqualPPV",
"MaxNPVPPV", "MaxSumNPVPPV", "MaxProdNPVPPV", "ValueDLR.Negative",
"ValueDLR.Positive", "MinPvalue", "ObservedPrev", "MeanPrev", "PrevalenceMatching"),
...
)
Arguments
markers |
a numeric data frame that includes the results of two diagnostic
tests
|
status |
a factor vector that includes the actual disease
status of the patients
|
event |
a character string that indicates the level of status
to be considered as the positive event
|
method |
a character string specifying the method used for
combining the markers.
Notations:
Before describing these methods,
let us first introduce some notation that will be used
throughout this section. Let
D_i, i = 1, 2, \ldots, n_1
be the marker values of the i\text{th} individual in the diseased group, where
D_i = (D_{i1}, D_{i2}), and
H_j, j = 1, 2, \ldots, n_2
be the marker values of the j\text{th} individual in the healthy group, where
H_j = (H_{j1}, H_{j2}).
Let
x_{i1} = c(D_{i1}, H_{j1}) be the values of the first marker and
x_{i2} = c(D_{i2}, H_{j2}) be the values of the second marker for the i\text{th}
individual, i = 1, 2, \ldots, n . Let
D_{i,min} = min(D_{i1}, D_{i2}), D_{i,max} = max(D_{i1}, D_{i2}) ,
H_{j,min} = min(H_{j1}, H_{j2}), H_{j,max} = max(H_{j1}, H_{j2}) , and let
c_i be the resulting combination score for the i\text{th} individual.
The available methods are:
-
Logistic Regression (logistic) : Combination score obtained
by fitting a logistic regression model as follows:
c_i = \left(\frac{e^ {\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}}}\right)
The combination score obtained from the fitted logistic regression model
is the predicted probability assigned to each observation,
i.e., the model’s fitted values.
-
Scoring based on Logistic Regression (scoring) : Combination score obtained using the
slope values of the corresponding logistic regression model; the slope values are rounded
to the number of digits given by the user.
c_i = \beta_1 x_{i1} + \beta_2 x_{i2}
-
Pepe & Thompson’s method (PT) : The Pepe and Thompson combination score,
developed using their optimal linear combination technique, aims to maximize
the Mann-Whitney statistic in the same way the Min-max method does. Unlike
the Min-max method, however, the Pepe and Thompson method takes all marker
values into account instead of only the minimum and maximum values.
maximize\; U(\alpha) = \left(\frac{1}{n_1 n_2}\right) {\sum_{i=1}^{n_1} {\sum_{j=1}^{n_2}}I(D_{i1} + \alpha D_{i2} \geq H_{j1} + \alpha H_{j2})}
c_i = x_{i1} + \alpha x_{i2}
-
Pepe, Cai & Langton’s method (PCL) : Pepe, Cai and Langton combination score
obtained by using AUC as the parameter of a logistic regression model.
maximize\; U(\alpha) = \left(\frac{1}{n_1 n_2}\right) {\sum_{i=1}^{n_1} {\sum_{j=1}^{n_2}} \left[ I(D_{i1} + \alpha D_{i2} > H_{j1} + \alpha H_{j2}) + \left(\frac{1}{2} \right) I(D_{i1} + \alpha D_{i2} = H_{j1} + \alpha H_{j2}) \right]}
-
Min-Max method (minmax) : This method linearly combines the minimum
and maximum values of the markers by finding a parameter, \alpha , that
maximizes the Mann-Whitney statistic, an empirical estimate of the ROC area.
maximize\;U( \alpha ) = \left(\frac{1}{n_1 n_2}\right) {\sum_{i=1}^{n_1} {\sum_{j=1}^{n_2}}I(D_{i,max} + \alpha D_{i,min} > H_{j,max} + \alpha H_{j,min})}
c_i = x_{i,max} + \alpha x_{i,min}
where x_{i,max} = max(x_{i1},x_{i2}) and x_{i,min} = min(x_{i1}, x_{i2})
-
Su & Liu’s method (SL) : The Su and Liu combination score is computed through
Fisher’s discriminant coefficients, which assume that the underlying
data follow a multivariate normal distribution and that the covariance matrices of the
different classes are proportional. Assume that
D\sim N(\mu_D,\textstyle \sum_D)
and
H\sim N(\mu_H,\textstyle \sum_H) represent
the multivariate normal distributions for the diseased and non-diseased groups,
respectively. Fisher’s coefficients are:
(\alpha , \beta) = (\textstyle \sum_{D}+\sum_{H})^{\;-1}\mu
\text{where } \mu = \mu_D - \mu_H. \text{The combination score in this case is:}
c_i = \alpha x_{i1} + \beta x_{i2}
-
Minimax approach (minimax) : Combination score obtained with the Minimax procedure;
the t parameter is chosen as the value that gives the maximum AUC from the
combination score. Suppose that D follows a multivariate normal distribution
D\sim N(\mu_D,\textstyle \sum_D) , representing the diseased group, and H follows
a multivariate normal distribution H\sim N(\mu_H,\textstyle \sum_H) , representing the non-diseased group.
Then the combination coefficients are:
(\alpha , \beta) = {[t { \textstyle \sum_{D}} + (1 - t) \textstyle \sum_{H}] ^ {-1}}{(\mu_D - \mu_H)}
c_i = \alpha x_{i1} + \beta x_{i2}
-
Todor & Saplacan’s method (TS) : Combination score obtained by using
the trigonometric functions of the \theta value that optimizes the corresponding AUC.
c_i = sin(\theta) x_{i1} + cos(\theta) x_{i2}
|
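As an aside, the Mann-Whitney grid search behind the Min-max method above can be sketched in a few lines of base R. This is an illustration of the formulas only, not the internal implementation of linComb; the data and the grid of candidate \alpha values are made up.

```r
# Illustrative sketch (not linComb itself): grid-search the Min-max
# combination parameter alpha that maximizes U(alpha), using base R only.
set.seed(1)
diseased <- cbind(rnorm(30, 2), rnorm(30, 2))  # D: n1 x 2 marker values
healthy  <- cbind(rnorm(30, 0), rnorm(30, 0))  # H: n2 x 2 marker values

# Empirical Mann-Whitney statistic U(alpha) for the Min-max method
U <- function(alpha, D, H) {
  d_max <- apply(D, 1, max); d_min <- apply(D, 1, min)
  h_max <- apply(H, 1, max); h_min <- apply(H, 1, min)
  # (1 / (n1 * n2)) * sum of I(D_{i,max} + a D_{i,min} > H_{j,max} + a H_{j,min})
  mean(outer(d_max + alpha * d_min, h_max + alpha * h_min, ">"))
}

alphas <- seq(0, 1, by = 0.01)
best   <- alphas[which.max(sapply(alphas, U, D = diseased, H = healthy))]

# Combination score c_i = x_{i,max} + alpha * x_{i,min} for all subjects
x      <- rbind(diseased, healthy)
scores <- apply(x, 1, max) + best * apply(x, 1, min)
```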
resample |
a character string indicating the
resampling option. Bootstrapping, cross-validation, and repeated cross-validation
are given as the options for resampling, along with the number
of folds and the number of repeats.
-
boot : In bootstrapping, the dataset is resampled with replacement
for the given number of iterations; models are trained and tested
on these resamples to determine the best parameters for the given method and
dataset.
-
cv : In cross-validation, the dataset is divided into the given
number of folds without replacement; in each iteration, one fold is
selected as the test set, and the model is built using the remaining folds
and tested on the test set. The corresponding AUC values and the parameters
used for the combination are kept in a list. The best-performing model is
selected, and the combination score is returned for the whole dataset.
-
repeatedcv : In repeated cross-validation, the cross-validation process
is repeated the given number of times; the best-performing model from each
repeat is stored in another list, and the best among these models is applied
to the entire dataset.
|
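The fold-based parameter selection described above can be illustrated with a small base-R sketch. The data, the candidate parameter grid, and the score form are hypothetical; this mirrors the idea behind the cv option rather than linComb's internal code.

```r
# Illustrative sketch of "cv" resampling for parameter selection (base R).
set.seed(1)
n      <- 60
y      <- rep(c(1, 0), each = n / 2)       # 1 = diseased, 0 = healthy
x      <- cbind(rnorm(n, y), rnorm(n, y))  # two hypothetical markers
nfolds <- 5
folds  <- sample(rep(seq_len(nfolds), length.out = n))  # fold labels

alphas <- seq(0, 1, by = 0.1)  # candidate combination parameters
# Empirical AUC: proportion of diseased/healthy pairs ranked correctly
auc <- function(score, y) mean(outer(score[y == 1], score[y == 0], ">"))

# Evaluate each candidate on every held-out fold; average the fold AUCs
cv_auc <- sapply(alphas, function(a) {
  mean(sapply(seq_len(nfolds), function(k) {
    test <- folds == k
    auc((x[, 1] + a * x[, 2])[test], y[test])
  }))
})
best_alpha <- alphas[which.max(cv_auc)]  # parameter kept for the full data
```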
nfolds |
a numeric value that indicates the number of folds for
cross-validation-based resampling methods (5, default)
|
nrepeats |
a numeric value that indicates the number of repeats
for "repeatedcv" option of resampling methods (3, default)
|
niters |
a numeric value that indicates the number of
bootstrapped resampling iterations (10, default)
|
standardize |
a character string indicating the name of the
standardization method. The default option is no standardization applied.
Available options are:
-
Z-score (zScore) : This method scales the data to have a mean
of 0 and a standard deviation of 1. It subtracts the mean and divides by the standard
deviation for each feature. Mathematically,
Z-score = \frac{x - \overline{x}}{sd(x)}
where x is the value of a marker, \overline{x} is the mean of the marker and sd(x) is the standard deviation of the marker.
-
T-score (tScore) : T-score is commonly used
in data analysis to transform raw scores into a standardized form.
The standard formula for converting a raw score x into a T-score is:
T-score = \Biggl(\frac{x - \overline{x}}{sd(x)}\times 10 \Biggr) + 50
where x is the value of a marker, \overline{x} is the mean of the marker
and sd(x) is the standard deviation of the marker.
-
Range (a.k.a. min-max scaling) (range) : This method transforms data to
a specific range, between 0 and 1. The formula for this method is:
Range = \frac{x - min(x)}{max(x) - min(x)}
-
Mean (mean) : This method, which helps
to understand the relative size of a single observation concerning
the mean of the dataset, calculates the ratio of each data point to the mean value
of the dataset.
Mean = \frac{x}{\overline{x}}
where x is the value of a marker and \overline{x} is the mean of the marker.
-
Deviance (deviance) : This method, which allows for
comparison of individual data points in relation to the overall spread of
the data, calculates the ratio of each data point to the standard deviation
of the dataset.
Deviance = \frac{x}{sd(x)}
where x is the value of a marker and sd(x) is the standard deviation of the marker.
|
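For reference, the standardization formulas above can be written directly in base R. This is a minimal sketch with made-up function names; linComb applies the chosen option internally through the standardize argument.

```r
# Base-R sketches of the standardization formulas listed above
z_score   <- function(x) (x - mean(x)) / sd(x)          # zScore
t_score   <- function(x) ((x - mean(x)) / sd(x)) * 10 + 50  # tScore
range_std <- function(x) (x - min(x)) / (max(x) - min(x))   # range
mean_std  <- function(x) x / mean(x)                    # mean
dev_std   <- function(x) x / sd(x)                      # deviance

x <- c(2, 4, 6, 8, 10)
z_score(x)    # mean 0, standard deviation 1
t_score(x)    # mean 50, standard deviation 10
range_std(x)  # scaled onto [0, 1]
```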
ndigits |
an integer value that indicates the number of decimal
places to be used for rounding in the Scoring method (0, default)
|
show.plot |
a logical value. If TRUE, a ROC curve is
plotted. Default is TRUE
|
direction |
a character string that determines the direction of the
comparison. ">": the predictor values for the control group
are higher than the values of the case group (controls > cases).
"<": the predictor values for the control group are lower than or equal to
the values of the case group (controls < cases).
|
conf.level |
a numeric value that determines the confidence level
for the ROC curve (0.95, default).
|
cutoff.method |
a character string that determines the cutoff method
for the ROC curve.
|
... |
further arguments. Currently has no effect on the results.
|
Value
A list of numeric
linear combination scores calculated
according to the given method and standardization option.
Author(s)
Serra Ilayda Yerlitas, Serra Bersan Gengec, Necla Kochan,
Gozde Erturk Zararsiz, Selcuk Korkmaz, Gokmen Zararsiz
Examples
# call data
data(exampleData1)
# define the function parameters
markers <- exampleData1[, -1]
status <- factor(exampleData1$group, levels = c("not_needed", "needed"))
event <- "needed"
score1 <- linComb(
markers = markers, status = status, event = event,
method = "logistic", resample = "none", show.plot = TRUE,
standardize = "none", direction = "<", cutoff.method = "Youden"
)
# call data
data(exampleData2)
# define the function parameters
markers <- exampleData2[, -c(1:3, 6:7)]
status <- factor(exampleData2$Group, levels = c("normals", "carriers"))
event <- "carriers"
score2 <- linComb(
markers = markers, status = status, event = event,
method = "PT", resample = "none", standardize = "none", direction = "<",
cutoff.method = "Youden"
)
score3 <- linComb(
markers = markers, status = status, event = event,
method = "minmax", resample = "none", direction = "<",
cutoff.method = "Youden"
)
[Package dtComb version 1.0.2 Index]