bn.cv {bnlearn}    R Documentation
Cross-validation for Bayesian networks
Description
Perform k-fold, custom-folds or hold-out cross-validation for a learning algorithm or a fixed network structure.
Usage
bn.cv(data, bn, loss = NULL, ..., algorithm.args = list(),
loss.args = list(), fit, fit.args = list(), method = "k-fold",
cluster, debug = FALSE)
## S3 method for class 'bn.kcv'
plot(x, ..., main, xlab, ylab, connect = FALSE)
## S3 method for class 'bn.kcv.list'
plot(x, ..., main, xlab, ylab, connect = FALSE)
loss(x)
Arguments
data
a data frame containing the variables in the model.

bn
either a character string (the label of the learning algorithm to be applied to the training data in each iteration) or an object of class bn (a fixed network structure).

loss
a character string, the label of a loss function. If none is specified, the default loss function is the Classification Error for Bayesian network classifiers and the Log-Likelihood Loss for both discrete and continuous data sets otherwise. See below for additional details.

algorithm.args
a list of extra arguments to be passed to the learning algorithm.

loss.args
a list of extra arguments to be passed to the loss function specified by loss.

fit
a character string, the label of the method used to fit the parameters of the network. See bn.fit for details.

fit.args
additional arguments for the parameter estimation procedure, see again bn.fit for details.

method
a character string, either k-fold, custom-folds or hold-out. See below for details.

cluster
an optional cluster object from package parallel.

debug
a boolean value. If TRUE a lot of debugging output is printed; otherwise the function is completely silent.

x
an object of class bn.kcv or bn.kcv.list returned by bn.cv().

...
additional objects of class bn.kcv or bn.kcv.list to plot alongside the first.

main, xlab, ylab
the title of the plot, an array of labels for the boxplots, the label of the y axis.

connect
a logical value. If TRUE, the medians of the boxplots are connected by a segmented line.
Value
bn.cv() returns an object of class bn.kcv if runs is equal to 1 and an object of class bn.kcv.list if runs is at least 2.

loss() returns a numeric vector with a length equal to runs.
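For instance, a minimal sketch of how the two return classes arise (using the learning.test data set shipped with bnlearn; runs = 1 is the default):

library(bnlearn)
cv.single = bn.cv(learning.test, "hc")            # runs = 1: a bn.kcv object
cv.multi = bn.cv(learning.test, "hc", runs = 10)  # runs > 1: a bn.kcv.list object
loss(cv.single)  # numeric vector of length 1
loss(cv.multi)   # numeric vector of length 10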
Cross-Validation Strategies
The following cross-validation methods are implemented:

- k-fold: the data are split in k subsets of equal size. For each subset in turn, bn is fitted (and possibly learned as well) on the other k - 1 subsets and the loss function is then computed using that subset. Loss estimates for each of the k subsets are then combined to give an overall loss for data.

- custom-folds: the data are manually partitioned by the user into subsets, which are then used as in k-fold cross-validation. Subsets are not constrained to have the same size, and every observation must be assigned to one subset.

- hold-out: k subsamples of size m are sampled independently without replacement from the data. For each subsample, bn is fitted (and possibly learned) on the remaining nrow(data) - m samples and the loss function is computed on the m observations in the subsample. The overall loss estimate is the average of the k loss estimates from the subsamples.

If cross-validation is used with multiple runs, the overall loss is the average of the loss estimates from the different runs.
Cross-validation methods accept the following optional arguments (see the sketch after this list):

- k: a positive integer number, the number of groups into which the data will be split (in k-fold cross-validation) or the number of times the data will be split in training and test samples (in hold-out cross-validation).

- m: a positive integer number, the size of the test set in hold-out cross-validation.

- runs: a positive integer number, the number of times k-fold or hold-out cross-validation will be run.

- folds: a list in which each element corresponds to one fold and contains the indices of the observations included in that fold; or a list with an element for each run, in which each element is itself a list of the folds to be used for that run.
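As a sketch of how these arguments combine (again using learning.test; the particular values of k, m and the fold assignment below are arbitrary illustrations):

library(bnlearn)
# k-fold: 5 folds, repeated 3 times.
bn.cv(learning.test, "hc", k = 5, runs = 3)
# hold-out: 10 independent splits, each with a test set of 100 observations.
bn.cv(learning.test, "hc", method = "hold-out", k = 10, m = 100)
# custom-folds: assign each observation to exactly one of 4 folds.
folds = split(seq_len(nrow(learning.test)),
          rep(1:4, length.out = nrow(learning.test)))
bn.cv(learning.test, "hc", method = "custom-folds", folds = folds)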
Loss Functions
The following loss functions are implemented (an example follows the list):

- Log-Likelihood Loss (logl): also known as negative entropy or negentropy, it is the negated expected log-likelihood of the test set for the Bayesian network fitted from the training set. Lower values are better.

- Gaussian Log-Likelihood Loss (logl-g): the negated expected log-likelihood for Gaussian Bayesian networks. Lower values are better.

- Conditional Gaussian Log-Likelihood Loss (logl-cg): the negated expected log-likelihood for hybrid (conditional Gaussian) Bayesian networks. Lower values are better.

- Classification Error (pred): the prediction error for a single node in a discrete network. Frequentist predictions are used, so the values of the target node are predicted using only the information present in its local distribution (from its parents). Lower values are better.

- Posterior Classification Error (pred-lw and pred-lw-cg): similar to the above, but predictions are computed from an arbitrary set of nodes using likelihood weighting to obtain Bayesian posterior estimates. pred-lw applies to discrete Bayesian networks, pred-lw-cg to (discrete nodes in) hybrid networks. Lower values are better.

- Exact Classification Error (pred-exact): closed-form exact posterior predictions are available for Bayesian network classifiers. Lower values are better.

- Predictive Correlation (cor): the correlation between the observed and the predicted values for a single node in a Gaussian Bayesian network. Higher values are better.

- Posterior Predictive Correlation (cor-lw and cor-lw-cg): similar to the above, but predictions are computed from an arbitrary set of nodes using likelihood weighting to obtain Bayesian posterior estimates. cor-lw applies to Gaussian networks and cor-lw-cg to (continuous nodes in) hybrid networks. Higher values are better.

- Mean Squared Error (mse): the mean squared error between the observed and the predicted values for a single node in a Gaussian Bayesian network. Lower values are better.

- Posterior Mean Squared Error (mse-lw and mse-lw-cg): similar to the above, but predictions are computed from an arbitrary set of nodes using likelihood weighting to obtain Bayesian posterior estimates. mse-lw applies to Gaussian networks and mse-lw-cg to (continuous nodes in) hybrid networks. Lower values are better.
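For example (a sketch using the gaussian.test data set from bnlearn; node F is an arbitrary choice of target):

library(bnlearn)
# Gaussian log-likelihood loss, the default for this data set.
bn.cv(gaussian.test, "hc", loss = "logl-g")
# predictive correlation for node F, predicted from its parents.
bn.cv(gaussian.test, "hc", loss = "cor", loss.args = list(target = "F"))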
Optional arguments that can be specified in loss.args are (see the sketch after this list):

- predict: a character string, the label of the method used to predict the observations in the test set. The default is "parents". Other possible values are the same as in predict().

- predict.args: a list containing the optional arguments for the prediction method. See the documentation for predict() for more details.

- target: a character string, the label of the target node for prediction in all loss functions but logl, logl-g and logl-cg.

- from: a vector of character strings, the labels of the nodes used to predict the target node in pred-lw, pred-lw-cg, cor-lw, cor-lw-cg, mse-lw and mse-lw-cg. The default is to use all the other nodes in the network. Loss functions pred, cor and mse implicitly predict only from the parents of the target node.

- n: a positive integer, the number of particles used by likelihood weighting for pred-lw, pred-lw-cg, cor-lw, cor-lw-cg, mse-lw and mse-lw-cg. The default value is 500.
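Putting these together (a sketch; the choice of target, predictor nodes and particle count below is purely illustrative):

library(bnlearn)
# posterior predictive correlation for F, predicted from A and B only,
# using 1000 likelihood weighting particles instead of the default 500.
bn.cv(gaussian.test, "hc", loss = "cor-lw",
  loss.args = list(target = "F", from = c("A", "B"), n = 1000))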
Note that if bn is a Bayesian network classifier, pred and pred-lw both give exact posterior predictions computed using the closed-form formulas for naive Bayes and TAN.
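A sketch of cross-validating such a classifier (naive.bayes() is bnlearn's constructor for naive Bayes structures; using F from learning.test as the class variable is an arbitrary illustration):

library(bnlearn)
# a fixed naive Bayes structure with F as the training (class) node;
# the class of the returned object makes the exact classification error available.
nb = naive.bayes(learning.test, training = "F")
bn.cv(learning.test, nb, loss = "pred-exact")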
Plotting Results from Cross-Validation
Both plot methods accept any combination of objects of class bn.kcv or bn.kcv.list (the first as the x argument, the remaining as the ... argument) and plot the respective expected loss values side by side. For a bn.kcv object, this means a single point; for a bn.kcv.list object, this means a boxplot.
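For instance (a sketch; the single-run object contributes one point, the ten-run object a boxplot):

library(bnlearn)
cv.hc = bn.cv(learning.test, "hc")                 # bn.kcv: plotted as a point
cv.tabu = bn.cv(learning.test, "tabu", runs = 10)  # bn.kcv.list: plotted as a boxplot
plot(cv.hc, cv.tabu, xlab = c("hc", "tabu"))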
Author(s)
Marco Scutari
References
Koller D, Friedman N (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Examples
# k-fold cross-validation with the classification error for node F.
bn.cv(learning.test, 'hc', loss = "pred", loss.args = list(target = "F"))
# custom folds of unequal sizes.
folds = list(1:2000, 2001:3000, 3001:5000)
bn.cv(learning.test, 'hc', loss = "logl", method = "custom-folds",
  folds = folds)
# hold-out cross-validation with two runs.
xval = bn.cv(gaussian.test, 'mmhc', method = "hold-out",
  k = 5, m = 50, runs = 2)
xval
loss(xval)
## Not run:
# comparing algorithms with multiple runs of cross-validation.
gaussian.subset = gaussian.test[1:50, ]
cv.gs = bn.cv(gaussian.subset, 'gs', runs = 10)
cv.iamb = bn.cv(gaussian.subset, 'iamb', runs = 10)
cv.inter = bn.cv(gaussian.subset, 'inter.iamb', runs = 10)
plot(cv.gs, cv.iamb, cv.inter,
xlab = c("Grow-Shrink", "IAMB", "Inter-IAMB"), connect = TRUE)
# use custom folds.
folds = split(sample(nrow(gaussian.subset)), seq(5))
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)
# multiple runs, with custom folds.
folds = replicate(5, split(sample(nrow(gaussian.subset)), seq(5)),
simplify = FALSE)
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)
## End(Not run)