naive_bayes {naivebayes}    R Documentation
Naive Bayes Classifier
Description
naive_bayes is used to fit the Naive Bayes model, in which predictors are assumed to be conditionally independent given the class label.
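Concretely, the independence assumption means the class posterior is proportional to the class prior times the product of per-predictor class-conditional densities. A hand-rolled sketch of this arithmetic for two Gaussian predictors (illustrative values, not the package's internal code):

# Unnormalized naive Bayes posterior for one observation with two
# Gaussian predictors (sketch of the underlying arithmetic).
prior <- c(A = 0.5, B = 0.5)
mu  <- list(A = c(0, 1), B = c(2, 3))   # per-class means
sdv <- list(A = c(1, 1), B = c(1, 2))   # per-class standard deviations
xnew <- c(0.5, 1.5)
post <- sapply(names(prior), function(k)
  prior[[k]] * prod(dnorm(xnew, mu[[k]], sdv[[k]])))
post / sum(post)   # normalized posterior probabilities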
Usage
## Default S3 method:
naive_bayes(x, y, prior = NULL, laplace = 0,
            usekernel = FALSE, usepoisson = FALSE, ...)

## S3 method for class 'formula'
naive_bayes(formula, data, prior = NULL, laplace = 0,
            usekernel = FALSE, usepoisson = FALSE,
            subset, na.action = stats::na.pass, ...)
Arguments
x: matrix or data frame with categorical (character/factor/logical) or metric (numeric) predictors.

y: class vector (character/factor/logical).

formula: an object of class "formula" (or one that can be coerced to "formula") of the form class ~ predictors, where the class variable is a character, factor or logical vector.

data: matrix or data frame with categorical (character/factor/logical) or metric (numeric) predictors.

prior: vector with prior probabilities of the classes. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels (a short usage sketch follows this list).

laplace: value used for Laplace smoothing (additive smoothing). Defaults to 0 (no Laplace smoothing).

usekernel: logical; if TRUE, kernel density estimation is used to estimate the class conditional densities of metric predictors (vectors with class "numeric").

usepoisson: logical; if TRUE, the Poisson distribution is used to estimate the class conditional distributions of non-negative integer predictors (vectors with class "integer").

subset: an optional vector specifying a subset of observations to be used in the fitting process.

na.action: a function which indicates what should happen when the data contain NAs. By default (stats::na.pass), missing values are not removed from the data; they are simply omitted when the tables are constructed.

...: other parameters passed on to stats::density when usekernel = TRUE (for instance adjust, kernel or bw).
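As an illustrative sketch of supplying prior (the train data frame comes from the Examples section below; the equal priors here are arbitrary and given in the order of the factor levels of the class variable):

# Illustrative only: override the training-set class proportions with
# equal priors, supplied in the order of the factor levels of 'class'.
nb_prior <- naive_bayes(class ~ ., train, prior = c(0.5, 0.5))
nb_prior$prior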
Details
Numeric (metric) predictors are handled by assuming that, given the class label, they follow a Gaussian distribution. Alternatively, kernel density estimation can be used (usekernel = TRUE) to estimate their class-conditional distributions, and non-negative integer predictors (variables representing counts) can be modelled with the Poisson distribution (usepoisson = TRUE); for further details please see the Note below. Missing values are not included when the tables are constructed. Logical variables are treated as categorical (binary) variables.
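For intuition, the Gaussian tables stored in a fitted model correspond to per-class sample means and standard deviations; a minimal base-R sketch of those quantities (not the package's internal code):

# Per-class mean and sd of a metric predictor; these are the quantities
# reported by tables(nb) for a Gaussian variable (sketch, not internals).
set.seed(1)
x <- rnorm(100)
y <- sample(c("classA", "classB"), 100, TRUE)
tapply(x, y, mean)
tapply(x, y, sd)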
Value
naive_bayes returns an object of class "naive_bayes", which is a list with the following components:

data: list with two components: x (data frame with predictors) and y (class variable).

levels: character vector with the values of the class variable.

laplace: amount of Laplace smoothing (additive smoothing).

tables: list of tables. For each categorical predictor, a table with class-conditional probabilities; for each integer predictor, a table with Poisson means (if usepoisson = TRUE); and for each metric predictor, a table with class-conditional means and standard deviations, or density objects if usekernel = TRUE.

prior: numeric vector with prior probabilities.

usekernel: logical; TRUE if kernel density estimation was used to estimate the class conditional densities of numeric variables.

usepoisson: logical; TRUE if the Poisson distribution was used to estimate the class conditional distributions of non-negative integer variables.

call: the call that produced this object.
Note
The class "numeric" contains "double" (double precision floating point numbers) and "integer". Depending on the parameters usekernel
and usepoisson
different class conditional distributions are applied to columns in the dataset with the class "numeric":
If
usekernel=FALSE
andpoisson=FALSE
then Gaussian distribution is applied to each "numeric" variable ("numeric"&"integer" or "numeric"&"double")If
usekernel=TRUE
andpoisson=FALSE
then kernel density estimation (KDE) is applied to each "numeric" variable ("numeric"&"integer" or "numeric"&"double")If
usekernel=FALSE
andpoisson=TRUE
then Gaussian distribution is applied to each "double" vector and Poisson to each "integer" vector. (Gaussian: "numeric" & "double"; Poisson: "numeric" & "integer")If
usekernel=TRUE
andpoisson=TRUE
then kernel density estimation (KDE) is applied to each "double" vector and Poisson to each "integer" vector. (KDE: "numeric" & "double"; Poisson: "numeric" & "integer")
By default usekernel = FALSE and usepoisson = FALSE, so the Gaussian distribution is applied to each numeric variable.
On the other hand, "character", "factor" and "logical" variables are assigned to the Categorical distribution, with the Bernoulli distribution as its special case.
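This mapping can be checked on a fitted model with get_cond_dist(); a small sketch with simulated data:

# With usepoisson = TRUE, an "integer" column gets the Poisson distribution
# while a "double" column keeps the Gaussian distribution.
set.seed(1)
d <- data.frame(y   = sample(c("a", "b"), 50, TRUE),
                dbl = rnorm(50),      # "numeric" & "double"  -> Gaussian
                int = rpois(50, 3))   # "numeric" & "integer" -> Poisson
m <- naive_bayes(y ~ ., d, usepoisson = TRUE)
get_cond_dist(m)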
Prior to model fitting, the classes of the columns in the data.frame "data" can easily be checked via the calls below; a coercion sketch follows the list:
- sapply(data, class)
- sapply(data, is.numeric)
- sapply(data, is.double)
- sapply(data, is.integer)
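For example, if a count variable happens to be stored as "double", it can be coerced with as.integer() so that usepoisson = TRUE assigns it the Poisson distribution (a sketch; the column name count is hypothetical):

# Coerce a double-valued count column to integer so that
# usepoisson = TRUE models it with the Poisson distribution.
data$count <- as.integer(data$count)
sapply(data, is.integer)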
Author(s)
Michal Majka, michalmajka@hotmail.com
See Also
predict.naive_bayes, plot.naive_bayes, tables, get_cond_dist, %class%
Examples
### Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE, FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5, 15)))
train <- data[1:95, ]
test <- data[96:100, -1]
### 1) General usage via formula interface
nb <- naive_bayes(class ~ ., train)
summary(nb)
# Classification
predict(nb, test, type = "class")
nb %class% test
# Posterior probabilities
predict(nb, test, type = "prob")
nb %prob% test
# Helper functions
tables(nb, 1)
get_cond_dist(nb)
# Note: all "numeric" (integer, double) variables are modelled
# with Gaussian distribution by default.
### 2) General usage via matrix/data.frame and class vector
X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test
### 3) Model continuous variables non-parametrically
### via kernel density estimation (KDE)
nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
get_cond_dist(nb_kde)
nb_kde %prob% test
# Visualize class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
plot(nb_kde, "count", arg.num = list(legend.cex = 0.9), prob = "conditional")
### ?density and ?bw.nrd for further documentation
# 3.1) Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
plot(nb_kde_biweight, c("norm", "count"),
     arg.num = list(legend.cex = 0.9), prob = "conditional")
# 3.2) Change the "nrd0" bandwidth selector (Silverman's rule of thumb)
#      to "SJ" (Sheather-Jones)
nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE,
                         bw = "SJ")
nb_kde_SJ %prob% test
plot(nb_kde_SJ, c("norm", "count"),
     arg.num = list(legend.cex = 0.9), prob = "conditional")
# 3.3) Adjust bandwidth
nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE,
                             adjust = 1.5)
nb_kde_adjust %prob% test
plot(nb_kde_adjust, c("norm", "count"),
     arg.num = list(legend.cex = 0.9), prob = "conditional")
### 4) Model non-negative integers with Poisson distribution
nb_pois <- naive_bayes(class ~ ., train, usekernel = TRUE, usepoisson = TRUE)
summary(nb_pois)
get_cond_dist(nb_pois)
# Posterior probabilities
nb_pois %prob% test
# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")
# Marginal distributions
plot(nb_pois, "count", prob = "marginal")
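### 5) Laplace smoothing (illustrative sketch: laplace = 1 adds one
### pseudo-count per cell of the categorical tables, avoiding zero
### class-conditional probabilities for unseen levels)
nb_laplace <- naive_bayes(class ~ ., train, laplace = 1)
tables(nb_laplace, 1)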
## Not run:
vars <- 10
rows <- 1000000
y <- sample(c("a", "b"), rows, TRUE)
# Only categorical variables
X1 <- as.data.frame(matrix(sample(letters[5:9], vars * rows, TRUE),
                           ncol = vars))
nb_cat <- naive_bayes(x = X1, y = y)
nb_cat
system.time(pred2 <- predict(nb_cat, X1))
## End(Not run)