fastNaiveBayes {fastNaiveBayes}

R Documentation

Fast Naive Bayes Classifier for different Distributions

Description

Extremely fast implementation of a Naive Bayes Classifier.

A Naive Bayes classifier assumes that the feature variables are independent of each other given the class. Currently, a Bernoulli, multinomial, or Gaussian distribution can be used. Use the Bernoulli distribution when the features are 0 or 1, indicating the absence or presence of a feature in each document. Use the multinomial distribution when the features are counts of how often a feature occurs in each document. Use the Gaussian distribution for continuous numerical variables. The distribution parameter assigns different distributions to different columns of the input matrix.

Use fastNaiveBayes(...) or fnb.train(...) for a mixed event distribution model. Use fnb.bernoulli, fnb.multinomial, fnb.gaussian, or fnb.poisson to fit a single specific distribution.

Usage

fastNaiveBayes(
  x,
  y,
  priors = NULL,
  laplace = 0,
  sparse = FALSE,
  check = TRUE,
  distribution = fnb.detect_distribution(x)
)

## Default S3 method:
fastNaiveBayes(
  x,
  y,
  priors = NULL,
  laplace = 0,
  sparse = FALSE,
  check = TRUE,
  distribution = fnb.detect_distribution(x)
)

fnb.bernoulli(x, y, priors = NULL, laplace = 0, sparse = FALSE, check = TRUE)

## Default S3 method:
fnb.bernoulli(x, y, priors = NULL, laplace = 0, sparse = FALSE, check = TRUE)

fnb.gaussian(x, y, priors = NULL, sparse = FALSE, check = TRUE)

## Default S3 method:
fnb.gaussian(x, y, priors = NULL, sparse = FALSE, check = TRUE)

fnb.multinomial(x, y, priors = NULL, laplace = 0, sparse = FALSE, check = TRUE)

## Default S3 method:
fnb.multinomial(x, y, priors = NULL, laplace = 0, sparse = FALSE, check = TRUE)

fnb.poisson(x, y, priors = NULL, sparse = FALSE, check = TRUE)

## Default S3 method:
fnb.poisson(x, y, priors = NULL, sparse = FALSE, check = TRUE)

fnb.train(
  x,
  y,
  priors = NULL,
  laplace = 0,
  sparse = FALSE,
  check = TRUE,
  distribution = fnb.detect_distribution(x)
)

## Default S3 method:
fnb.train(
  x,
  y,
  priors = NULL,
  laplace = 0,
  sparse = FALSE,
  check = TRUE,
  distribution = fnb.detect_distribution(x)
)

Arguments

`x`	a numeric matrix or a dgCMatrix. For Bernoulli it should contain only 0's and 1's. For multinomial it should contain only integers.
`y`	a factor giving the class of each observation in x
`priors`	a numeric vector with the prior probability of each class. If NULL (the default), the priors are the relative frequencies of the classes in the data.
`laplace`	a number used for Laplace smoothing. Default is 0.
`sparse`	Use a sparse matrix? If TRUE, a sparse matrix is constructed from x. A sparse dgCMatrix can also be passed directly as x, which sets this parameter to TRUE.
`check`	Whether to run formal checks on the input. TRUE is recommended. Setting it to FALSE is faster, but at your own risk.
`distribution`	A list that maps distribution names to the column names that should use each distribution; see examples.
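
As a rough illustration of what the priors and laplace arguments control, the base R sketch below computes relative-frequency priors and a Laplace-smoothed Bernoulli estimate on the mtcars data used in the examples. The smoothing formula is the standard textbook form and is an assumption here; the package's internal computation may differ in detail. All variable names are illustrative.

```r
# Binary target, as in the Examples section
y <- as.factor(ifelse(mtcars$mpg > 25, "High", "Low"))

# Relative-frequency priors: share of each class in y
priors <- as.numeric(prop.table(table(y)))
names(priors) <- levels(y)

# Textbook Laplace smoothing for one binary feature (am):
# P(am = 1 | class) = (count + laplace) / (n_class + 2 * laplace)
laplace <- 1
count_present <- tapply(mtcars$am, y, sum)   # rows with am == 1, per class
n_class <- as.numeric(table(y))              # rows per class
p_present <- (count_present + laplace) / (n_class + 2 * laplace)

priors
p_present
```

With laplace = 0 the estimate for the "High" class would be exactly 1, since every such car has am == 1; smoothing keeps the probability away from 0 and 1.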

Details

fastNaiveBayes(...) converts non-numeric columns to one-hot encoded features for use with the Bernoulli event model. By default, NA's in x are set to 0 and observations with NA in y are removed.
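
The preprocessing described above can be approximated in base R as follows. This is only a sketch on a hypothetical toy data frame; it uses model.matrix for the one-hot step and is not the package's own code.

```r
# Hypothetical toy data: a numeric column with an NA and a categorical column
df <- data.frame(
  count = c(2, NA, 0, 5),
  color = factor(c("red", "blue", "red", "green"))
)
y <- factor(c("a", "b", NA, "a"))

# Remove observations whose class label is missing
keep <- !is.na(y)
df <- df[keep, , drop = FALSE]
y <- y[keep]

# One-hot encode the categorical column: one 0/1 indicator per level
onehot <- model.matrix(~ color - 1, data = df)

# Set missing numeric values to 0 and combine into one numeric matrix
count <- ifelse(is.na(df$count), 0, df$count)
x <- cbind(count = count, onehot)
x
```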

The distribution used for each feature is determined by a set of rules:
- if the column contains only 0's and 1's, a Bernoulli event model is used
- if the column contains only whole numbers, a multinomial event model is used
- otherwise, a Gaussian event model is used
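
The rules above can be sketched in base R. The helper detect_distribution_sketch below is hypothetical and written only to illustrate the rules as stated; it is not the implementation of fnb.detect_distribution, which may handle edge cases differently.

```r
# Hypothetical helper that applies the three rules column by column
detect_distribution_sketch <- function(x) {
  x <- as.data.frame(x)
  kind <- vapply(x, function(col) {
    col <- col[!is.na(col)]
    if (all(col %in% c(0, 1))) {
      "bernoulli"            # only 0's and 1's
    } else if (all(col == round(col))) {
      "multinomial"          # only whole numbers
    } else {
      "gaussian"             # anything else
    }
  }, character(1))
  # Group column names by the distribution assigned to them
  split(names(kind),
        factor(kind, levels = c("bernoulli", "multinomial", "gaussian")))
}

# All mtcars columns except mpg, as in the Examples section
rules <- detect_distribution_sketch(mtcars[, -1])
rules
```

The result is a list of column names keyed by distribution name, which is the general shape the distribution argument is described as taking.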

By setting sparse = TRUE the numeric matrix x will be converted to a sparse dgCMatrix. This can be considerably faster when few entries of x are nonzero.

It's also possible to supply a sparse dgCMatrix directly, which can be a lot faster when a fastNaiveBayes model is trained multiple times on the same matrix or on subsets of it. See examples for more details. Bear in mind that converting to a sparse matrix can actually be slower, depending on the data.
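
A minimal sketch of the convert-once, fit-many pattern described above, assuming the Matrix package is installed. The call to fnb.bernoulli is guarded so the conversion part runs even without fastNaiveBayes; the variable names and the choice of row subset are illustrative only.

```r
library(Matrix)

# Binary columns of mtcars as a plain numeric matrix
x_dense <- as.matrix(mtcars[, c("vs", "am")])
y <- as.factor(ifelse(mtcars$mpg > 25, "High", "Low"))

# Convert once, up front; a non-square numeric matrix becomes a dgCMatrix
x_sparse <- Matrix(x_dense, sparse = TRUE)
class(x_sparse)

# Reuse the same sparse matrix, or row subsets of it, across several fits
if (requireNamespace("fastNaiveBayes", quietly = TRUE)) {
  fit_all <- fastNaiveBayes::fnb.bernoulli(x_sparse, y, laplace = 1)

  rows <- seq(1, nrow(x_sparse), by = 2)
  fit_half <- fastNaiveBayes::fnb.bernoulli(
    x_sparse[rows, , drop = FALSE], y[rows], laplace = 1
  )
}
```

Whether this is faster than passing the dense matrix depends on how sparse the data is, so it is worth timing both variants on the actual data.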

Value

A fitted object of class "fastNaiveBayes". It has four components:

model: Fitted fastNaiveBayes model
names: Names of features used to train this fastNaiveBayes model
distribution: Distribution used for each column of x
levels: Levels of y

Examples

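rm(list = ls())
library(fastNaiveBayes)
cars <- mtcars
# Binary target: "High" when mpg exceeds 25, otherwise "Low"
y <- as.factor(ifelse(cars$mpg > 25, "High", "Low"))
# Use every column except mpg as a feature
x <- cars[, 2:ncol(cars)]

# Mixed event model; the distribution of each column is detected from x
mod <- fastNaiveBayes(x, y, laplace = 1)

pred <- predict(mod, newdata = x)
# Misclassification rate on the training data
mean(y != pred)

# fnb.train takes the same arguments as fastNaiveBayes
mod <- fnb.train(x, y, laplace = 1)

pred <- predict(mod, newdata = x)
mean(y != pred)

# Detect which columns fall under which distribution
dist <- fnb.detect_distribution(x)

# Bernoulli model on the columns detected as binary
bern <- fnb.bernoulli(x[, dist$bernoulli], y, laplace = 1)
pred <- predict(bern, x[, dist$bernoulli])
mean(y != pred)

# Multinomial model on the columns detected as whole-number counts
mult <- fnb.multinomial(x[, dist$multinomial], y, laplace = 1)
pred <- predict(mult, x[, dist$multinomial])
mean(y != pred)

# Gaussian model on the remaining numeric columns
gauss <- fnb.gaussian(x[, dist$gaussian], y)
pred <- predict(gauss, x[, dist$gaussian])
mean(y != pred)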

[Package fastNaiveBayes version 2.2.1 Index]