npc {nproc}

R Documentation

Construct a Neyman-Pearson Classifier from a sample of class 0 and class 1.

Description

Given a type I error upper bound alpha and a violation upper bound delta, npc calculates the Neyman-Pearson Classifier which controls the type I error under alpha with probability at least 1-delta.
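To make the role of alpha and delta concrete, the following is a minimal sketch of the order-statistic rule described in the reference cited below (Tong, Feng and Li, 2018). It assumes the threshold is set at the k-th smallest of n held-out class 0 scores, so that the probability of the type I error exceeding alpha is bounded by a binomial tail. The helper `np_rank` is hypothetical and is not part of nproc; the package's internal code may differ.

```r
# Hypothetical helper (not an nproc function): smallest rank k such that
# thresholding at the k-th smallest of n class 0 scores keeps the
# violation probability at or below delta.
np_rank <- function(n, alpha = 0.05, delta = 0.05) {
  k <- seq_len(n)
  # P(Binomial(n, 1 - alpha) >= k): upper bound on the probability that
  # the type I error exceeds alpha when the k-th order statistic is used
  violation <- pbinom(k - 1, size = n, prob = 1 - alpha, lower.tail = FALSE)
  ok <- which(violation <= delta)
  if (length(ok) == 0) return(NA_integer_)  # too few class 0 observations
  min(ok)
}

np_rank(58)   # NA: no rank satisfies the bound with 58 observations
np_rank(59)   # 59: the largest score is the only admissible threshold
np_rank(500)  # with more observations, a rank below n becomes admissible
```

Under these assumptions, alpha = 0.05 and delta = 0.05 require at least 59 class 0 observations for threshold estimation, because (1 - alpha)^n must not exceed delta.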

Usage

npc(x = NULL, y, method = c("logistic", "penlog", "svm", "randomforest",
  "lda", "slda", "nb", "nnb", "ada", "tree"), alpha = 0.05, delta = 0.05,
  split = 1, split.ratio = 0.5, n.cores = 1, band = FALSE,
  nfolds = 10, randSeed = 0, warning = TRUE, ...)

Arguments

`x`	n * p observation matrix. n observations, p covariates.
`y`	n 0/1 observations (class labels).
`method`	base classification method.
	logistic: Logistic regression. `glm` function with family = 'binomial'.
	penlog: Penalized logistic regression with LASSO penalty. `glmnet` in `glmnet` package.
	svm: Support Vector Machines. `svm` in `e1071` package.
	randomforest: Random Forest. `randomForest` in `randomForest` package.
	lda: Linear Discriminant Analysis. `lda` in `MASS` package.
	slda: Sparse Linear Discriminant Analysis with LASSO penalty.
	nb: Naive Bayes. `naiveBayes` in `e1071` package.
	nnb: Nonparametric Naive Bayes. `naive_bayes` in `naivebayes` package.
	ada: Ada-Boost. `ada` in `ada` package.
`alpha`	the desirable upper bound on type I error. Default = 0.05.
`delta`	the upper bound on the violation rate, i.e. the probability that the type I error exceeds alpha. Default = 0.05.
`split`	the number of random splits of the class 0 sample. Default = 1. For the ensemble version, choose split > 1.
`split.ratio`	the proportion of the class 0 sample used to train the base classifier. The rest is used to estimate the threshold. Can also be set to "adaptive", in which case the ratio is chosen by a data-driven method implemented in `find.optim.split`. Default = 0.5.
`n.cores`	number of cores used for parallel computing. Default = 1. WARNING: Windows machines are not supported.
`band`	whether to generate both lower and upper bounds of type II error. Default = FALSE.
`nfolds`	number of folds for performing adaptive split ratio selection. Default = 10.
`randSeed`	the random seed used in the algorithm. Default = 0.
`warning`	whether to show various warnings in the program. Default = TRUE.
`...`	additional arguments.
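As an illustration of what `split.ratio` controls, the sketch below partitions the class 0 indices into a part for training the base classifier and a held-out part for threshold estimation. It uses base R only; the variable names are hypothetical, and how npc draws the split internally is an assumption here, not something this page specifies.

```r
set.seed(1)
y <- rbinom(200, 1, 0.5)   # toy 0/1 labels
split.ratio <- 0.5         # same meaning as the npc argument

ind0 <- which(y == 0)      # indices of the class 0 observations
n0.train <- round(length(ind0) * split.ratio)

# Class 0 observations used to train the base classifier
ind0.train <- sample(ind0, n0.train)
# Remaining class 0 observations, held out to estimate the threshold
ind0.thresh <- setdiff(ind0, ind0.train)

cat("class 0 total:", length(ind0),
    " train:", length(ind0.train),
    " threshold:", length(ind0.thresh), "\n")
```

With split > 1, one would expect this partition to be redrawn for each split, which is why a larger class 0 sample leaves more observations for both parts.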

Value

An object with S3 class npc.

`fits`	a list of length max(1, split) containing the fit from each split.
`method`	the base classification method.
`split`	the number of splits used.
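The components listed above can be inspected directly on a fitted object. The following sketch assumes the nproc package (and the MASS package used by the 'lda' method) is installed; it is guarded so that it does nothing otherwise. Only the components documented on this page are accessed, and the internal structure of each element of `fits` is not assumed.

```r
# Guarded so the sketch is a no-op when nproc is not installed.
if (requireNamespace("nproc", quietly = TRUE)) {
  set.seed(1)
  n <- 1000
  x <- matrix(rnorm(n * 2), n, 2)
  y <- rbinom(n, 1, 1 / (1 + exp(-(1 + 3 * x[, 1]))))

  fit <- nproc::npc(x, y, method = "lda", split = 3)

  print(class(fit))        # S3 class of the returned object
  print(fit$method)        # base classification method
  print(fit$split)         # number of splits used
  print(length(fit$fits))  # one fit per split, per the description above
}
```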

References

Xin Tong, Yang Feng, and Jingyi Jessica Li (2018), Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristic (NP-ROC), Science Advances, 4, 2, eaao1659.

Examples

set.seed(1)
n = 1000
x = matrix(rnorm(n*2),n,2)
c = 1+3*x[,1]
y = rbinom(n,1,1/(1+exp(-c)))
xtest = matrix(rnorm(n*2),n,2)
ctest = 1+3*xtest[,1]
ytest = rbinom(n,1,1/(1+exp(-ctest)))

##Use lda classifier and the default type I error control with alpha=0.05, delta=0.05
fit = npc(x, y, method = 'lda')
pred = predict(fit,xtest)
fit.score = predict(fit,x)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

## Not run: 
##Ensembled lda classifier with split = 11,  alpha=0.05, delta=0.05
fit = npc(x, y, method = 'lda', split = 11)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

##Now, change the method to logistic regression and change alpha to 0.1
fit = npc(x, y, method = 'logistic', alpha = 0.1)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

##Now, change the method to adaboost
fit = npc(x, y, method = 'ada', alpha = 0.1)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

##Now, try the adaptive splitting ratio
fit = npc(x, y, method = 'ada', alpha = 0.1, split.ratio = 'adaptive')
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')
cat('Splitting ratio:', fit$split.ratio, '\n')

## End(Not run)
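The examples above repeat the same accuracy and type I error computation after every fit. A small helper can factor that out and report the type II error as well. This is a sketch in base R; `np_errors` is a hypothetical name and not an nproc function. It takes predicted and true 0/1 labels, such as `pred$pred.label` and `ytest` from the examples.

```r
# Hypothetical helper (not part of nproc): summarize 0/1 predictions.
np_errors <- function(pred.label, y.true) {
  ind0 <- which(y.true == 0)
  ind1 <- which(y.true == 1)
  c(accuracy = mean(pred.label == y.true),
    # type I error: class 0 observations predicted as class 1
    typeI    = mean(pred.label[ind0] != y.true[ind0]),
    # type II error: class 1 observations predicted as class 0
    typeII   = mean(pred.label[ind1] != y.true[ind1]))
}

# Toy check with hand-made labels
np_errors(pred.label = c(0, 1, 1, 0), y.true = c(0, 0, 1, 1))
```

After any of the fits above, `np_errors(pred$pred.label, ytest)` would return the three quantities in one call.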

[Package nproc version 2.1.5 Index]