BAGofT {BAGofT}    R Documentation

A Binary Regression Adaptive Goodness-of-fit Test (BAGofT)

Description

BAGofT tests the goodness-of-fit of binary classifiers. The test statistic is built from the results of multiple random splits of the data. Each split divides the data into a training set and a validation set. The test then adaptively obtains a partition from the training set and performs a goodness-of-fit test on the validation set. Details can be found in Zhang, Ding and Yang (2021).
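The steps of one split can be sketched in base R as follows. This is a minimal illustration and not the package's implementation: the data are simulated, the grouping rule (tertiles of 'x3' computed on the training set) is a simple stand-in for the adaptive random forest partition, and the grouped chi-squared form of the statistic is an assumption made for the sketch.

```r
# Sketch of ONE split. All names below are illustrative only.
set.seed(1)
n   <- 200
x1  <- runif(n, -3, 3)
x3  <- rchisq(n, 4)
y   <- rbinom(n, 1, plogis(x1 + x3 - 4))
dat <- data.frame(y = y, x1 = x1, x3 = x3)

# 1. split the data into a training set and a validation set of size ne
ne      <- floor(5 * sqrt(n))
val_idx <- sample(n, ne)
train   <- dat[-val_idx, ]
valid   <- dat[val_idx, ]

# 2. fit the model under test on the training set (it omits x3 on purpose)
fit   <- glm(y ~ x1, family = binomial(link = "logit"), data = train)
p_hat <- predict(fit, newdata = valid, type = "response")

# 3. choose the groups from the training set only
#    (stand-in rule: tertiles of x3, NOT the random forest partition)
brk <- as.numeric(quantile(train$x3, probs = c(1/3, 2/3)))
grp <- droplevels(cut(valid$x3, breaks = c(-Inf, brk, Inf)))

# 4. grouped chi-squared-type statistic on the validation set (assumed form)
num     <- tapply(valid$y - p_hat, grp, sum)
den     <- tapply(p_hat * (1 - p_hat), grp, sum)
contri  <- num^2 / den          # contribution of each group
chisq   <- sum(contri)
p_value <- pchisq(chisq, df = length(contri), lower.tail = FALSE)
```

Because the groups are chosen without looking at the validation set, a large contribution from one group points to a region of the covariate space where the fitted probabilities are off.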

Usage

BAGofT(testModel, parFun = parRF(), data, nsplits = 100,
ne = floor(5*nrow(data)^(1/2)), nsim = 100)

Arguments

testModel

a function that generates predicted results from the classifier to test. Details can be found in "testGlmBi" for binomial regression, "testGlmnet" for penalized logistic regression, "testRF" for random forest, and "testXGboost" for XGBoost.

parFun

a function that generates the adaptive partition. The default is ‘parRF()’ that generates a partition by random forest. More information can be found in "parRF".

data

a data frame containing the response and the covariates used in the model, together with any other covariates that are not in the model but are used to generate the partition.

nsplits

number of splits. The default is 100.

ne

the size of the validation set. The default is floor(5*nrow(data)^(1/2)).

nsim

the number of simulated datasets used to calculate the bootstrap p-value. The default is 100.
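As a quick illustration of the default 'ne', the snippet below evaluates floor(5*nrow(data)^(1/2)) for a few sample sizes. The sample sizes are arbitrary values chosen for the illustration; the point is that the validation set takes a shrinking share of the data as the number of rows grows.

```r
# Default validation-set size for a few illustrative sample sizes
n  <- c(200, 1000, 5000)
ne <- floor(5 * n^(1/2))

sizes <- data.frame(n = n,
                    ne = ne,              # validation set size
                    ntrain = n - ne,      # rows left for training
                    frac_valid = ne / n)  # share of data used for validation
print(sizes)
```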

Value

p.value

the bootstrap p-value of the BAGofT test statistic (which combines the results from multiple splits by taking the average).

p.value2

the bootstrap p-value from an alternative version of the BAGofT test statistic (which combines the results from multiple splits by taking the sample median).

p.value3

the bootstrap p-value from an alternative version of the BAGofT test statistic (which combines the results from multiple splits by taking the minimum).

pmean

the BAGofT test statistic (which combines the results from multiple splits by taking the average).

pmedian

an alternative BAGofT test statistic (which combines the results from multiple splits by taking the sample median).

pmin

an alternative BAGofT test statistic (which combines the results from multiple splits by taking the minimum).

simRes

a list that contains the simulated test statistics used to generate the bootstrap p-values. ‘simRes$pmeanSim’, ‘simRes$pmediansim’, ‘simRes$pminSim’ correspond to the three kinds of BAGofT statistics (mean, median, and minimum), respectively.

singleSplit.results

a list that contains the results from each split. Its elements are as follows.

‘singleSplit.results[[k]]$chisq’: The chi-squared statistic of the BAGofT test from the kth split.

‘singleSplit.results[[k]]$p.value’: The p-value calculated from the chi-squared statistic.

‘singleSplit.results[[k]]$ngp’: The number of groups chosen by the adaptive partition.

‘singleSplit.results[[k]]$contri’: The weighted sum of squares from each group.

‘singleSplit.results[[k]]$parRes’: Variable importance (or other results from custom partition functions) from the adaptive partition.
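The sketch below shows how the returned pieces relate to one another. It does not call the package: 'mock_results' is a hypothetical stand-in with the same shape as 'singleSplit.results', the simulated statistics are placeholder draws, and the rule used for the bootstrap p-value (the share of simulated statistics no larger than the observed one) is an assumption made for illustration.

```r
# Hypothetical stand-in for res$singleSplit.results: one p-value per split
set.seed(2)
mock_results <- lapply(1:100, function(k) list(p.value = rbeta(1, 1, 4)))

# collect the per-split p-values
p_splits <- sapply(mock_results, function(s) s$p.value)

# three ways of combining the splits into one statistic
pmean   <- mean(p_splits)
pmedian <- median(p_splits)
pmin    <- min(p_splits)

# Placeholder simulated statistics; in the real test these would come from
# datasets simulated from the fitted model, not from independent uniforms
pmean_sim <- replicate(100, mean(runif(100)))

# Assumed rule: small combined p-values indicate lack of fit, so the
# bootstrap p-value is the share of simulated values at or below the observed
boot_p <- mean(pmean_sim <= pmean)
```

With a real result object, the same 'sapply' pattern can be applied to 'singleSplit.results' to pull out '$chisq' or '$ngp' across splits.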

References

Zhang, Ding and Yang (2021) "Is a Classification Procedure Good Enough?-A Goodness-of-Fit Assessment Tool for Classification Learning" arXiv preprint arXiv:1911.03063v2.

Examples

## Not run: 
###################################################
# Generate a sample dataset.
###################################################
# set the random seed
set.seed(20)
# set the number of observations
n <- 200

# generate covariates data
x1dat <- runif(n, -3, 3)
x2dat <- rnorm(n, 0, 1)
x3dat <- rchisq(n, 4)

# set coefficients
beta1 <- 1
beta2 <- 1
beta3 <- 1

# calculate the linear predictor data
lindat <- x1dat * beta1 + x2dat * beta2 + x3dat * beta3
# calculate the probabilities by inverse logit link
pdat <- 1/(1 + exp(-lindat))

# generate the response data
ydat <- sapply(pdat, function(x) stats::rbinom(1, 1, x))

# generate the dataset
dat <- data.frame(y = ydat, x1 = x1dat, x2 = x2dat,
                    x3 = x3dat)

###################################################
# Obtain the testing result
###################################################
# Test a logistic regression that misses 'x3'. The partition
# variables are 'x1', 'x2', and 'x3'.
testRes <- BAGofT(testModel = testGlmBi(formula = y ~ x1 + x2, link = "logit"),
                  parFun = parRF(parVar = c("x1", "x2", "x3")),
                  data = dat)

# the bootstrap p-value is 0, so the hypothesis that the model fits the data
# adequately is rejected
print(testRes$p.value)

# the variable importance from the adaptive partition shows that x3 is likely
# to be the reason for the lack of fit (which is correct, since the formula
# y ~ x1 + x2 omits x3).
print(VarImp(testRes))

## End(Not run)

[Package BAGofT version 1.0.0 Index]