R: Cross-Validated P-Values (k Nearest Neighbors)

cvpvs.knn {pvclass}

R Documentation

Cross-Validated P-Values (k Nearest Neighbors)

Description

Computes cross-validated nonparametric p-values for the potential class memberships of the training data. The p-values are based on 'k nearest neighbors'.

Usage

cvpvs.knn(X, Y, k = NULL, distance = c('euclidean', 'ddeuclidean',
          'mahalanobis'), cova = c('standard', 'M', 'sym'))

Arguments

`X`	matrix containing training observations, where each observation is a row vector.
`Y`	vector indicating the classes which the training observations belong to.
`k`	number of nearest neighbors. If `k` is a vector or `k = NULL`, the program searches for the best `k`. For more information see section 'Details'.
`distance`	the distance measure: "euclidean": fixed Euclidean distance, "ddeuclidean": data driven Euclidean distance (component-wise standardization), "mahalanobis": Mahalanobis distance.
`cova`	estimator for the covariance matrix: 'standard': standard estimator, 'M': M-estimator, 'sym': symmetrized M-estimator.

Details

Computes cross-validated nonparametric p-values for the potential class memberships of the training data. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'k nearest neighbors' with estimated prior probabilities N(b)/n. Here N(b) is the number of observations of class b and n is the total number of observations.
If k is a vector, the program searches for the best k. To determine the best k for the p-value PV[i,b], the class label of the training observation X[i,] is set temporarily to b and then for all training observations with Y[j] != b the proportion of the k nearest neighbors of X[j,] belonging to class b is computed. Then the k which minimizes the sum of these values is chosen.
If k = NULL, it is set to 2:ceiling(length(Y)/2).

Value

PV is a matrix containing the cross-validated p-values. Precisely, for each feature vector X[i,] and each class b the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
If k is a vector or NULL, PV has an attribute "opt.k", which is a matrix and opt.k[i,b] is the best k for observation X[i,] and class b (see section 'Details'). opt.k[i,b] is used to compute the p-value for observation X[i,] and class b.

Author(s)

Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
www.imsv.unibe.ch/duembgen/index_ger.html

References

Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04

Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at http://dx.doi.org/10.1214/08-EJS245.

Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.

Examples

X <- iris[, 1:4]
Y <- iris[, 5]

cvpvs.knn(X, Y, k = c(5, 10, 15))

[Package pvclass version 1.4 Index]