cvpvs.logreg {pvclass}

R Documentation

Cross-Validated P-Values (Penalized Multicategory Logistic Regression)

Description

Computes cross-validated nonparametric p-values for the potential class memberships of the training data. The p-values are based on 'penalized logistic regression'.

Usage

cvpvs.logreg(X, Y, tau.o=10, find.tau=FALSE, delta=2, tau.max=80, tau.min=1,
             pen.method = c("vectors", "simple", "none"), progress = TRUE)

Arguments

`X`	matrix containing training observations, where each observation is a row vector.
`Y`	vector indicating the classes to which the training observations belong.
`tau.o`	the penalty parameter (see section 'Details' below).
`find.tau`	logical. If TRUE, the program searches for the best `tau`. For more information see section 'Details'.
`delta`	factor for the penalty parameter. Should be greater than 1. Only needed if `find.tau == TRUE`.
`tau.max`	maximal penalty parameter considered. Only needed if `find.tau == TRUE`.
`tau.min`	minimal penalty parameter considered. Only needed if `find.tau == TRUE`.
`pen.method`	the method of penalization (see section 'Details' below).
`progress`	optional parameter for reporting the status of the computations.

Details

Computes cross-validated nonparametric p-values for the potential class memberships of the training data. More precisely, for each feature vector X[i,] and each class b, the number PV[i,b] is a p-value for the null hypothesis that Y[i] equals b, based on the remaining training observations.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'penalized logistic regression'. That is, the conditional probability of Y = y, given X = x, is assumed to be proportional to exp(a_y + b_y^T x). The parameters a_y, b_y are estimated via penalized maximum log-likelihood. The penalty is either a weighted sum of the Euclidean norms of the vectors (b_1[j], b_2[j], ..., b_L[j]), where L denotes the number of classes (pen.method=='vectors'), or a weighted sum of all moduli |b_y[j]| (pen.method=='simple'). In both cases the weight for the j-th component is tau.o times the within-group sample standard deviation of the j-th components of the feature vectors. With pen.method=='none', no penalization is used, but this option may be unstable.
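The model and the two penalties can be illustrated with a few lines of base R. This is only a sketch under stated assumptions, not the package's internal code: the helper names within_sd, penalty and class_probs are hypothetical, and the "within groups" standard deviation is assumed here to be the pooled one (class-wise centring, divisor n - L).

```r
# Hypothetical helper: pooled within-group standard deviation per feature.
# Assumption: each class is centred at its own mean and the pooled sum of
# squares is divided by n - L (n observations, L classes).
within_sd <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.factor(Y)
  centred <- X
  for (lev in levels(Y)) {
    idx <- Y == lev
    centred[idx, ] <- sweep(X[idx, , drop = FALSE], 2,
                            colMeans(X[idx, , drop = FALSE]))
  }
  sqrt(colSums(centred^2) / (nrow(X) - nlevels(Y)))
}

# Penalty term for an L x d slope matrix B (row y holds b_y) and weights w,
# where w[j] = tau * within-group sd of feature j.
penalty <- function(B, w, method = c("vectors", "simple", "none")) {
  method <- match.arg(method)
  switch(method,
         vectors = sum(w * sqrt(colSums(B^2))),  # Euclidean norm per column
         simple  = sum(w * colSums(abs(B))),     # sum of all moduli
         none    = 0)
}

# Class probabilities proportional to exp(a_y + b_y^T x).
class_probs <- function(x, a, B) {
  eta <- a + as.vector(B %*% x)
  p <- exp(eta - max(eta))  # subtract the maximum for numerical stability
  p / sum(p)
}

# Usage with the iris data and an arbitrary coefficient matrix.
X <- as.matrix(iris[, 1:4])
Y <- iris[, 5]
w <- 10 * within_sd(X, Y)                  # weights for tau = 10
B <- matrix(0.1, nrow = 3, ncol = 4)
penalty(B, w, "vectors")
class_probs(X[1, ], a = c(0, 0, 0), B = B)
```

Because the 'vectors' penalty acts on whole columns of B, it treats the coefficients of one feature across all classes as a group, whereas 'simple' penalizes every coefficient separately.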
If find.tau == TRUE, the program searches for the best penalty parameter. To determine the best parameter tau for the p-value PV[i,b], the class label of the training observation X[i,] is temporarily set to b, and for all training observations with Y[j] != b the estimated probability that X[j,] belongs to class b is computed. The tau that minimizes the sum of these values is chosen. The search proceeds in multiplicative steps: first, tau.o is compared with tau.o*delta. If tau.o*delta is better, it is compared with tau.o*delta^2, and so on, up to the maximal parameter tau.max. If instead tau.o is better than tau.o*delta, it is compared with tau.o*delta^(-1), and so on, down to the minimal parameter tau.min.
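The stepwise search over tau can be sketched as follows. The function search_tau and its argument crit are hypothetical and not part of pvclass; crit stands in for the criterion described above (the sum of estimated class-b probabilities over observations with Y[j] != b). The sketch assumes that ties count as "not better" and that the downward search is tried only if the first upward step brings no improvement.

```r
# Hypothetical sketch of the multiplicative search for tau.
# crit: function(tau) returning the criterion value to be minimized.
search_tau <- function(crit, tau.o = 10, delta = 2,
                       tau.max = 80, tau.min = 1) {
  best <- tau.o
  best.val <- crit(best)

  # Upward search: tau.o * delta, tau.o * delta^2, ... while it improves.
  moved.up <- FALSE
  up <- tau.o * delta
  while (up <= tau.max) {
    val <- crit(up)
    if (val >= best.val) break
    best <- up
    best.val <- val
    moved.up <- TRUE
    up <- up * delta
  }

  # Downward search: only if the upward direction did not help.
  if (!moved.up) {
    down <- tau.o / delta
    while (down >= tau.min) {
      val <- crit(down)
      if (val >= best.val) break
      best <- down
      best.val <- val
      down <- down / delta
    }
  }
  best
}

# Usage with two artificial criteria.
tau.up   <- search_tau(function(tau) (log2(tau) - 5)^2)  # improves upward
tau.down <- search_tau(function(tau) (tau - 2.5)^2)      # improves downward
```

Since each step multiplies or divides by delta, the number of criterion evaluations grows only logarithmically in tau.max/tau.min, which matters because every evaluation requires refitting the penalized model.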

Value

PV is a matrix containing the cross-validated p-values. More precisely, for each feature vector X[i,] and each class b, the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b, based on the remaining training observations.
If find.tau == TRUE, PV has an attribute "tau.opt". This attribute is a matrix whose entry tau.opt[i,b] is the best tau for observation X[i,] and class b (see section 'Details'), that is, the value of tau used to compute the p-value PV[i,b].
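A short sketch of how the returned object might be inspected. The matrix below is a made-up stand-in for the output of cvpvs.logreg (its values and column names are assumptions, not real results), and turning the p-values into prediction regions at level alpha, by keeping every class whose p-value exceeds alpha, is one common reading in the spirit of the references below.

```r
# Toy stand-in for PV as returned with find.tau = TRUE (values are made up).
PV <- matrix(c(0.80, 0.02, 0.01,
               0.03, 0.55, 0.20,
               0.01, 0.30, 0.90),
             nrow = 3, byrow = TRUE,
             dimnames = list(NULL, c("a", "b", "c")))
attr(PV, "tau.opt") <- matrix(10, nrow = 3, ncol = 3)

# The penalty parameter used for observation 2 and class 3.
tau.opt <- attr(PV, "tau.opt")
tau.opt[2, 3]

# Classes that are not rejected at level alpha, one set per observation.
alpha <- 0.05
regions <- lapply(seq_len(nrow(PV)),
                  function(i) colnames(PV)[PV[i, ] > alpha])
regions
```

An observation whose region contains more than one class is ambiguous at that level; a region that does not contain the observed class flags a possibly mislabelled or atypical case.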

Author(s)

Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
www.imsv.unibe.ch/duembgen/index_ger.html

References

Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19. doi:10.18637/jss.v078.i04

Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, available at http://dx.doi.org/10.1214/08-EJS245.

Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.

Examples

## Not run: 
X <- iris[, 1:4]
Y <- iris[, 5]

cvpvs.logreg(X, Y, tau.o = 1, pen.method = "vectors", progress = TRUE)

## End(Not run)

# A bigger data example: Buerk's hospital data.
## Not run: 
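data(buerk)
# Columns 1-21 hold the features, column 22 the binary class label (0/1).
X.raw <- as.matrix(buerk[, 1:21])
Y.raw <- buerk[, 22]
n0.raw <- sum(1 - Y.raw)  # number of class-0 observations
n1 <- sum(Y.raw)          # number of class-1 observations
n0 <- 3 * n1              # size of the class-0 subsample

X0 <- X.raw[Y.raw == 0, ]
X1 <- X.raw[Y.raw == 1, ]

# Draw a random subsample of class 0; all class-1 rows are kept.
tmpi0 <- sample(1:n0.raw, size = n0, replace = FALSE)
tmpi1 <- sample(1:n1,     size = n1, replace = FALSE)  # not used below

X <- rbind(X0[tmpi0, ], X1)
Y <- c(rep(1, n0), rep(2, n1))  # relabel the classes as 1 and 2

str(X)
str(Y)

# pen.method is given in full here; the abbreviation "v" also matches.
PV <- cvpvs.logreg(X, Y,
                   tau.o = 5, pen.method = "vectors", progress = TRUE)

analyze.pvs(Y = Y, pv = PV, pvplot = FALSE)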

## End(Not run)

[Package pvclass version 1.4 Index]