pvs.logreg {pvclass} | R Documentation |
P-Values to Classify New Observations (Penalized Multicategory Logistic Regression)
Description
Computes nonparametric p-values for the potential class memberships of new observations. The p-values are based on 'penalized logistic regression'.
Usage
pvs.logreg(NewX, X, Y, tau.o = 10, find.tau = FALSE, delta = 2,
           tau.max = 80, tau.min = 1, a0 = NULL, b0 = NULL,
           pen.method = c('vectors', 'simple', 'none'),
           progress = FALSE)
Arguments
NewX
data matrix consisting of one or several new observations (row vectors) to be classified.
X
matrix containing training observations, where each observation is a row vector.
Y
vector indicating the classes to which the training observations belong.
tau.o
the penalty parameter (see section 'Details' below).
find.tau
logical. If TRUE, the program searches for the best penalty parameter tau (see section 'Details' below).
delta
factor for the penalty parameter; should be greater than 1. Only needed if find.tau == TRUE.
tau.max
maximal penalty parameter considered. Only needed if find.tau == TRUE.
tau.min
minimal penalty parameter considered. Only needed if find.tau == TRUE.
a0, b0
optional starting values for the logistic regression.
pen.method
the method of penalization (see section 'Details' below).
progress
optional parameter for reporting the status of the computations.
Details
Computes nonparametric p-values for the potential class memberships of new observations. Precisely, for each new observation NewX[i,] and each class b, the number PV[i,b] is a p-value for the null hypothesis that Y[i] equals b.
This p-value is based on a permutation test applied to an estimated Bayesian likelihood ratio, using 'penalized logistic regression'. That is, the conditional probability of Y = y, given X = x, is assumed to be proportional to exp(a_y + b_y^T x). The parameters a_y, b_y are estimated via penalized maximum log-likelihood. The penalty is either a weighted sum of the Euclidean norms of the vectors (b_1[j], b_2[j], ..., b_L[j]) (pen.method == 'vectors') or a weighted sum of all moduli |b_theta[j]| (pen.method == 'simple'). The weights are given by tau.o times the sample standard deviation (within groups) of the j-th components of the feature vectors. In case of pen.method == 'none', no penalization is used; note that this option may be numerically unstable.
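For illustration only (this is not the package's internal code), the penalized criterion described above can be sketched as follows; the function name pen.negloglik and its argument layout are hypothetical. Here a is the vector (a_1, ..., a_L), B is the L x d coefficient matrix with rows b_y, and w is the vector of weights, i.e. tau.o times the within-group standard deviations:

```r
## Illustrative sketch of the penalized negative log-likelihood.
## Y is assumed to be coded as integers 1, ..., L.
pen.negloglik <- function(a, B, X, Y, w, pen.method = "vectors") {
  eta <- sweep(X %*% t(B), 2, a, "+")        # n x L matrix of a_y + b_y^T x
  logZ <- log(rowSums(exp(eta)))             # log normalizing constants
  nll <- -sum(eta[cbind(seq_along(Y), Y)] - logZ)
  pen <- switch(pen.method,
    vectors = sum(w * sqrt(colSums(B^2))),   # weighted Euclidean column norms
    simple  = sum(w * colSums(abs(B))),      # weighted sum of all |b_theta[j]|
    none    = 0)
  nll + pen
}
```

With all parameters at zero the criterion reduces to n * log(L), which gives a quick sanity check of the sketch.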
If find.tau == TRUE, the program searches for the best penalty parameter. To determine the best parameter tau for the p-value PV[i,b], the new observation NewX[i,] is added to the training data with class label b, and then, for all training observations with Y[j] != b, the estimated probability of X[j,] belonging to class b is computed. The tau which minimizes the sum of these probabilities is chosen. First, tau.o is compared with tau.o*delta. If tau.o*delta is better, it is compared with tau.o*delta^2, and so on; the maximal parameter considered is tau.max. If tau.o is better than tau.o*delta, it is compared with tau.o*delta^(-1), and so on; the minimal parameter considered is tau.min.
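The geometric search just described can be sketched as follows. This is an illustrative sketch, not the package's implementation; score is a hypothetical placeholder for the sum of estimated probabilities as a function of tau:

```r
## Sketch of the doubling/halving search over the penalty parameter.
## Moves away from tau.o by factors of delta as long as the score improves,
## within the interval [tau.min, tau.max].
find.best.tau <- function(score, tau.o = 10, delta = 2,
                          tau.max = 80, tau.min = 1) {
  tau <- tau.o
  if (score(tau * delta) < score(tau)) {
    while (tau * delta <= tau.max && score(tau * delta) < score(tau))
      tau <- tau * delta                     # move upwards: tau.o*delta, tau.o*delta^2, ...
  } else {
    while (tau / delta >= tau.min && score(tau / delta) < score(tau))
      tau <- tau / delta                     # move downwards: tau.o*delta^(-1), ...
  }
  tau
}
```

Note that the search is greedy: it stops at the first factor of delta that no longer improves the score, so it finds a local rather than a global minimizer.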
Value
PV
a matrix containing the p-values: for each new observation NewX[i,] and each class b, the number PV[i,b] is a p-value for the null hypothesis that Y[i] = b.
If find.tau == TRUE, PV has an attribute "tau.opt", a matrix whose entry tau.opt[i,b] is the best tau for observation NewX[i,] and class b (see section 'Details'). This tau.opt[i,b] is used to compute the p-value for observation NewX[i,] and class b.
Author(s)
Niki Zumbrunnen niki.zumbrunnen@gmail.com
Lutz Dümbgen lutz.duembgen@stat.unibe.ch
www.imsv.unibe.ch/duembgen/index_ger.html
References
Zumbrunnen N. and Dümbgen L. (2017) pvclass: An R Package for p Values for Classification. Journal of Statistical Software 78(4), 1–19, doi:10.18637/jss.v078.i04.
Dümbgen L., Igl B.-W. and Munk A. (2008) P-Values for Classification. Electronic Journal of Statistics 2, 468–493, doi:10.1214/08-EJS245.
Zumbrunnen N. (2014) P-Values for Classification – Computational Aspects and Asymptotics. Ph.D. thesis, University of Bern, available at http://boris.unibe.ch/id/eprint/53585.
See Also
pvs, pvs.gaussian, pvs.knn, pvs.wnn
Examples
# Use Fisher's iris data, holding out one plant per species
# to be classified as a 'new' observation.
X <- iris[c(1:49, 51:99, 101:149), 1:4]
Y <- iris[c(1:49, 51:99, 101:149), 5]
NewX <- iris[c(50, 100, 150), 1:4]
pvs.logreg(NewX, X, Y, tau.o = 1, pen.method = "vectors", progress = TRUE)
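When find.tau = TRUE, the chosen penalty parameters can be inspected via the "tau.opt" attribute of the returned matrix (see section 'Value'). Continuing the iris example above:

```r
# Repeat the iris example with an automatic search for tau.
PV <- pvs.logreg(NewX, X, Y, tau.o = 1, find.tau = TRUE,
                 delta = 2, tau.max = 80, tau.min = 1)
PV                     # matrix of p-values
attr(PV, "tau.opt")    # best tau for each (observation, class) pair
```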
# A bigger data example: Buerk's hospital data.
## Not run:
data(buerk)
X.raw <- as.matrix(buerk[, 1:21])
Y.raw <- buerk[, 22]
n0.raw <- sum(1 - Y.raw)
n1 <- sum(Y.raw)
n0 <- 3 * n1
X0 <- X.raw[Y.raw == 0, ]
X1 <- X.raw[Y.raw == 1, ]
# Draw random subsamples of both groups ...
tmpi0 <- sample(1:n0.raw, size = n0, replace = FALSE)
tmpi1 <- sample(1:n1,     size = n1, replace = FALSE)
# ... and reserve the last 100 observations of each group as test data.
Xtrain <- rbind(X0[tmpi0[1:(n0 - 100)], ], X1[tmpi1[1:(n1 - 100)], ])
Ytrain <- c(rep(1, n0 - 100), rep(2, n1 - 100))
Xtest <- rbind(X0[tmpi0[(n0 - 99):n0], ], X1[tmpi1[(n1 - 99):n1], ])
Ytest <- c(rep(1, 100), rep(2, 100))
PV <- pvs.logreg(Xtest,Xtrain,Ytrain,tau.o=2,progress=TRUE)
analyze.pvs(Y=Ytest,pv=PV,pvplot=FALSE)
## End(Not run)