pvs {klaR} | R Documentation |
Pairwise variable selection for classification
Description
Pairwise variable selection for numerical data, allowing the use of different classifiers and different variable selection methods.
Usage
pvs(x, ...)
## Default S3 method:
pvs(x, grouping, prior=NULL, method="lda",
vs.method=c("ks.test","stepclass","greedy.wilks"), niveau=0.05,
fold=10, impr=0.1, direct="backward", out=FALSE, ...)
## S3 method for class 'formula'
pvs(formula, data = NULL, ...)
Arguments
x |
matrix or data frame containing the explanatory variables
(required, if |
formula |
A formula of the form |
data |
data matrix (rows=cases, columns=variables) |
grouping |
class indicator vector (a factor) |
prior |
prior probabilities for the classes. If not specified, the prior probabilities are set according to the class proportions in “grouping”. If specified, the order of the prior probabilities must be the same as in “grouping”. |
method |
character, name of classification function (e.g. “ |
vs.method |
character, name of variable selection method. Must be one of “ |
niveau |
used niveau for “ |
fold |
parameter for cross-validation, if “ |
impr |
least improvement of performance measure desired to include or exclude any variable (<=1), if “ |
direct |
direction of variable selection, if “ |
out |
indicator (logical) for text output during computation (slows down computation!), if “ |
... |
further parameters passed to classification function (‘ |
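The default for ‘prior’ described above (class proportions in “grouping”) can be reproduced by hand. The following is a minimal base R sketch, not code taken from the klaR sources; the `iris` data and the names `default_prior` and `custom_prior` are illustrative assumptions, and reading "same order as in grouping" as the order of the factor levels is also an assumption.

```r
## Hypothetical illustration (base R only); not taken from the klaR sources.
grouping <- iris$Species                             # class indicator (a factor)

## Class proportions, i.e. what the documented default amounts to
default_prior <- table(grouping) / length(grouping)
default_prior

## A user-supplied prior is assumed here to follow the order of the factor levels
levels(grouping)
custom_prior <- c(0.5, 0.25, 0.25)                   # one entry per level, summing to 1
```

A vector such as `custom_prior` would then be passed as `prior = custom_prior` in the call to ‘pvs’.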
Details
The classification “method” (e.g. ‘lda’) must have its own ‘predict’ method (like ‘predict.lda’ for ‘lda’) that returns a list with an element ‘posterior’ containing the posterior probabilities. It must be able to deal with matrices as in method(x, grouping, ...).
Examples of such classification methods are ‘lda’, ‘qda’, ‘rda’, ‘NaiveBayes’ or ‘sknn’.
For the classification methods “svm” and “randomForest”, special routines are implemented to make them work with ‘pvs’ even though their ‘predict’ methods do not provide the required posteriors. However, these two classifiers cannot be used together with the variable selection method “stepclass”.
‘pvs’ performs a variable selection using the selection method chosen in ‘vs.method’ for each pair of classes in ‘x’.
Then, for each pair of classes, a submodel using ‘method’ is trained (using only the variables selected earlier for this class pair).
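The pairwise decomposition described above can be sketched in base R. This is only an illustration of how the class pairs and their data subsets could be formed, not the klaR implementation; the `iris` data and the names `class_pairs` and `pair_data` are assumptions made for the example.

```r
## Hypothetical illustration (base R only); not the klaR implementation.
x <- as.matrix(iris[, 1:4])          # explanatory variables
grouping <- iris$Species             # class indicator (a factor)

## All unordered pairs of classes
class_pairs <- combn(levels(grouping), 2, simplify = FALSE)

## For each pair, keep only the cases belonging to one of the two classes
pair_data <- lapply(class_pairs, function(pr) {
  keep <- grouping %in% pr
  list(classes  = pr,
       x        = x[keep, , drop = FALSE],
       grouping = droplevels(grouping[keep]))
})

length(pair_data)                    # number of class pairs
```

Variable selection and submodel training would then operate on each element of `pair_data` separately.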
If ‘vs.method’ is “ks.test”, then for each variable the empirical distribution functions of the cases of both classes are compared via “ks.test”. Only variables with a p-value below ‘niveau’ are used for training the submodel for this pair of classes.
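For a single pair of classes, the “ks.test” selection rule can be sketched as follows. This is a minimal base R illustration under stated assumptions (the `iris` data, the chosen class pair, and the names `pvals` and `selected` are all hypothetical); it is not the code used inside ‘pvs’.

```r
## Hypothetical illustration (base R only); not the code used inside pvs().
x <- as.matrix(iris[, 1:4])
grouping <- iris$Species
niveau <- 0.05                                  # significance level

pair <- c("versicolor", "virginica")            # one pair of classes
in_a <- grouping == pair[1]
in_b <- grouping == pair[2]

## Two-sample Kolmogorov-Smirnov test per variable
## (warnings about ties in the data are suppressed)
pvals <- apply(x, 2, function(v) {
  suppressWarnings(ks.test(v[in_a], v[in_b])$p.value)
})

## Keep only variables whose p-value falls below niveau
selected <- names(pvals)[pvals < niveau]
selected
```

Lowering `niveau` makes the selection stricter, so fewer variables enter the submodel for that pair.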
If ‘vs.method’ is “stepclass”, the variable selection is performed using the “stepclass” method.
If ‘vs.method’ is “greedy.wilks”, the variable selection is performed using the Wilks' lambda criterion.
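As background for the Wilks' lambda criterion, the statistic itself can be computed by hand as the ratio of the determinants of the within-group and total sum-of-squares-and-cross-products matrices. The sketch below evaluates it for one class pair and a fixed set of variables; it does not reproduce the stepwise search of “greedy.wilks”, and the `iris` data and all variable names are assumptions made for the example.

```r
## Hypothetical illustration (base R only): Wilks' lambda for a fixed variable set.
## This computes the statistic only; it is not the greedy stepwise selection.
keep <- iris$Species != "setosa"                 # restrict to one pair of classes
x <- as.matrix(iris[keep, 1:4])
g <- droplevels(iris$Species[keep])

## Sum-of-squares-and-cross-products (SSCP) matrix of column-centered data
sscp <- function(m) crossprod(scale(m, scale = FALSE))

total  <- sscp(x)                                            # total SSCP
within <- Reduce(`+`, lapply(levels(g), function(l) {
  sscp(x[g == l, , drop = FALSE])                            # per-class SSCP
}))

lambda <- det(within) / det(total)               # values near 0: well separated classes
lambda
```

Smaller values of `lambda` indicate that the chosen variables separate the two classes better.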
Value
An object of class ‘pvs
’ containing the following components:
classes |
the classes in grouping |
prior |
used prior probabilities |
method |
name of used classification function |
vs.method |
name of used function for variable selection |
submodels |
containing a list of submodels. For each pair of classes there is a list element being another list of 3 containing the class-pair of this submodel, the selected variables for the subspace of classes and the result of the trained classification function. |
call |
the (matched) function call |
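The components listed above can be inspected on a fitted object. The following sketch assumes that the klaR package is installed and uses the `iris` data purely for illustration; the object name `fit` is hypothetical, and no particular output is implied.

```r
## Hypothetical illustration; requires the klaR package to be installed.
if (requireNamespace("klaR", quietly = TRUE)) {
  library("klaR")

  ## Warnings about ties from ks.test are suppressed for readability
  fit <- suppressWarnings(
    pvs(iris[, 1:4], iris$Species, method = "lda", vs.method = "ks.test")
  )

  print(names(fit))                        # components of the 'pvs' object
  print(length(fit$submodels))             # one list element per pair of classes
  str(fit$submodels[[1]], max.level = 1)   # structure of the first submodel entry
}
```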
Author(s)
Gero Szepannek, szepannek@statistik.tu-dortmund.de, Christian Neumann
References
Szepannek, G. and Weihs, C. (2006) Variable Selection for Classification of More than Two Classes Where the Data are Sparse. In From Data and Information Analysis to Knowledge Engineering, eds. Spiliopoulou, M., Kruse, R., Borgelt, C., Nuernberger, A. and Gaul, W. pp. 700-708. Springer, Heidelberg.
Szepannek, G. (2008): Different Subspace Classification - Datenanalyse, -interpretation, -visualisierung und Vorhersage in hochdimensionalen Raeumen, ISBN 978-3-8364-6302-7, vdm, Saarbruecken.
See Also
predict.pvs for predicting ‘pvs’ models and locpvs for pairwise variable selection in local models of several subclasses
Examples
## Example 1: learn an "lda" model on the waveform data using pairwise variable
## selection (pvs) using "ks.test" and compare it to using lda without pvs
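library("klaR")     # provides pvs(); attaches MASS, which provides lda()
library("mlbench")  # provides mlbench.waveform()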
trainset <- mlbench.waveform(300)
pvsmodel <- pvs(trainset$x, trainset$classes, niveau=0.05) # default: using method="lda"
## short summary, showing the class-pairs of the submodels and the selected variables
pvsmodel
testset <- mlbench.waveform(500)
## prediction of the test data set:
prediction <- predict(pvsmodel, testset$x)
## calculating the test error rate
1-sum(testset$classes==prediction$class)/length(testset$classes)
## Bayes error is 0.149
## comparison to performance of simple lda
ldamodel <- lda(trainset$x, trainset$classes)
LDAprediction <- predict(ldamodel, testset$x)
## test error rate
1-sum(testset$classes==LDAprediction$class)/length(testset$classes)
## Example 2: learn a "qda" model with pvs on half of the Satellite dataset,
## using "ks.test"
library("mlbench")
data("Satellite")
## takes few seconds as exact KS tests are calculated here:
model <- pvs(classes ~ ., Satellite[1:3218,], method="qda", vs.method="ks.test")
## short summary, showing the class-pairs of the submodels and the selected variables
model
## now predict on the rest of the data set:
## pred <- predict(model,Satellite[3219:6435,]) # takes some time
pred <- predict(model,Satellite[3219:6435,], quick=TRUE) # that's much quicker
## now you can look at the predicted classes:
pred$class
## or the posterior probabilities:
pred$posterior