knnest, meany, vary, loclin, predict.knn, preprocessx, kmin, parvsnonparplot, nonparvsxplot, l1, l2, kNN, bestKperPoint {regtools}    R Documentation
k-NN Nonparametric Regression and Classification
Description
A full set of tools for k-NN regression and classification, suitable both for direct use and for assessing the fit of parametric models.
Usage
kNN(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,expandVars=NULL,expandVals=NULL,
smoothingFtn=mean,allK=FALSE,leave1out=FALSE, classif=FALSE,
startAt1=TRUE,saveNhbrs=FALSE,savedNhbrs=NULL)
knnest(y,xdata,k,nearf=meany)
preprocessx(x,kmax,xval=FALSE)
meany(nearIdxs,x,y,predpt)
mediany(nearIdxs,x,y,predpt)
vary(nearIdxs,x,y,predpt)
loclin(nearIdxs,x,y,predpt)
## S3 method for class 'knn'
predict(object,...)
kmin(y,xdata,lossftn=l2,nk=5,nearf=meany)
parvsnonparplot(lmout,knnout,cex=1.0)
nonparvsxplot(knnout,lmout=NULL)
nonparvarplot(knnout,returnPts=FALSE)
l2(y,muhat)
l1(y,muhat)
MAPE(yhat,y)
bestKperPoint(x,y,maxK,lossFtn="MAPE",classif=FALSE)
kNNallK(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,
expandVars=NULL,expandVals=NULL,smoothingFtn=mean,
allK=FALSE,leave1out=FALSE,classif=FALSE,startAt1=TRUE)
kNNxv(x,y,k,scaleX=TRUE,PCAcomps=0,smoothingFtn=mean,
nSubSam=500)
loclogit(nearIdxs,x,y,predpt)
exploreExpVars(xtrn,ytrn,xtst,ytst,k,eVar,maxEVal,lossFtn,
eValIncr=0.05,classif=FALSE,leave1out=FALSE)
plotExpVars(xtrn,ytrn,xtst,ytst,k,eVars,maxEVal,lossFtn,
ylim,eValIncr=0.05,classif=FALSE,leave1out=FALSE)
Arguments
nearf
Function to be applied to a neighborhood.
ylim
Range of Y values for the plot.
lossFtn
Loss function for the plot.
eVar
Variable to be expanded.
eVars
Variables to be expanded.
maxEVal
Maximum expansion value.
eValIncr
Increment in the range of expansion values.
xtrn
Training set for X.
ytrn
Training set for Y.
xtst
Test set for X.
ytst
Test set for Y.
nearIdxs
Indices of the neighbors.
nSubSam
Number of folds.
x
"X" data, predictors, one row per data point, in the training set.
y
Response variable data in the training set. Vector or matrix, the latter case for a vector-valued response, e.g. multiclass classification. In the multiclass case, y may also be given as a vector of class labels, coded either (0,1,2,...) or (1,2,3,...), which is automatically converted to a matrix of dummies.
newx
New data points to be predicted. If NULL in a call to kNN, the regression function is estimated at all points in x and saved in the return value, for later prediction via predict.
scaleX
If TRUE, apply scale() to the X data before processing.
PCAcomps
If positive, transform the X data via PCA, retaining the top PCAcomps principal components.
expandVars
Indices of the columns in x to be expanded (see Details).
expandVals
The corresponding expansion values.
smoothingFtn
Function to apply to the "Y" values in the set of nearest neighbors. Built-in choices are meany, mediany, vary and loclin.
allK
If TRUE, find regression estimates for all values of k through kmax.
leave1out
If TRUE, omit the 1-nearest neighbor from the analysis.
classif
If TRUE, compute the predicted class labels, not just the regression function values.
startAt1
If TRUE, class labels start at 1, else at 0.
k
Number of nearest neighbors.
saveNhbrs
If TRUE, save the nearest-neighbor information in the nhbrs component of the return value, for reuse in later calls.
savedNhbrs
If non-NULL, the nhbrs component from the return value of a previous call, to be reused here.
...
Needed for consistency with the generic. See Details below for the actual arguments.
xdata
X data and associated neighbor indices; the output of preprocessx.
object
Output of kNN.
predpt
One point on which to predict, as a vector.
kmax
Maximal number of nearest neighbors to find.
maxK
Maximal number of nearest neighbors to find.
xval
Cross-validation flag. If TRUE, the set of nearest neighbors of a point will not include the point itself.
lossftn
Loss function to be used in the cross-validation determination of the "best" value of k.
nk
Number of values of k to try in cross-validation.
lmout
Output of lm.
knnout
Output of kNN or knnest.
cex
R parameter controlling dot size in the plot.
muhat
Vector of estimated regression function values.
yhat
Vector of estimated regression function values.
returnPts
If TRUE, return the matrix of plotted points.
Details
The kNN function is the main tool here; knnest is being deprecated. (Note too qeKNN, a wrapper for kNN; more on this below.) Here are the capabilities:
In its most basic form, the function inputs training data and outputs predictions for the new cases newx. By default this is done for a single value of the number k of nearest neighbors, but by setting allK to TRUE, the user can request that it be done for all k through the specified maximum.
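For instance, here is a minimal sketch of the allK option on toy data (the variable names are illustrative, not part of the package):
x <- matrix(rnorm(200),ncol=2)   # 100 toy data points, 2 predictors
y <- x[,1] + rnorm(100,sd=0.1)
out <- kNN(x,y,newx=matrix(c(0,0),nrow=1),kmax=10,allK=TRUE)
out$regests   # estimates at (0,0) for each k = 1,...,10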
In the second form, newx is set to NULL in the call to kNN. No predictions are made; instead, the regression function is estimated on all data points in x, and those estimates are saved in the return value. Future new cases can then be predicted from this saved object, via predict.kNN (called via the generic predict). The call form is predict(knnout,newx,newxK), with a default value of 1 for newxK.
In this second form, the closest k points to newx in x are determined as usual, but instead of averaging their Y values, the average is taken over the fitted regression estimates at those points. In this manner, there is almost no computational cost in the prediction stage. The second form is thus intended more for production use, so that neighbor distances need not be repeatedly recomputed.
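A sketch of this fit-now-predict-later workflow, continuing the toy x and y from the sketch above:
knnout <- kNN(x,y,NULL,10)           # second form: fit only, no predictions yet
predict(knnout,c(0,0),1)             # predict later; newxK defaults to 1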
Nearest-neighbor computation can be time-consuming. If more than one value of k is anticipated for the same x, y and newx, first run with the largest anticipated value of k, with saveNhbrs set to TRUE. Then for other values of k, set savedNhbrs to the nhbrs component in the return value of the first call.
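For example (a sketch, again with the toy x and y from above):
out25 <- kNN(x,y,NULL,25,saveNhbrs=TRUE)          # compute neighbors once, for the largest k
out10 <- kNN(x,y,NULL,10,savedNhbrs=out25$nhbrs)  # reuse them for a smaller k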
In addition, a novel feature allows the user to weight some predictors more than others. This is done by scaling the given predictor up or down, according to a specified value. Normally, this should be done with scaleX = TRUE, which applies scale() to the data. In other words, first we create a "level playing field" in which all predictors have standard deviation 1.0, then scale some of them up or down.
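A sketch, with arbitrary expansion values (see also the weighted-distance example in Examples below):
# deweight predictor 1 by 0.5, upweight predictor 2 by 1.5
kNN(x,y,NULL,5,scaleX=TRUE,expandVars=c(1,2),expandVals=c(0.5,1.5))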
Alternatives are provided to computing the mean Y in the given neighborhood, such as the median and the variance, the latter of possible use in dealing with heterogeneity of variance in linear models.
Another choice of note is to allow local-linear smoothing, by setting smoothingFtn to loclin. Here the value of the regression function at a point is predicted from a linear fit to the point's neighbors. This may be especially helpful to counteract bias near the edges of the data. As in any regression fit, the number of predictors should be considerably less than the number of neighbors. Custom functions for smoothing can easily be written, say following the pattern of loclin.
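For instance, here is a hypothetical custom smoother following the meany/loclin argument pattern; trimMeany is not part of the package:
# trimmed mean of the neighbors' Y values
trimMeany <- function(nearIdxs,x,y,predpt) mean(y[nearIdxs],trim=0.1)
kNN(x,y,NULL,10,smoothingFtn=trimMeany)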
The main alternative to kNN is qeKNN in the qe* ("quick and easy") series. It is more convenient, e.g. allowing factor inputs, but less flexible.
The functions ovaknntrn and ovaknnpred are multiclass wrappers for knnest and knnpred, and thus are also deprecated. Here y is coded 0,1,...,m-1 for the m classes.
The tools here can be useful for fit assessment of parametric models. The parvsnonparplot function plots the fitted values of the parametric model against the k-NN fitted values; nonparvsxplot plots the k-NN fitted values against each predictor, one by one.
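A sketch of this fit-assessment workflow, using the mlb data as in the Examples below (these plot functions date from the knnest era; whether they accept kNN output directly, as assumed here, may depend on the package version):
data(mlb)
mlb <- mlb[,c(4,6,5)]                      # height, age, weight
lmout <- lm(mlb[,3] ~ mlb[,1] + mlb[,2])   # parametric fit
knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25)   # k-NN fit at the training points
parvsnonparplot(lmout,knnout)              # parametric vs. k-NN fitted values
nonparvsxplot(knnout,lmout)                # k-NN fitted values vs. each predictor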
The functions l2 and l1 are used to define L2 and L1 loss.
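For example, assuming (as the names suggest) squared-error and absolute-error loss; check the package source for the exact convention, e.g. per-point versus mean loss:
ytrue <- c(1.5,1.8); yhat <- c(1.0,2.0)
l2(ytrue,yhat)   # L2 (squared-error) loss
l1(ytrue,yhat)   # L1 (absolute-error) loss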
Author(s)
Norm Matloff
Examples
x <- rbind(c(1,0),c(2,5),c(0,5),c(3,3),c(6,3))
y <- c(8,3,10,11,4)
newx <- c(0,0)
kNN(x,y,newx,2,scaleX=FALSE)
# $whichClosest
# [,1] [,2]
# [1,] 1 4
# $regests
# [1] 9.5
kNN(x,y,newx,3,scaleX=FALSE,smoothingFtn=loclin)$regests
# 7.307692
knnout <- kNN(x,y,newx,2,scaleX=FALSE)
knnout
# $whichClosest
# [,1] [,2]
# [1,] 1 4
# ...
## Not run:
data(mlb)
mlb <- mlb[,c(4,6,5)] # height, age, weight
# fit, then predict 75", age 21, and 72", age 32
knnout <- kNN(mlb[,1:2],mlb[,3],rbind(c(75,21),c(72,32)),25)
knnout$regests
# [1] 202.72 195.72
# fit now, predict later
knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25)
predict(knnout,c(70,28))
# [1] 186.48
data(peDumms)
names(peDumms)
ped <- peDumms[,c(1,20,22:27,29,31,32)]
names(ped)
# fit, and predict income of a 35-year-old man, MS degree, occupation 101,
# worked 50 weeks, using 25 nearest neighbors
kNN(ped[,-10],ped[,10],c(35,1,0,0,1,0,0,0,1,50),25)$regests
# [1] 67540
# fit, and predict occupation 101 for a 35-year-old man, MS degree,
# wage $55K, worked 50 weeks, using 25 nearest neighbors
z <- kNN(ped[,-c(4:8)],ped[,4],c(35,1,0,1,55,50),25,classif=TRUE)
z$regests
# [1] 0.16
z$ypreds
# [1] 0 ; class 0, i.e. not occupation 101; round(0.16) = 0,
# computed by user request, classif = TRUE
# the y argument must be either a vector (2-class setting) or a matrix
# (multiclass setting)
occs <- as.matrix(ped[, 4:8])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3 occupation 3, i.e. 102, is predicted
# predict occupation in general; let's bring occ.141 back in (was
# excluded as a predictor due to redundancy)
names(peDumms)
# [1] "age" "cit.1" "cit.2" "cit.3" "cit.4" "cit.5" "educ.1"
# [8] "educ.2" "educ.3" "educ.4" "educ.5" "educ.6" "educ.7" "educ.8"
# [15] "educ.9" "educ.10" "educ.11" "educ.12" "educ.13" "educ.14" "educ.15"
# [22] "educ.16" "occ.100" "occ.101" "occ.102" "occ.106" "occ.140" "occ.141"
# [29] "sex.1" "sex.2" "wageinc" "wkswrkd" "yrentry"
occs <- as.matrix(peDumms[,23:28])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3 prediction is occ.102
# baseline fit, no predictor weighting yet; use leave1out to avoid overfitting
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25341.6
# use of the weighted distance feature; deweight age by a factor of 0.5,
# put increased weight on weeks worked, factor of 1.5
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,
expandVars=c(1,10),expandVals=c(0.5,1.5),leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25196.61
## End(Not run)