knnest, meany, vary, loclin, predict.knn, preprocessx, kmin, parvsnonparplot, nonparvsxplot, l1, l2, kNN, bestKperPoint {regtools}    R Documentation
k-NN Nonparametric Regression and Classification
Description
A full set of tools for k-NN regression and classification, suitable both for direct use and for assessing the fit of parametric models.
Usage
kNN(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,expandVars=NULL,expandVals=NULL,
smoothingFtn=mean,allK=FALSE,leave1out=FALSE, classif=FALSE,
startAt1=TRUE,saveNhbrs=FALSE,savedNhbrs=NULL)
knnest(y,xdata,k,nearf=meany)
preprocessx(x,kmax,xval=FALSE)
meany(nearIdxs,x,y,predpt)
mediany(nearIdxs,x,y,predpt)
vary(nearIdxs,x,y,predpt)
loclin(nearIdxs,x,y,predpt)
## S3 method for class 'knn'
predict(object,...)
kmin(y,xdata,lossftn=l2,nk=5,nearf=meany)
parvsnonparplot(lmout,knnout,cex=1.0)
nonparvsxplot(knnout,lmout=NULL)
nonparvarplot(knnout,returnPts=FALSE)
l2(y,muhat)
l1(y,muhat)
MAPE(yhat,y)
bestKperPoint(x,y,maxK,lossFtn="MAPE",classif=FALSE)
kNNallK(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,
expandVars=NULL,expandVals=NULL,smoothingFtn=mean,
allK=FALSE,leave1out=FALSE,classif=FALSE,startAt1=TRUE)
kNNxv(x,y,k,scaleX=TRUE,PCAcomps=0,smoothingFtn=mean,
nSubSam=500)
loclogit(nearIdxs,x,y,predpt)
exploreExpVars(xtrn,ytrn,xtst,ytst,k,eVar,maxEVal,lossFtn,
eValIncr=0.05,classif=FALSE,leave1out=FALSE)
plotExpVars(xtrn,ytrn,xtst,ytst,k,eVars,maxEVal,lossFtn,
ylim,eValIncr=0.05,classif=FALSE,leave1out=FALSE)
Arguments
nearf
Function to be applied to a neighborhood.
ylim
Range of Y values for the plot.
lossFtn
Loss function for the plot.
eVar
Variable to be expanded.
eVars
Variables to be expanded.
maxEVal
Maximum expansion value.
eValIncr
Increment in the range of expansion values.
xtrn
Training set for X.
ytrn
Training set for Y.
xtst
Test set for X.
ytst
Test set for Y.
nearIdxs
Indices of the neighbors.
nSubSam
Number of folds.
x
"X" data, predictors, one row per data point, in the training set.
y
Response variable data in the training set. Vector or matrix, the latter case for a vector-valued response, e.g. multiclass classification. In the multiclass case, y may also be given as a vector of class labels, coded either (0,1,2,...) or (1,2,3,...), which is automatically converted to a matrix of dummies.
newx
New data points to be predicted. If NULL in a call to kNN, the regression function is estimated at all points in x and saved in the return value, for later prediction via predict.
scaleX
If TRUE, apply scale() to the X data before processing.
PCAcomps
If positive, transform the X data via PCA, retaining the top PCAcomps principal components.
expandVars
Indices of the columns in x to be expanded (see Details).
expandVals
The corresponding expansion values.
smoothingFtn
Function to apply to the "Y" values in the set of nearest neighbors. Built-in choices are meany, mediany, vary and loclin.
allK
If TRUE, find regression estimates for all values of k through kmax.
leave1out
If TRUE, omit the 1-nearest neighbor from the analysis.
classif
If TRUE, compute the predicted class labels, not just the regression function values.
startAt1
If TRUE, class labels start at 1, else at 0.
k
Number of nearest neighbors.
saveNhbrs
If TRUE, save the nearest-neighbor information in the nhbrs component of the return value, for reuse in later calls.
savedNhbrs
If non-NULL, the nhbrs component from the return value of a previous call, to be reused here.
...
Needed for consistency with the generic. See Details below for the actual arguments.
xdata
X data and associated neighbor indices; the output of preprocessx.
object
Output of kNN.
predpt
One point on which to predict, as a vector.
kmax
Maximal number of nearest neighbors to find.
maxK
Maximal number of nearest neighbors to find.
xval
Cross-validation flag. If TRUE, the set of nearest neighbors of a point will not include the point itself.
lossftn
Loss function to be used in the cross-validation determination of the "best" value of k.
nk
Number of values of k to try in cross-validation.
lmout
Output of lm.
knnout
Output of kNN or knnest.
cex
R parameter controlling dot size in the plot.
muhat
Vector of estimated regression function values.
yhat
Vector of estimated regression function values.
returnPts
If TRUE, return the matrix of plotted points.
Details
The kNN function is the main tool here; knnest is being deprecated. (Note too qeKNN, a wrapper for kNN; more on this below.) Here are the capabilities:
In its most basic form, the function inputs training data and outputs predictions for the new cases newx. By default this is done for a single value of the number k of nearest neighbors, but by setting allK to TRUE, the user can request that it be done for all k through the specified maximum.
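For instance, here is a minimal sketch of the allK option on toy data (the variable names are illustrative, not part of the package):
x <- matrix(rnorm(200),ncol=2)   # 100 toy data points, 2 predictors
y <- x[,1] + rnorm(100,sd=0.1)
out <- kNN(x,y,newx=matrix(c(0,0),nrow=1),kmax=10,allK=TRUE)
out$regests   # estimates at (0,0) for each k = 1,...,10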
In the second form, newx is set to NULL in the call to kNN. No predictions are made; instead, the regression function is estimated on all data points in x, and those estimates are saved in the return value. Future new cases can then be predicted from this saved object, via predict.kNN (called via the generic predict). The call form is predict(knnout,newx,newxK), with a default value of 1 for newxK.
In this second form, the closest k points to newx in x are determined as usual, but instead of averaging their Y values, the average is taken over the fitted regression estimates at those points. In this manner, there is almost no computational cost in the prediction stage. The second form is thus intended more for production use, so that neighbor distances need not be repeatedly recomputed.
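A sketch of this fit-now-predict-later workflow, continuing the toy x and y from the sketch above:
knnout <- kNN(x,y,NULL,10)           # second form: fit only, no predictions yet
predict(knnout,c(0,0),1)             # predict later; newxK defaults to 1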
Nearest-neighbor computation can be time-consuming. If more than one value of k is anticipated for the same x, y and newx, first run with the largest anticipated value of k, with saveNhbrs set to TRUE. Then for other values of k, set savedNhbrs to the nhbrs component in the return value of the first call.
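For example (a sketch, again with the toy x and y from above):
out25 <- kNN(x,y,NULL,25,saveNhbrs=TRUE)          # compute neighbors once, for the largest k
out10 <- kNN(x,y,NULL,10,savedNhbrs=out25$nhbrs)  # reuse them for a smaller k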
In addition, a novel feature allows the user to weight some predictors more than others. This is done by scaling the given predictor up or down, according to a specified value. Normally, this should be done with scaleX = TRUE, which applies scale() to the data. In other words, first we create a "level playing field" in which all predictors have standard deviation 1.0, then scale some of them up or down.
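A sketch, with arbitrary expansion values (see also the weighted-distance example in Examples below):
# deweight predictor 1 by 0.5, upweight predictor 2 by 1.5
kNN(x,y,NULL,5,scaleX=TRUE,expandVars=c(1,2),expandVals=c(0.5,1.5))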
Alternatives are provided to computing the mean Y in the given neighborhood, such as the median and the variance, the latter of possible use in dealing with heterogeneity of variance in linear models.
Another choice of note is to allow local-linear smoothing, by setting smoothingFtn to loclin. Here the value of the regression function at a point is predicted from a linear fit to the point's neighbors. This may be especially helpful to counteract bias near the edges of the data. As in any regression fit, the number of predictors should be considerably less than the number of neighbors. Custom functions for smoothing can easily be written, say following the pattern of loclin.
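For instance, here is a hypothetical custom smoother following the meany/loclin argument pattern; trimMeany is not part of the package:
# trimmed mean of the neighbors' Y values
trimMeany <- function(nearIdxs,x,y,predpt) mean(y[nearIdxs],trim=0.1)
kNN(x,y,NULL,10,smoothingFtn=trimMeany)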
The main alternative to kNN is qeKNN in the qe* ("quick and easy") series. It is more convenient, e.g. allowing factor inputs, but less flexible.
The functions ovaknntrn and ovaknnpred are multiclass wrappers for knnest and knnpred, and thus are also deprecated. Here y is coded 0,1,...,m-1 for the m classes.
The tools here can be useful for fit assessment of parametric models. The parvsnonparplot function plots the fitted values of the parametric model against the k-NN fitted values; nonparvsxplot plots the k-NN fitted values against each predictor, one by one.
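A sketch of this fit-assessment workflow, using the mlb data as in the Examples below (these plot functions date from the knnest era; whether they accept kNN output directly, as assumed here, may depend on the package version):
data(mlb)
mlb <- mlb[,c(4,6,5)]                      # height, age, weight
lmout <- lm(mlb[,3] ~ mlb[,1] + mlb[,2])   # parametric fit
knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25)   # k-NN fit at the training points
parvsnonparplot(lmout,knnout)              # parametric vs. k-NN fitted values
nonparvsxplot(knnout,lmout)                # k-NN fitted values vs. each predictor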
The functions l2 and l1 are used to define L2 and L1 loss.
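For example, assuming (as the names suggest) squared-error and absolute-error loss; check the package source for the exact convention, e.g. per-point versus mean loss:
ytrue <- c(1.5,1.8); yhat <- c(1.0,2.0)
l2(ytrue,yhat)   # L2 (squared-error) loss
l1(ytrue,yhat)   # L1 (absolute-error) loss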
Author(s)
Norm Matloff
Examples
x <- rbind(c(1,0),c(2,5),c(0,5),c(3,3),c(6,3))
y <- c(8,3,10,11,4)
newx <- c(0,0)
kNN(x,y,newx,2,scaleX=FALSE)
# $whichClosest
# [,1] [,2]
# [1,] 1 4
# $regests
# [1] 9.5
kNN(x,y,newx,3,scaleX=FALSE,smoothingFtn=loclin)$regests
# 7.307692
knnout <- kNN(x,y,newx,2,scaleX=FALSE)
knnout
# $whichClosest
# [,1] [,2]
# [1,] 1 4
# ...
## Not run:
data(mlb)
mlb <- mlb[,c(4,6,5)] # height, age, weight
# fit, then predict 75", age 21, and 72", age 32
knnout <- kNN(mlb[,1:2],mlb[,3],rbind(c(75,21),c(72,32)),25)
knnout$regests
# [1] 202.72 195.72
# fit now, predict later
knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25)
predict(knnout,c(70,28))
# [1] 186.48
data(peDumms)
names(peDumms)
ped <- peDumms[,c(1,20,22:27,29,31,32)]
names(ped)
# fit, and predict income of a 35-year-old man, MS degree, occupation 101,
# worked 50 weeks, using 25 nearest neighbors
kNN(ped[,-10],ped[,10],c(35,1,0,0,1,0,0,0,1,50),25)$regests
# [1] 67540
# fit, and predict occupation 101 for a 35-year-old man, MS degree,
# wage $55K, worked 50 weeks, using 25 nearest neighbors
z <- kNN(ped[,-c(4:8)],ped[,4],c(35,1,0,1,55,50),25,classif=TRUE)
z$regests
# [1] 0.16
z$ypreds
# [1] 0 ; class 0, i.e. not occupation 101; round(0.16) = 0,
# computed by user request, classif = TRUE
# the y argument must be either a vector (2-class setting) or a matrix
# (multiclass setting)
occs <- as.matrix(ped[, 4:8])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3 occupation 3, i.e. 102, is predicted
# predict occupation in general; let's bring occ.141 back in (was
# excluded as a predictor due to redundancy)
names(peDumms)
# [1] "age" "cit.1" "cit.2" "cit.3" "cit.4" "cit.5" "educ.1"
# [8] "educ.2" "educ.3" "educ.4" "educ.5" "educ.6" "educ.7" "educ.8"
# [15] "educ.9" "educ.10" "educ.11" "educ.12" "educ.13" "educ.14" "educ.15"
# [22] "educ.16" "occ.100" "occ.101" "occ.102" "occ.106" "occ.140" "occ.141"
# [29] "sex.1" "sex.2" "wageinc" "wkswrkd" "yrentry"
occs <- as.matrix(peDumms[,23:28])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3 prediction is occ.102
# baseline fit, no predictor weighting yet; use leave1out to avoid overfitting
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25341.6
# use of the weighted distance feature; deweight age by a factor of 0.5,
# put increased weight on weeks worked, factor of 1.5
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,
expandVars=c(1,10),expandVals=c(0.5,1.5),leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25196.61
## End(Not run)