yai {yaImpute}		R Documentation
Find K nearest neighbors
Description
Given a set of observations, yai
(1) separates the observations into reference and target observations,
(2) applies the specified method to project X-variables into a Euclidean space (not always; see argument method), and
(3) finds the k nearest neighbors within the reference observations and between the reference and target observations.
An alternative method using randomForest classification and regression trees is provided for steps 2 and 3.
Target observations are those with values for X-variables but not for Y-variables, while reference observations are those
with no missing values for X- and Y-variables (see Details for the exception).
Usage
yai(x = NULL, y = NULL, data = NULL, k = 1, noTrgs = FALSE, noRefs = FALSE,
    nVec = NULL, pVal = 0.05, method = "msn", ann = TRUE, mtry = NULL, ntree = 500,
    rfMode = "buildClasses", bootstrap = FALSE, ppControl = NULL, sampleVars = NULL,
    rfXsubsets = NULL)
Arguments
x: 1) a matrix or data frame containing the X-variables for all observations, with row names being the identification for the observations, or 2) a one-sided formula defining the X-variables as a linear formula. If a formula is coded for x, a formula must also be used for y (when y is present).

y: 1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula.

data: when x and y are formulas, a data frame or matrix that contains all the variables referenced by the formulas, with row names being the identification for the observations.

k: the number of nearest neighbors; default is 1.

noTrgs: when TRUE, skip finding neighbors for target observations.

noRefs: when TRUE, skip finding neighbors for reference observations.

nVec: number of canonical vectors to use (methods msn and msn2).

pVal: significance level for canonical vectors; used when method is msn or msn2.

method: the strategy used for computing distance and therefore for finding neighbors; the options are quoted key words (see Details): euclidean, raw, mahalanobis, ica, msn (the default), msn2, msnPP, gnn, random, randomForest, and gower.

ann: TRUE if the approximate nearest neighbor search (ann) is used to find neighbors; FALSE if an exact (slower) search is used.

mtry: the number of X-variables picked at random when method is randomForest.

ntree: the number of classification and regression trees when method is randomForest.

rfMode: when buildClasses and method is randomForest, continuous Y-variables are internally converted to classes, forcing randomForest to build classification trees; otherwise regression trees are built.

bootstrap: if TRUE, the reference observations are sampled with replacement (see Details).

ppControl: used to control how canonical correlation analysis via projection pursuit is done; see Details.

sampleVars: the X- and/or Y-variables will be sampled (without replacement) if this is not NULL and greater than zero. If specified as a single unnamed value, that value controls the sample size of both X- and Y-variables; if two unnamed values, the first is taken for the X-variables and the second for the Y-variables. If zero, no sampling is done. Otherwise, values less than 1.0 are taken as a proportion of the number of variables, and values greater than or equal to 1 are the number of variables to include in the sample. Specifying a large number causes the sequence of variables to be randomized.

rfXsubsets: a named list of character vectors, one vector for each Y-variable (see Details); only applies when method="randomForest".
Details
See the paper at doi:10.18637/jss.v023.i10 (it includes examples).
The following information is in addition to the content in the paper.
You need not have any Y-variables to run yai for the following methods:
euclidean, raw, mahalanobis, ica, random, and randomForest
(in which case unsupervised classification is performed).
However, yai normally classifies reference observations as those with no
missing values for X- and Y-variables, and target observations as those
with values for X-variables and missing data for Y-variables.
When y is NULL (there are no Y-variables), all the observations are
considered references. See newtargets for an example of how to use yai
in this situation.
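The no-Y case described above can be sketched as follows (a minimal illustration using the iris data, as in the Examples section below; it is not one of the package's own examples):

```r
# Sketch: with no Y-variables, every observation becomes a reference and
# an unsupervised analysis is performed.
require(yaImpute)
data(iris)
x <- iris[, 1:4]                             # X-variables only; no y supplied
unsup <- yai(x = x, method = "mahalanobis")  # all observations are references
```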
When bootstrap=TRUE, the reference observations are sampled with replacement.
The sample size is set to the number of reference observations. Normally,
about a third of the reference observations are left out of the sample;
they are often called out-of-bag samples. The out-of-bag observations are
then treated as targets.
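The out-of-bag behavior above can be sketched as follows (a minimal illustration on the iris data, not from the package's own examples):

```r
# Sketch: bootstrap=TRUE samples the references with replacement; rows left
# out of the sample (out-of-bag) are then treated as targets.
require(yaImpute)
data(iris)
set.seed(1)
x <- iris[, 1:2]
y <- iris[, 3:4]
boot <- yai(x = x, y = y, method = "mahalanobis", bootstrap = TRUE)
length(boot$trgRows)   # roughly a third of the rows end up as targets
```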
When method="msnPP", projection pursuit from ccaPP is used. The method is
further controlled using argument ppControl to specify a character vector
that has two named components:

method - one of "spearman", "kendall", "quadrant", "M", or "pearson"; the default is "spearman".
search - if "data" or "proj", then ccaProj is used; otherwise the default ccaGrid is used.
Here are some details on argument rfXsubsets. When method="randomForest",
one call to randomForest is generated for each Y-variable. When argument
rfXsubsets is left NULL, all the X-variables are used for each of the
Y-variables. However, sometimes better results can be achieved by using
specific subsets of X-variables for each Y-variable. This is done by
setting rfXsubsets equal to a named list of character vectors. The names
correspond to the Y-variable names and the character vectors hold the list
of X-variables for the corresponding Y-variable.
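The rfXsubsets mechanism can be sketched as follows (a minimal illustration using the iris variables from the Examples section below; the subsets chosen here are arbitrary):

```r
# Sketch: one X-variable subset per Y-variable, keyed by Y-variable name.
require(yaImpute)
require(randomForest)
data(iris)
set.seed(1)
refs <- sample(rownames(iris), 50)
x <- iris[, 1:2]        # Sepal.Length, Sepal.Width
y <- iris[refs, 3:4]    # Petal.Length, Petal.Width
sub <- list(Petal.Length = c("Sepal.Length", "Sepal.Width"),
            Petal.Width  = c("Sepal.Width"))
rfsub <- yai(x = x, y = y, method = "randomForest", rfXsubsets = sub)
```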
Value
An object of class yai, which is a list with the following tags:

call: the call.

yRefs, xRefs: matrices of the X- and Y-variables for just the reference observations (unscaled). The scale factors are attached as attributes.

obsDropped: a list of the row names for observations dropped for various reasons (missing data).

trgRows: a list of the row names for target observations as a subset of all observations.

xall: the X-variables for all observations.

cancor: returned from the cancor function when method msn or msn2 is used (NULL otherwise).

ccaVegan: an object of class cca (from package vegan) when method gnn is used.

ftest: a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method msn or msn2 is used (NULL otherwise).

yScale, xScale: scale data used on yRefs and xRefs, as needed.

k: the value of k.

pVal: as input; only used when method is msn or msn2.

projector: NULL when not used; otherwise the matrix used to project the X-variables into the space where distances are computed.

nVec: number of canonical vectors used (methods msn and msn2).

method: as input; the method used.

ranForest: a list of the forests, one for each Y-variable, if method randomForest is used (NULL otherwise).

ICA: a list of information from the independent component analysis when method ica is used.

ann: the value of ann, as input.

xlevels: NULL if no factors are used as predictors; otherwise a list of predictors that have factors and their levels.

neiDstTrgs: a matrix of distances between a target (identified by its row name) and the k references; there are k columns.

neiIdsTrgs: a matrix of reference identifications that correspond to neiDstTrgs.

neiDstRefs, neiIdsRefs: counterparts for references.

bootstrap: a vector of reference row names that constitute the bootstrap sample, or FALSE when bootstrap=FALSE.
Author(s)
Nicholas L. Crookston ncrookston.fs@gmail.com
John Coulston jcoulston@fs.usda.gov
Andrew O. Finley finleya@msu.edu
See Also
newtargets, grmsd, impute, compare.yai, whatsMax
Examples
require(yaImpute)
data(iris)
# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)
# form some test data, y's are defined only for reference
# observations.
refs <- sample(rownames(iris),50)
x <- iris[,1:2] # Sepal.Length Sepal.Width
y <- iris[refs,3:4] # Petal.Length Petal.Width
# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")
# compare these results using the generalized mean distances. mal wins!
grmsd(mal,msn)
# use projection pursuit and specify ppControl (loads package ccaPP)
if (require(ccaPP))
{
msnPP <- yai(x=x,y=y,method="msnPP",ppControl=c(method="kendall",search="proj"))
grmsd(mal,msnPP,msn)
}
#############
data(MoscowMtStJoe)
# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).
polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01 # slope proportion
polar[,2] <- polar[,2]*(pi/180) # aspect radians
cartesian <- t(apply(polar,1,function (x)
{return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
colnames(cartesian) <- c("xSlAsp","ySlAsp")
x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]
msn <- yai(x=x, y=y, method="msn", k=1)
mal <- yai(x=x, y=y, method="mahalanobis", k=1)
# the results can be plotted.
plot(mal,vars=yvars(mal)[1:16])
# compare these results using the generalized mean distances.
grmsd(mal,msn)
# try method="gower"
if (require(gower))
{
gow <- yai(x=x, y=y, method="gower", k=1)
# compare these results using the generalized mean distances.
grmsd(mal,msn,gow)
}
# try method="randomForest"
if (require(randomForest))
{
# reduce the plant community data for randomForest.
yba <- MoscowMtStJoe[,1:17]
ybaB <- whatsMax(yba,nbig=7) # see help on whatsMax
rf <- yai(x=x, y=ybaB, method="randomForest", k=1)
# build the imputations for the original y's
rforig <- impute(rf,ancillaryData=y)
# compare the results using individual rmsd's
compare.yai(mal,msn,rforig)
plot(compare.yai(mal,msn,rforig))
# build another randomForest case forcing regression
# to be used for continuous variables. The answers differ
# but one is not clearly better than the other.
rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
rforig2 <- impute(rf2,ancillaryData=y)
compare.yai(rforig2,rforig)
}