yai {yaImpute}		R Documentation
Find K nearest neighbors
Description
Given a set of observations, yai
(1) separates the observations into reference and target observations,
(2) applies the specified method to project X-variables into a Euclidean space (not always; see argument method), and
(3) finds the k nearest neighbors within the reference observations and between the reference and target observations.
An alternative method using randomForest classification and regression trees is provided for steps 2 and 3.
Target observations are those with values for X-variables but not for Y-variables, while reference observations are those
with no missing values for X- and Y-variables (see Details for the exception).
Usage
yai(x = NULL, y = NULL, data = NULL, k = 1, noTrgs = FALSE, noRefs = FALSE,
    nVec = NULL, pVal = 0.05, method = "msn", ann = TRUE, mtry = NULL, ntree = 500,
    rfMode = "buildClasses", bootstrap = FALSE, ppControl = NULL, sampleVars = NULL,
    rfXsubsets = NULL)
Arguments
x: 1) a matrix or data frame containing the X-variables for all observations, with row names being the identification for the observations, or 2) a one-sided formula defining the X-variables as a linear formula. If a formula is coded for x, a formula must also be used for y (when y is present).

y: 1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula.

data: when x and y are formulas, a data frame or matrix that contains all the variables referenced by the formulas, with row names being the identification for the observations.

k: the number of nearest neighbors; default is 1.

noTrgs: when TRUE, skip finding neighbors for target observations.

noRefs: when TRUE, skip finding neighbors for reference observations.

nVec: number of canonical vectors to use (methods msn and msn2).

pVal: significance level for canonical vectors; used when method is msn or msn2.

method: the strategy used for computing distance and therefore for finding neighbors; the options are quoted key words (see Details): euclidean, raw, mahalanobis, ica, msn (the default), msn2, msnPP, gnn, random, randomForest, and gower.

ann: TRUE if the approximate nearest neighbor search (ann) is used to find neighbors; FALSE if an exact (slower) search is used.

mtry: the number of X-variables picked at random when method is randomForest.

ntree: the number of classification and regression trees when method is randomForest.

rfMode: when buildClasses and method is randomForest, continuous Y-variables are internally converted to classes, forcing randomForest to build classification trees; otherwise regression trees are built.

bootstrap: if TRUE, the reference observations are sampled with replacement (see Details).

ppControl: used to control how canonical correlation analysis via projection pursuit is done; see Details.

sampleVars: the X- and/or Y-variables will be sampled (without replacement) if this is not NULL and greater than zero. If specified as a single unnamed value, that value controls the sample size of both X- and Y-variables; if two unnamed values, the first is taken for the X-variables and the second for the Y-variables. If zero, no sampling is done. Otherwise, values less than 1.0 are taken as a proportion of the number of variables, and values greater than or equal to 1 are the number of variables to include in the sample. Specifying a large number causes the sequence of variables to be randomized.

rfXsubsets: a named list of character vectors, one vector for each Y-variable (see Details); only applies when method="randomForest".
Details
See the paper at doi:10.18637/jss.v023.i10 (it includes examples).
The following information is in addition to the content in the paper.
You need not have any Y-variables to run yai for the following methods:
euclidean, raw, mahalanobis, ica, random, and randomForest
(in which case unsupervised classification is performed).
However, yai normally classifies reference observations as those with no
missing values for X- and Y-variables, and target observations as those
with values for X-variables and missing data for Y-variables.
When y is NULL (there are no Y-variables), all the observations are
considered references. See newtargets for an example of how to use yai
in this situation.
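The no-Y case described above can be sketched as follows (a minimal illustration using the iris data, as in the Examples section below; it is not one of the package's own examples):

```r
# Sketch: with no Y-variables, every observation becomes a reference and
# an unsupervised analysis is performed.
require(yaImpute)
data(iris)
x <- iris[, 1:4]                             # X-variables only; no y supplied
unsup <- yai(x = x, method = "mahalanobis")  # all observations are references
```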
When bootstrap=TRUE, the reference observations are sampled with replacement.
The sample size is set to the number of reference observations. Normally,
about a third of the reference observations are left out of the sample;
they are often called out-of-bag samples. The out-of-bag observations are
then treated as targets.
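The out-of-bag behavior above can be sketched as follows (a minimal illustration on the iris data, not from the package's own examples):

```r
# Sketch: bootstrap=TRUE samples the references with replacement; rows left
# out of the sample (out-of-bag) are then treated as targets.
require(yaImpute)
data(iris)
set.seed(1)
x <- iris[, 1:2]
y <- iris[, 3:4]
boot <- yai(x = x, y = y, method = "mahalanobis", bootstrap = TRUE)
length(boot$trgRows)   # roughly a third of the rows end up as targets
```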
When method="msnPP", projection pursuit from ccaPP is used. The method is
further controlled using argument ppControl to specify a character vector
that has two named components:

method - one of "spearman", "kendall", "quadrant", "M", or "pearson"; the default is "spearman".
search - if "data" or "proj", then ccaProj is used; otherwise the default ccaGrid is used.
Here are some details on argument rfXsubsets. When method="randomForest",
one call to randomForest is generated for each Y-variable. When argument
rfXsubsets is left NULL, all the X-variables are used for each of the
Y-variables. However, sometimes better results can be achieved by using
specific subsets of X-variables for each Y-variable. This is done by
setting rfXsubsets equal to a named list of character vectors. The names
correspond to the Y-variable names and the character vectors hold the list
of X-variables for the corresponding Y-variable.
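The rfXsubsets mechanism can be sketched as follows (a minimal illustration using the iris variables from the Examples section below; the subsets chosen here are arbitrary):

```r
# Sketch: one X-variable subset per Y-variable, keyed by Y-variable name.
require(yaImpute)
require(randomForest)
data(iris)
set.seed(1)
refs <- sample(rownames(iris), 50)
x <- iris[, 1:2]        # Sepal.Length, Sepal.Width
y <- iris[refs, 3:4]    # Petal.Length, Petal.Width
sub <- list(Petal.Length = c("Sepal.Length", "Sepal.Width"),
            Petal.Width  = c("Sepal.Width"))
rfsub <- yai(x = x, y = y, method = "randomForest", rfXsubsets = sub)
```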
Value
An object of class yai, which is a list with the following tags:

call: the call.

yRefs, xRefs: matrices of the X- and Y-variables for just the reference observations (unscaled). The scale factors are attached as attributes.

obsDropped: a list of the row names for observations dropped for various reasons (missing data).

trgRows: a list of the row names for target observations as a subset of all observations.

xall: the X-variables for all observations.

cancor: returned from the cancor function when method msn or msn2 is used (NULL otherwise).

ccaVegan: an object of class cca (from package vegan) when method gnn is used.

ftest: a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method msn or msn2 is used (NULL otherwise).

yScale, xScale: scale data used on yRefs and xRefs, as needed.

k: the value of k.

pVal: as input; only used when method is msn or msn2.

projector: NULL when not used; otherwise the matrix used to project the X-variables into the space where distances are computed.

nVec: number of canonical vectors used (methods msn and msn2).

method: as input; the method used.

ranForest: a list of the forests, one for each Y-variable, if method randomForest is used (NULL otherwise).

ICA: a list of information from the independent component analysis when method ica is used.

ann: the value of ann, as input.

xlevels: NULL if no factors are used as predictors; otherwise a list of predictors that have factors and their levels.

neiDstTrgs: a matrix of distances between a target (identified by its row name) and the k references; there are k columns.

neiIdsTrgs: a matrix of reference identifications that correspond to neiDstTrgs.

neiDstRefs, neiIdsRefs: counterparts for references.

bootstrap: a vector of reference row names that constitute the bootstrap sample, or FALSE when bootstrap=FALSE.
Author(s)
Nicholas L. Crookston ncrookston.fs@gmail.com
John Coulston jcoulston@fs.usda.gov
Andrew O. Finley finleya@msu.edu
See Also
newtargets, grmsd, impute, compare.yai, whatsMax
Examples
require(yaImpute)
data(iris)
# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)
# form some test data, y's are defined only for reference
# observations.
refs <- sample(rownames(iris),50)
x <- iris[,1:2] # Sepal.Length Sepal.Width
y <- iris[refs,3:4] # Petal.Length Petal.Width
# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")
# compare these results using the generalized mean distances. mal wins!
grmsd(mal,msn)
# use projection pursuit and specify ppControl (loads package ccaPP)
if (require(ccaPP))
{
msnPP <- yai(x=x,y=y,method="msnPP",ppControl=c(method="kendall",search="proj"))
grmsd(mal,msnPP,msn)
}
#############
data(MoscowMtStJoe)
# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).
polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01 # slope proportion
polar[,2] <- polar[,2]*(pi/180) # aspect radians
cartesian <- t(apply(polar,1,function (x)
{return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
colnames(cartesian) <- c("xSlAsp","ySlAsp")
x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]
msn <- yai(x=x, y=y, method="msn", k=1)
mal <- yai(x=x, y=y, method="mahalanobis", k=1)
# the results can be plotted.
plot(mal,vars=yvars(mal)[1:16])
# compare these results using the generalized mean distances.
grmsd(mal,msn)
# try method="gower"
if (require(gower))
{
gow <- yai(x=x, y=y, method="gower", k=1)
# compare these results using the generalized mean distances.
grmsd(mal,msn,gow)
}
# try method="randomForest"
if (require(randomForest))
{
# reduce the plant community data for randomForest.
yba <- MoscowMtStJoe[,1:17]
ybaB <- whatsMax(yba,nbig=7) # see help on whatsMax
rf <- yai(x=x, y=ybaB, method="randomForest", k=1)
# build the imputations for the original y's
rforig <- impute(rf,ancillaryData=y)
# compare the results using individual rmsd's
compare.yai(mal,msn,rforig)
plot(compare.yai(mal,msn,rforig))
# build another randomForest case forcing regression
# to be used for continuous variables. The answers differ
# but one is not clearly better than the other.
rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
rforig2 <- impute(rf2,ancillaryData=y)
compare.yai(rforig2,rforig)
}