grmsd {yaImpute} | R Documentation |
Generalized Root Mean Square Distance Between Observed and Imputed Values
Description
Computes the root mean square distance between predicted and corresponding
observed values in an orthogonal multivariate space. This value is the mean
Mahalanobis distance between observed and imputed values in a space defined by
observations and variables where observed and predicted values are defined.
The statistic provides a way to compare imputation (or prediction) results.
While it is designed to work with imputation, the function can be used with objects
that inherit from lm or with matrices and data frames that
follow the column naming convention described in the Details.
Usage
grmsd(...,ancillaryData=NULL,vars=NULL,wts=NULL,rtnVectors=FALSE,imputeMethod="closest")
Arguments
... |
objects created by any combination of
|
ancillaryData |
a data frame that defines variables, passed to impute.yai
|
vars |
a list of variable names you want to include; if NULL all available
variables are included (note that if impute.yai the
Y-variables are returned when |
wts |
A vector of weights used to compute the mean distances, see details below. |
rtnVectors |
If TRUE, the vectors of individual distances are returned (see Value) rather than the mean Mahalanobis distance. |
imputeMethod |
passed as |
Details
This function is designed to compute the root mean square distance between observed
and predicted observations over several variables at once. It is the Mahalanobis
distance between observed and predicted but the name emphasizes the similarities
to root mean square difference (or error, see rmsd).
Here are some notable characteristics.
In the univariate case this function returns the same value as rmsd with scale=TRUE. In that case the root mean square difference is computed after scale has been called on the variable.
Like rmsd, grmsd is zero if the imputed values are exactly the same as the observed values over all variables.
Like rmsd, grmsd is ~1.0 when the mean of each variable is imputed in place of a near neighbor (it would be exactly 1.0 if the maximum likelihood estimate of the covariance were used rather than the unbiased estimate; it approaches 1 as n gets large). This situation corresponds to regression where the slope is zero. It indicates that the imputation error is, overall, the same as it would be if the means of the variables were imputed rather than near neighbors (it does not signal that the means were imputed).
Like rmsd, values of grmsd > 1.0 indicate that, on average, the errors in the imputation are greater than they would be if the mean of the corresponding variables were imputed for each observation.
Note that individual rmsd values can be computed even when the variance of the variable is zero. In contrast, grmsd can only be computed in the situation where the observed data matrix is full rank. Rank is determined using qr and columns are removed from the analysis to create this condition if necessary (with a warning).
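The characteristics above can be checked with a minimal base R sketch. It assumes the statistic is the root of the mean squared difference between observed and predicted values after both are transformed by the inverse Cholesky factor of the covariance of the observed values. The helper grmsd_sketch is hypothetical, is not part of yaImpute, and may differ in detail from the package code.

```r
# Hypothetical helper, not part of yaImpute: root mean square distance
# between observed and predicted values after whitening with cov(obs).
grmsd_sketch <- function(obs, pred) {
  obs  <- as.matrix(obs)
  pred <- as.matrix(pred)
  # p maps the observed variables to an orthogonal space: cov(obs %*% p) = I
  p <- solve(chol(cov(obs)))
  d <- (obs - pred) %*% p
  sqrt(mean(d^2))
}

obs <- iris[, c("Petal.Length", "Petal.Width")]

# identical predictions give zero
same <- grmsd_sketch(obs, obs)

# imputing the column means gives sqrt((n - 1) / n), which is close to 1
means   <- matrix(colMeans(obs), nrow(obs), ncol(obs), byrow = TRUE)
at_mean <- grmsd_sketch(obs, means)

# univariate case: root mean square difference divided by sd(observed),
# which is what scaling the variable first amounts to
o   <- iris$Petal.Length
p1  <- iris$Sepal.Length   # stand-in "prediction", for illustration only
uni <- grmsd_sketch(o, p1)
uni_scaled <- sqrt(mean((o - p1)^2)) / sd(o)
```

Under these assumptions at_mean is slightly below 1 because cov uses the unbiased (n - 1) estimate, which agrees with the note on the maximum likelihood estimate.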
Observed and predicted are matched using the column names. Column names
that have ".o" are matched to those that do not. Columns that do not
have matching observed and imputed (predicted) values are ignored.
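The matching rule can be illustrated in base R. This sketch assumes ".o" is a suffix marking the observed column, as in Petal.Length.o; the helper match_obs_pred is hypothetical and is not a yaImpute function.

```r
# Hypothetical helper, not part of yaImpute: pair "<name>.o" columns with
# "<name>" columns and drop anything without a partner.
match_obs_pred <- function(x) {
  cn <- colnames(x)
  obs_names  <- grep("\\.o$", cn, value = TRUE)
  pred_names <- sub("\\.o$", "", obs_names)
  keep <- pred_names %in% cn
  list(observed = obs_names[keep], predicted = pred_names[keep])
}

x <- data.frame(Petal.Length   = c(1.4, 4.7),
                Petal.Length.o = c(1.5, 4.5),
                Petal.Width.o  = c(0.2, 1.4),  # no predicted partner: ignored
                Sepal.Width    = c(3.5, 3.2))  # no observed partner: ignored
pairs <- match_obs_pred(x)
```

Only the Petal.Length pair survives in this example, so a distance would be computed on that variable alone.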
Several objects may be passed as "...". Function impute.yai is
called for any objects that were created by yai;
ancillaryData and vars are passed to impute.yai when it is used.
When objects inherit from lm, a suitable matrix is formed
by calling the predict and resid functions.
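One way such a matrix could be assembled from an lm fit is sketched below in base R. It relies only on the identity observed = fitted + residual; the object name lm_matrix and the exact column layout are assumptions, not the package's internal code.

```r
# Multivariate linear model fitted to the iris data
fit <- lm(cbind(Petal.Length, Petal.Width) ~ Sepal.Length + Sepal.Width,
          data = iris)

pred <- predict(fit)        # fitted values, one column per response
obs  <- pred + resid(fit)   # observed = fitted + residual

# follow the naming convention: observed columns carry the ".o" suffix
colnames(obs) <- paste0(colnames(pred), ".o")
lm_matrix <- cbind(pred, obs)
```

A matrix laid out this way follows the column naming convention, so it could also be passed to grmsd directly in place of the lm object.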
Factors, if found, are removed (with a warning).
When argument wts is defined there must be one value for each pair of
observed and predicted variables. If the values are named (preferred), then
the names are matched to the names of predicted variables (no .o suffix).
The matched values effectively scale the axes in which distances are computed.
When this is done, the resulting distances are not Mahalanobis distances.
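The name matching for wts can be sketched in base R as follows. How the package applies the matched weights internally is not stated here, so the scaling step is an assumption made only to show the effect: each axis of the transformed difference matrix is multiplied by its weight.

```r
pred_names <- c("Petal.Length", "Petal.Width")
wts <- c(Petal.Width = 2, Petal.Length = 1)   # named, in arbitrary order

# one weight per observed/predicted pair, aligned by name
stopifnot(length(wts) == length(pred_names), all(pred_names %in% names(wts)))
w <- wts[pred_names]

# differences between observed values and column means in the orthogonal space
obs  <- as.matrix(iris[, pred_names])
pred <- matrix(colMeans(obs), nrow(obs), ncol(obs), byrow = TRUE)
d <- (obs - pred) %*% solve(chol(cov(obs)))

# Assumption: each axis of the difference matrix is multiplied by its weight
unweighted <- sqrt(mean(d^2))
weighted   <- sqrt(mean(sweep(d, 2, w, `*`)^2))
unit       <- sqrt(mean(sweep(d, 2, c(1, 1), `*`)^2))
```

With unit weights the result equals the unweighted value; any other weights stretch the axes, which is why the resulting distances are no longer Mahalanobis distances.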
Value
When rtnVectors=FALSE, a sorted named vector of mean distances
is returned; the names are taken from the arguments.
When rtnVectors=TRUE the function returns vectors of distances, sorted and
named as done when this argument is FALSE.
Author(s)
Nicholas L. Crookston ncrookston.fs@gmail.com
See Also
yai, impute.yai, rmsd.yai, notablyDifferent
Examples
require(yaImpute)
data(iris)
set.seed(12345)
# form some test data
refs=sample(rownames(iris),50)
x <- iris[,1:2] # Sepal.Length Sepal.Width
y <- iris[refs,3:4] # Petal.Length Petal.Width
# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")
# compute the average distances between observed and imputed (predicted)
grmsd(msn,mal,lmFit=lm(as.matrix(y) ~ ., data=x[refs,]))
# use all the variables and observations in iris
# Species is a factor and is automatically deleted with a warning
grmsd(msn,mal,ancillaryData=iris)
# here is an example using lm, and another using column
# means as predictions.
impMean <- y
colnames(impMean) <- paste0(colnames(impMean),".o")
impMean <- cbind(impMean,y)
# set the predictions to the means of the variables
impMean[,"Petal.Length"] <- mean(impMean[,"Petal.Length"])
impMean[,"Petal.Width"] <- mean(impMean[,"Petal.Width"])
grmsd(msn, mal, lmFit=lm(as.matrix(y) ~ ., data=x[refs,]), impMean )
# compare to using function rmsd (values match):
msnimp <- na.omit(impute(msn))
grmsd(msnimp[,c("Petal.Length","Petal.Length.o")])
rmsd(msnimp[,c("Petal.Length","Petal.Length.o")],scale=TRUE)
# these are multivariate cases and they don't match
# because the covariance of the two variables is > 0.
grmsd(msnimp)
colSums(rmsd(msnimp,scale=TRUE))/2
# get the vectors and make a boxplot, identify outliers
stats <- boxplot(grmsd(msn,mal,ancillaryData=iris[,-5],rtnVectors=TRUE),
ylab="Mahalanobis distance")
stats$out
# 118 132
#2.231373 1.990961