regdiag {freqparcoord} | R Documentation |
Diagnosing regression model fit using parallel coordinates.
Description
Performs parametric regression model fit diagnostics, based on freqparcoord. One axis is the "divergences," the differences between the parametric and nonparametric estimates of the population regression function, while the other axes are the predictor variables. Note that the divergences are NOT the parametric model residuals, i.e. the differences between fitted model values and response ("Y") values.
The question addressed is, "In what regions of predictor space is the parametric fit poorer?" To answer that, the divergences are grouped into upper and lower tails; e.g. if tail is set to 0.10, we find the data points that have divergences in the lower and upper 10%, then plot both groups, as well as the middle.
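The tail grouping can be sketched in base R as follows. This is only an illustration under stated assumptions: the divergences are simulated, the names div, tailprop and grp are hypothetical, and the package's own cutoff rule may differ in detail.

```r
# Simulated divergences; in practice these come from the two fitted models.
set.seed(1)
div <- rnorm(200)
tailprop <- 0.10   # plays the role of the 'tail' argument

# Cutoffs for the lower and upper tails of the divergences.
cuts <- quantile(div, probs = c(tailprop, 1 - tailprop))

# Label each observation as belonging to the lower tail, middle, or upper tail.
grp <- ifelse(div <= cuts[[1]], "lower",
       ifelse(div >= cuts[[2]], "upper", "middle"))
table(grp)
```

Each of the three groups would then be plotted separately in the parallel coordinates display.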
The parallel coordinates plot can then be used to identify regions in which the parametric model tends to either under- or overpredict the response, suggesting that adding interaction or polynomial terms may improve the fit.
Furthermore, when an lm object is input to regdiag, the adjusted R-squared value for the parametric model and the R-squared value from the nonparametric fit are computed. If the nonparametric value is substantially larger than the parametric one, this indicates some deficiency in the parametric model, and thus provides quantitative information on whether including interaction and/or polynomial terms may be useful.
The term regression is used in the sense of conditional mean response given the predictors. Thus parametric classification models such as the logistic may also be used, with the regression function being the conditional probability of response = 1, given the predictors.
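As a minimal sketch of the classification case, the fitted values of a logistic model are estimates of the conditional probability of response = 1. The data below are simulated and the variable names are hypothetical; only base R's glm and fitted are used.

```r
# Simulated binary response whose success probability depends on x.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

# Logistic regression fit.
g <- glm(y ~ x, family = binomial)

# For a glm, fitted() returns values on the response scale,
# here the estimated P(Y = 1 | x) at each data point.
parest <- fitted(g)
range(parest)
```

A vector like parest is the kind of parametric estimate described under the parest argument.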
Usage
regdiag(regout, tail=0.10, k=NULL, m=5,
checkna = TRUE, cls = NULL, nchunks = length(cls))
regdiagbas(preds, resp, parest, tail=0.10, k=NULL, m=5,
checkna = TRUE, cls = NULL, nchunks = length(cls))
Arguments
regout | Output of lm or glm.
preds | Matrix of predictor values.
resp | Vector of response values.
parest | Parametric model estimates of the population regression function at the predictor data points.
tail | Proportion of most negative and most positive divergences to use in grouping.
k | See freqparcoord.
m | See freqparcoord.
checkna | See freqparcoord.
cls | See freqparcoord.
nchunks | See freqparcoord.
Details
The population regression function (including the case of a probability function in a classification problem) is estimated nonparametrically at the observation points, using knnreg.
The nonparametric estimates are subtracted from the parametric ones, yielding the divergences. A frequency-parallel coordinates plot is displayed as described above.
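The computation of the divergences can be illustrated with a rough base R sketch. The helper knn_mean below is hypothetical and is NOT the package's knnreg; it is a plain k-nearest-neighbor average used only to show the idea, and the data are simulated.

```r
# Simulated data: the true regression function is quadratic in x2.
set.seed(3)
n <- 300
x <- cbind(x1 = runif(n), x2 = runif(n))
y <- 1 + 2 * x[, "x1"] + 3 * x[, "x2"]^2 + rnorm(n, sd = 0.2)

# Parametric estimates from a linear model that omits the quadratic term.
parest <- fitted(lm(y ~ x))

# Hypothetical helper: average the responses of the k nearest neighbors
# (Euclidean distance on scaled predictors). Each point is included
# among its own neighbors.
knn_mean <- function(x, y, k) {
  d <- as.matrix(dist(scale(x)))
  apply(d, 1, function(row) mean(y[order(row)[seq_len(k)]]))
}
nonparest <- knn_mean(x, y, k = 25)

# Divergences: parametric estimates minus nonparametric estimates.
divergence <- parest - nonparest
summary(divergence)
```

Positive divergences mark points where the parametric estimate exceeds the nonparametric one; negative divergences mark the reverse.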
The R-squared values are available when an lm object is input, as noted earlier. The nonparametric R-squared value is calculated as the squared correlation between the estimated regression values and the response values.
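The comparison of the two R-squared values can be sketched as below. The data are simulated, the nearest-neighbor smoother is a crude stand-in rather than the package's method, and paramr2 and nonparamr2 are used here only as local variable names.

```r
# Simulated data with a quadratic regression function.
set.seed(4)
n <- 300
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.5)

# Adjusted R-squared of a (misspecified) linear model.
fit <- lm(y ~ x)
paramr2 <- summary(fit)$adj.r.squared

# Crude one-dimensional k-nearest-neighbor regression estimate.
k <- 25
nonparest <- sapply(x, function(x0) mean(y[order(abs(x - x0))[seq_len(k)]]))

# Nonparametric R-squared: squared correlation between estimate and response.
nonparamr2 <- cor(nonparest, y)^2

c(paramr2 = paramr2, nonparamr2 = nonparamr2)
```

In this simulated setting the nonparametric value comes out much larger than the parametric one, which is the pattern that points to a missing polynomial term.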
It is possible that in one of the tail groups the response value is constant, in which case an error message appears. If so, try a larger value of tail.
Value
An object of type "gg" (a ggplot2 object, displays when printed), with new components added:
The nonparametric regression estimates, in nonparest.
In the case of a linear model specified via regout, the adjusted R-squared value for the parametric model, in paramr2, and the R-squared value from the nonparametric fit, in nonparamr2. The latter is the squared correlation between predicted and actual response values.
Author(s)
Norm Matloff <matloff@cs.ucdavis.edu> and Yingkang Xie <yingkang.xie@gmail.com>
Examples
data(mlb)
lmout <- lm(mlb$Weight ~ mlb$Height + mlb$Age)
p <- regdiag(lmout,0.10,k=50,m=25)
p
# taller, older players are overpredicted, with shorter, younger players
# underpredicted; suggests that adding quadratic terms for Height, Age
# may help in the tails
# let's compare the R-squared values
p$paramr2
p$nonparamr2
# not much difference (param. model a bit better), possibly due to
# small sample size
# doing it "the long way" (showing use without an lm/glm object)
parest <- lmout$fitted.values
regdiagbas(mlb[c("Height","Age")], mlb$Weight,parest,0.10,k=50,m=25)
data(prgeng)
pg <- prgeng
pg1 <- pg[pg$wageinc >= 40000 & pg$wkswrkd >= 48,]
l1 <- lm(wageinc ~ age+educ+sex,data=pg1)
p <- regdiag(l1)
p
p$paramr2
p$nonparamr2
# young men's wages underpredicted, older women overpredicted; both
# R-squared values low, but nonpar is about 27% higher, indicating room
# for improvement; interaction and polynomial terms may help
## Not run:
data(newadult)
g1 <- glm(gt50 ~ edu + age + gender + mar, data=newadult, family=binomial)
regdiag(g1)
# parametric model underpredicts older highly-educated married men,
# and overpredicts young female lesser-educated singles; might try adding
# interaction terms
## End(Not run)