heatmap.fit {heatmapFit} | R Documentation |
Heatmap Fit Statistic for Binary Dependent Variable Models
Description
Generates a fit plot for diagnosing misspecification in models of binary dependent variables, and calculates the related heatmap fit statistic (Esarey and Pierce, 2012).
Usage
heatmap.fit(y, pred, calc.boot = TRUE, reps = 1000, span.l = "aicc",
color = FALSE, compress.obs = TRUE, init.grid = 2000, ret.obs = FALSE,
legend = TRUE)
Arguments
y |
A vector of observations of the dependent variable (in {0,1}). |
pred |
A vector of model-predicted Pr(y = 1) corresponding to each element of y. |
calc.boot |
Calculate bootstrap-based p-values (default = TRUE). |
reps |
Number of bootstrap replicates to generate (default = 1000). |
span.l |
Bandwidth for the nonparametric fit between y and pred (default = "aicc"). |
color |
Whether the plot should be in color ( |
compress.obs |
Whether large data sets should be compressed by pre-binning to save computing time (default = TRUE). |
init.grid |
If |
ret.obs |
Return the one-tailed bootstrap p-value for each observation in y (default = FALSE). |
legend |
Print the legend on the heat map plot (the default, TRUE). |
Details
This function plots how closely the predicted probabilities from a binary dependent variable (BDV) model match the observed empirical probabilities of the BDV, in-sample or out-of-sample. For example, if a model predicts that Pr(y = 1) = k%, about k% of observations with this predicted probability should have y = 1.
Loess smoothing (with an automatically selected optimum bandwidth) is used to estimate empirical probabilities in the data set and to overcome sparseness of the data. Systematic deviations are distinguished from sampling variation by bootstrapping the distribution under the null hypothesis that the model is an accurate predictor, with p-values indicating the one-tailed proportion of bootstrap samples that are less extreme than the observed deviation.
The plot shows model-predicted probabilities on the x-axis and smoothed empirical probabilities on the y-axis, with a histogram indicating the location and frequency of observations. The ideal fit is a 45-degree line. The shading of the plotted line indicates the degree to which fit deviations are larger than expected from sampling variation alone.
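The procedure described above can be approximated in base R. The following is a minimal sketch under stated assumptions, not the code that heatmap.fit actually runs: the loess span is fixed at an arbitrary 0.75 instead of being selected by AICc, the helper smooth_fit is hypothetical, and the one-tailed p-value is taken as the smaller of the two tail shares, so that small values flag large deviations (as the 10% threshold used by the heatmap statistic implies).

```r
## Sketch only: illustrates the idea, not the heatmapFit internals.
set.seed(1)
n <- 1000
x <- runif(n)
y <- as.numeric(runif(n) < pnorm(2 * x - 1))
mod  <- glm(y ~ x, family = binomial(link = "probit"))
pred <- predict(mod, type = "response")

## Assumed fixed bandwidth; heatmap.fit chooses one automatically.
span <- 0.75

## Hypothetical helper: smoothed empirical Pr(y = 1) at each value of pred.
smooth_fit <- function(yy) {
  fitted(loess(yy ~ pred, span = span, degree = 1))
}

## Smoothed empirical probabilities for the observed data.
emp <- smooth_fit(y)

## Bootstrap under the null that pred is the true Pr(y = 1):
## simulate outcomes from pred and re-smooth each simulated sample.
reps <- 100
boot_emp <- replicate(reps, {
  y_star <- rbinom(n, size = 1, prob = pred)
  smooth_fit(y_star)
})  # n x reps matrix

## Assumed definition of the one-tailed p-value: the smaller tail share
## of bootstrap fits lying below the observed smoothed fit.
below <- rowMeans(boot_emp < emp)
p_one <- pmin(below, 1 - below)

## Calibration plot: the ideal fit is the 45-degree line.
ord <- order(pred)
plot(pred[ord], emp[ord], type = "l",
     xlab = "Model prediction, Pr(y = 1)",
     ylab = "Smoothed empirical Pr(y = 1)")
abline(0, 1, lty = 2)
```

Because the simulated model here is correctly specified, the smoothed line should stay close to the dashed 45-degree line.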
A summary statistic for fit (the "heatmap statistic") is also reported. This statistic is the proportion of the sample that lies in a region where the one-tailed p-value is less than or equal to 10%. Finding more than 20% of the data set in such a region is diagnostic of misspecification in the model.
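As a worked illustration of that rule, the statistic can be computed from a vector of per-observation one-tailed p-values. The values below are invented for the example and the variable names are hypothetical; only the 10% and 20% thresholds come from the text above.

```r
## Made-up one-tailed p-values for ten observations (illustration only).
p_one <- c(0.02, 0.30, 0.08, 0.45, 0.10, 0.25, 0.40, 0.15, 0.05, 0.33)

## Heatmap statistic: share of the sample with p-value <= 10%.
heatmap_stat <- mean(p_one <= 0.10)

## Rule of thumb from the text: more than 20% suggests misspecification.
flag_misspec <- heatmap_stat > 0.20

print(heatmap_stat)   # 0.4 (four of the ten values are <= 0.10)
print(flag_misspec)   # TRUE
```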
More details for the technique are given in Esarey and Pierce 2012, "Assessing Fit Quality and Testing for Misspecification in Binary Dependent Variable Models," Political Analysis 20(4): 480-500.
Value
If ret.obs = TRUE, a list with the element:
heatmap.obs.p |
The one-tailed bootstrap p-value corresponding to each observation in y. |
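A short sketch of how the returned element might be used, assuming the heatmapFit package is installed (the call is skipped otherwise). The data are simulated and the object names out and p_obs are arbitrary; reps is lowered to 100 only to keep the run short.

```r
## Simulated data from a correctly specified probit model.
set.seed(2468)
x <- runif(500)
y <- as.numeric(runif(500) < pnorm(2 * x - 1))
mod  <- glm(y ~ x, family = binomial(link = "probit"))
pred <- predict(mod, type = "response")

## Run only if the package is available.
if (requireNamespace("heatmapFit", quietly = TRUE)) {
  out <- heatmapFit::heatmap.fit(y, pred, reps = 100, ret.obs = TRUE)
  p_obs <- out$heatmap.obs.p   # one-tailed bootstrap p-values
  ## Share of observations with p-value <= 10%.
  print(mean(p_obs <= 0.10))
}
```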
Note
Code to calculate AICc and GCV written by Michael Friendly (http://tolstoy.newcastle.edu.au/R/help/05/11/15899.html).
Author(s)
Justin Esarey <justin@justinesarey.com>
Andrew Pierce <awpierc@emory.edu>
Jericho Du <jericho.du@gmail.com>
References
Esarey, Justin and Andrew Pierce (2012). "Assessing Fit Quality and Testing for Misspecification in Binary Dependent Variable Models." Political Analysis 20(4): 480-500. DOI:10.1093/pan/mps026.
Examples
## Not run:
## a correctly specified model
###############################
set.seed(123456)
x <- runif(20000)
y <- as.numeric( runif(20000) < pnorm(2*x - 1) )
mod <- glm( y ~ x, family=binomial(link="probit"))
pred <- predict(mod, type="response")
heatmap.fit(y, pred, reps=1000)
## out-of-sample prediction w/o bootstrap p-values
set.seed(654321)
x <- runif(1000)
y <- as.numeric( runif(1000) < pnorm(2*x - 1) )
pred <- predict(mod, type="response", newdata=data.frame(x))
heatmap.fit(y, pred, calc.boot=FALSE)
## a misspecified model
########################
set.seed(13579)
x <- runif(20000)
y <- as.numeric( runif(20000) < pnorm(sin(10*x)) )
mod <- glm( y ~ x, family=binomial(link="probit"))
pred <- predict(mod, type="response")
heatmap.fit(y, pred, reps=1000)
## Comparison with and without data compression
system.time(heatmap.fit(y, pred, reps=100))
system.time(heatmap.fit(y, pred, reps=100, compress.obs=FALSE))
## End(Not run)