R: Run Models through a Sieve to Filter out Dubious Fits

modelFilter {dwp}

R Documentation

Run Models through a Sieve to Filter out Dubious Fits

Description

A set of fitted models (ddArray) is filtered according to a set of criteria that test for high AIC, high-influence points, and plausibility of the tail probabilities of each fitted distribution. modelFilter will either auto-select the best model according to a set of pre-defined, objective criteria or will return all models that meet a set of user-defined or default criteria. A table of how the models score according to each criterion is printed to the console.

Usage

modelFilter(dmod, sieve = "default", quiet = FALSE)

Arguments

`dmod`	a `ddArray` object
`sieve`	a list of criteria for filtering and ordering models, the name of a pre-defined sieve (`"default"` or `"win"`), or `NULL` (see Details)
`quiet`	boolean to suppress (`quiet = TRUE`) or allow (`quiet = FALSE`) messages from `modelFilter`

Details

The criteria to test are entered in a list (sieve) with components:

$rtail = vector of probabilities that define checkpoints on distributions to avoid situations where a model that may fit well within the range of data is nonetheless implausible because it predicts a substantial probability of carcasses falling great distances from the nearest turbine. The default is to check whether or not a distribution predicts that less than 50% of carcasses fall within 80 meters, 90% within 120 meters, 95% within 150 meters, or 99% within 200 meters. Distributions that fall below any of these points (for example predicting only 42% within 80 meters or only 74% within 120 meters) fail the default rtail test. The format of the default for the test is $rtail = c(p80 = 0.5, p120 = 0.90, p150 = 0.95, p200 = 0.99). Users may override the default by using, for example, sieve = list(rtail = c(p80 = 0.8, p120 = 0.99, p150 = 0.99, p200 = 0.999)) in the argument list for a more stringent test or for a situation where turbines are small or winds are light. Alternatively, users may forego the test altogether by entering sieve = list(rtail = FALSE). If specific probabilities are provided, they must be in a vector of length 4 with names "p80" etc. as in the examples above.
$ltail = vector of probabilities that define checkpoints on distributions to avoid situations where the search radius is short and a distribution fits the limited data set well but crashes to zero just outside the search radius. The default is to check whether or not a distribution predicts that greater than 50% of carcasses fall within 20 meters or 90% within 50 meters. Distributions that pass above either of these checkpoints (for example predicting 61% of carcasses within 20 meters or 93% within 50 meters) are eliminated by the default ltail test. The format of the default for the test is $ltail = c(p20 = 0.5, p50 = 0.90). Users may override the default by using, for example, sieve = list(ltail = c(p20 = 0.6, p50 = 0.8)) in the argument list for a situation where it is known that carcasses beyond 50 meters are common.
$aic = a numeric scalar cutoff value for the models' delta AICc scores. Models with AICc scores exceeding the minimum AICc among all the fitted models by sieve$aic or more fail the test. The default value is 10. Users may override the default by using, for example, sieve = list(aic = 7) in the argument list to use a delta AICc score of 7 as the cutoff or may forego the test altogether by setting sieve = list(aic = FALSE).
$hin = TRUE or FALSE to test for high influence points, the presence of which casts doubt on the reliability of the model. The function treats a model as having high influence points if it has high leverage points, namely, points with h/(1 - h) > 2p/(n - 2p) (where h is leverage, p is the number of parameters in the model, and n is the search radius), with Cook's distance > 8/(n - 2p). The criteria for high influence points were adapted from the GLM diagnostics (glm.diag) in Brian Ripley's package boot. The test is perhaps most valuable in identifying distributions with high probability of carcasses landing well beyond what could reasonably be expected.
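
As an illustration of the hin criterion, the following is a minimal sketch in base R of how a leverage and Cook's distance rule of this form could be evaluated for a fitted GLM. It is not the code used by modelFilter. The helper name has_high_influence, the toy data, and the model formula are invented for the example. The sketch also assumes that n is the number of observations in the fit and that a point must meet both conditions to be flagged.

```r
# Hypothetical helper (not part of dwp): flag a fitted GLM that has at least
# one point that is both high-leverage and has a large Cook's distance.
has_high_influence <- function(fit) {
  h <- hatvalues(fit)          # leverage of each observation
  cd <- cooks.distance(fit)    # Cook's distance of each observation
  p <- length(coef(fit))       # number of parameters in the model
  n <- length(h)               # assumption: n = number of observations
  high_leverage <- h / (1 - h) > 2 * p / (n - 2 * p)
  high_cook <- cd > 8 / (n - 2 * p)
  any(high_leverage & high_cook)
}

# Toy carcass counts by distance, invented for illustration only.
toy <- data.frame(
  r = 1:20,
  ncarc = c(3, 5, 6, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0)
)
fit <- glm(ncarc ~ r + log(r), family = poisson, data = toy)
flagged <- has_high_influence(fit)
flagged
```

A model for which the helper returns TRUE would be one that this kind of test marks as dubious.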

Several choices of pre-defined sieves are available (or, as described above, users may define their own criteria):

sieve = "default"

The models are ordered by the following criteria:

extensibility
weight of right tail (discounting models that predict implausibly high proportions of carcasses beyond the search radius)
weight of the left tail (discounting models that predict implausibly high proportions of carcasses near the turbines)
AICc test (discounting models with delta AICc > 10)
high influence points (discounting models in which one or more of the data points exert a high influence on the fitted model, according to Ripley's GLM diagnostics package boot (glm.diag))
ranking by AICc

Precise definitions of the default sieve parameters are given in sieve_default.
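
The AICc test and the final ranking can be illustrated with a few lines of base R. This is only a sketch under stated assumptions: the AICc values and the names modelC and modelD are invented, and the pass rule follows the description of $aic above (a model fails when its delta AICc is the cutoff or more).

```r
# Hypothetical AICc scores for four fitted models (values invented).
aicc <- c(tnormal = 412.3, lognormal = 414.0, modelC = 419.8, modelD = 431.5)
cutoff <- 10                       # default value of sieve$aic

delta <- aicc - min(aicc)          # delta AICc relative to the best model
aic_test <- delta < cutoff         # fails when delta is `cutoff` or more
ranking <- names(sort(delta))      # ranking by AICc, best model first

data.frame(deltaAICc = delta, aic_test = aic_test)
```

With these made-up scores only modelD fails the default test. A cutoff of 7 would also remove modelC.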

sieve = NULL

Returns a list of the extensible models without scoring them by other model selection criteria.

sieve = "win"

Sorts models by high-influence points and AICc.

sieve = list(<custom>)

User provides a custom sieve, which may be a modification of the default sieve or de novo. To modify the default, use, for example, sieve = list(hin = FALSE) to disable the hin test but keep the other default tests, or sieve = list(aic = 7) to use 7 rather than 10 as the AIC cutoff, or sieve = list(ltail = c(p20 = 0.3, p50 = 0.8)) to use a more stringent left tail test that requires CDF graphs to pass below the points (20, 0.3) and (50, 0.8). Custom ltail and rtail parameters must match the formats of the default tests, but their probabilities may vary. To turn off the aic filter, use sieve = list(aic = Inf). To turn off the ltail filter, use sieve = list(ltail = c(p20 = 1, p50 = 1)). To turn off the rtail filter, use sieve = list(rtail = c(p80 = 0, p120 = 0, p150 = 0, p200 = 0)). These custom components may be mixed and matched as desired.
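
To make the tail checkpoints concrete, the sketch below compares a cumulative distribution function with rtail and ltail vectors in the format described above. It is a hedged illustration in base R, not the package's implementation: tail_checks is a hypothetical helper, and the three distributions are made up. It also shows why the "turn off" values given above let every distribution through.

```r
# Hypothetical helper (not part of dwp): compare a CDF with sieve checkpoints.
# `cdf` is any function giving P(carcass distance <= x) for x in meters.
tail_checks <- function(cdf,
                        rtail = c(p80 = 0.5, p120 = 0.90,
                                  p150 = 0.95, p200 = 0.99),
                        ltail = c(p20 = 0.5, p50 = 0.90)) {
  # Checkpoint distances are encoded in the names: "p80" -> 80, etc.
  dist_of <- function(x) as.numeric(sub("^p", "", names(x)))
  c(
    rtail = all(cdf(dist_of(rtail)) >= rtail),  # CDF must reach each point
    ltail = all(cdf(dist_of(ltail)) <= ltail)   # CDF must not exceed each point
  )
}

# Three made-up distance distributions, for illustration only.
plausible <- function(x) pgamma(x, shape = 2.5, scale = 18)
heavy_tail <- function(x) plnorm(x, meanlog = 4.5, sdlog = 1.2)
crashes <- function(x) pexp(x, rate = 1 / 10)

tail_checks(plausible)    # passes both default tests
tail_checks(heavy_tail)   # fails rtail: less than 50% within 80 meters
tail_checks(crashes)      # fails ltail: more than 50% within 20 meters

# Checkpoints that no CDF can fail, matching the "turn off" settings.
tail_checks(heavy_tail,
            rtail = c(p80 = 0, p120 = 0, p150 = 0, p200 = 0),
            ltail = c(p20 = 1, p50 = 1))
```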

Value

An fmod object, which is an unordered list of extensible models if sieve = NULL; otherwise, a list of class fmod with the following components:

$filtered

the selected dd object or a ddArray list of models that passed the tests

$scores

a matrix with all models tested (rownames = model names) and the results of each test (columns aic_test, rtail, ltail, hin, aic)

$sieve

the test criteria, stored in a list with

$aic_test = cutoff for AIC
$hin = boolean to indicate whether high influence points were considered
$rtail = numeric vector giving the probabilities that the right tail of the distribution must exceed at distances of 80, 120, 150, and 200 meters in order to pass
$ltail = numeric vector giving the probabilities that the left tail of the distribution must NOT exceed at distances of 20 and 50 meters in order to pass

$models

a list (ddArray object) of all models tested

$note

notes on the tests

When an fmod object is printed, only a small subset of the elements is shown. To see a full list of the elements, use names(x), where x is the name of the fmod return value. The elements can be extracted in the usual R way via, for example, x$sieve or x[["sieve"]].

Examples

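 # load the example layout and carcass data
 data(layout_simple)
 data(carcass_simple)
 # set up the site layout and rings, then add the carcasses
 sitedata <- initLayout(layout_simple)
 ringdata <- prepRing(sitedata)
 ringsWithCarcasses <- addCarcass(carcass_simple, data_ring = ringdata)
 # fit the candidate distance models and inspect them
 distanceModels <- ddFit(ringsWithCarcasses)
 stats(distanceModels)
 stats(distanceModels[["tnormal"]])
 stats(distanceModels[["lognormal"]])
 # run the fitted models through the default sieve
 modelFilter(distanceModels)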

[Package dwp version 1.1 Index]