R: Multivariate Adaptive Regression Splines

earth {earth}

R Documentation

Multivariate Adaptive Regression Splines

Description

Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".

See the package vignette “Notes on the earth package”.

Usage

## S3 method for class 'formula'
earth(formula = stop("no 'formula' argument"), data = NULL,
   weights = NULL, wp = NULL, subset = NULL,
   na.action = na.fail,
   pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
   keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
   nfold=0, ncross=1, stratify=TRUE,
   varmod.method = "none", varmod.exponent = 1,
   varmod.conv = 1, varmod.clamp = .1, varmod.minspan = -3,
   Scale.y = NULL, ...)

## Default S3 method:
earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
    weights = NULL, wp = NULL, subset = NULL,
    na.action = na.fail,
    pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
    keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
    nfold=0, ncross=1, stratify=TRUE,
    varmod.method = "none", varmod.exponent = 1,
    varmod.conv = 1, varmod.clamp = .1, varmod.minspan = -3,
    Scale.y = NULL, ...)

## S3 method for class 'fit'
earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
    weights = NULL, wp = NULL, subset = NULL,
    na.action = na.fail, offset = NULL,
    pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
    keepxy = FALSE, trace = 0, glm = NULL, degree = 1,
    penalty = if(degree > 1) 3 else 2,
    nk = min(200, max(20, 2 * ncol(x))) + 1,
    thresh = 0.001, minspan = 0, endspan = 0,
    newvar.penalty = 0, fast.k = 20, fast.beta = 1,
    linpreds = FALSE, allowed = NULL,
    nprune = NULL, Object = NULL,
    Scale.y = NULL, Adjust.endspan = 2, Auto.linpreds = TRUE,
    Force.weights = FALSE, Use.beta.cache = TRUE, Force.xtx.prune = FALSE,
    Get.leverages = NROW(x) < 1e5, Exhaustive.tol = 1e-10, ...)

Arguments

To start off, look at the arguments formula, data, x, y, nk, degree, and trace.
If the response is binary or a factor, consider using the glm argument.
For cross validation, use the nfold argument.
For prediction intervals, use the varmod.method argument.

Most users will find that the above arguments are all they need, plus in some cases keepxy and nprune. Unless you are a knowledgeable user, it's best not subvert the standard algorithm by toying with tuning parameters such as thresh, penalty, and endspan.

`formula`	Model formula.
`data`	Data frame for `formula`.
`x`	Matrix or dataframe containing the independent variables.
`y`	Vector containing the response variable, or, in the case of multiple responses, a matrix or dataframe whose columns are the values for each response.
`subset`	Index vector specifying which cases to use, i.e., which rows in `x` to use. Default is NULL, meaning all.
`weights`	Case weights. Default is NULL, meaning no case weights. If specified, `weights` must have length equal to `nrow(x)` before applying `subset`. Zero weights are converted to a very small nonzero value. In the current implementation, building models with weights can be slow.
`wp`	Response weights. Default is NULL, meaning no response weights. If specified, `wp` must have an element for each column of `y` (after `factors` in `y`, if any, have been expanded). Zero weights are converted to a very small nonzero value.
`na.action`	NA action. Default is `na.fail`, and only `na.fail` is supported.
`offset`	Offset term passed from the formula in `earth.formula`.
`keepxy`	Default is `FALSE`. Set to `TRUE` to retain the following in the returned value: `x` and `y` (or `data`), `subset`, and `weights`. The function `update.earth` and friends will use these if present instead of searching for them in the environment at the time `update.earth` is invoked. When the `nfold` argument is used with `keepxy=TRUE`, `earth` keeps more data and calls `predict.earth` multiple times to generate `cv.oof.rsq.tab` and `cv.infold.rsq.tab` (see the `cv.` arguments in `earth.object`). It therefore makes cross-validation significantly slower.
`trace`	Trace `earth`'s execution. Values: `0` (default) no tracing `.3` variance model (the `varmod.method` arg) `.5` cross validation (the `nfold` arg) `1` overview `2` forward pass `3` pruning `4` model mats summary, pruning details `5` full model mats, internal details of operation
`glm`	NULL (default) or a list of arguments to pass on to `glm`. See the documentation of `glm` for a description of these arguments See “Generalized linear models” in the vignette. Example: `earth(survived~., data=etitanic, degree=2, glm=list(family=binomial))` The following arguments are for the forward pass.
`degree`	Maximum degree of interaction (Friedman's `mi`). Default is `1`, meaning build an additive model (i.e., no interaction terms).
`penalty`	Generalized Cross Validation (GCV) penalty per knot. Default is `if(degree>1) 3 else 2`. Simulation studies suggest values in the range of about `2` to `4`. The FAQ section in the vignette has some information on GCVs. Special values (for use by knowledgeable users): The value `0` penalizes only terms, not knots. The value `-1` means no penalty, so GCV = RSS/n.
`nk`	Maximum number of model terms before pruning, i.e., the maximum number of terms created by the forward pass. Includes the intercept. The actual number of terms created by the forward pass will often be less than `nk` because of other stopping conditions. See “Termination conditions for the forward pass” in the vignette. The default is semi-automatically calculated from the number of predictors but may need adjusting.
`thresh`	Forward stepping threshold. Default is `0.001`. This is one of the arguments used to decide when forward stepping should terminate: the forward pass terminates if adding a term changes RSq by less than `thresh`. See “Termination conditions for the forward pass” in the vignette.
`minspan`	Minimum number of observations between knots. (This increases resistance to runs of correlated noise in the input data.) The default `minspan=0` is treated specially and means calculate the `minspan` internally, as per Friedman's MARS paper section 3.8 with `alpha` = 0.05. Set `trace>=2` to see the calculated value. Use `minspan=1` and `endspan=1` to consider all x values. Negative values of `minspan` specify the maximum number of knots per predictor. These will be equally spaced. For example, `minspan=-3` allows three evenly spaced knots for each predictor. As always, knots that fall in the end zones specified by `endspan` will be ignored.
`endspan`	Minimum number of observations before the first and after the final knot. The default `endspan=0` is treated specially and means calculate the `endspan` internally, as per the MARS paper equation 45 with `alpha` = 0.05. Set `trace>=2` to see the calculated value. Be wary of reducing `endspan`, especially if you plan to make predictions beyond or near the limits of the training data. Overfitting near the edges of training data is much more likely with a small `endspan`. The model's `RSq` and `GRSq` won't indicate when this overfitting is occurring. (A `plotmo` plot can help: look for sharp hinges at the edges of the data). See also the `Adjust.endspan` argument.
`newvar.penalty`	Penalty for adding a new variable in the forward pass (Friedman's `gamma`, equation 74 in the MARS paper). Default is `0`, meaning no penalty for adding a new variable. Useful non-zero values typically range from about `0.01` to `0.2` and sometimes higher — you will need to experiment. A word of explanation. With the default `newvar.penalty=0`, if two variables have nearly the same effect (e.g. they are collinear), at any step in the forward pass `earth` will arbitrarily select one or the other (depending on noise in the sample). Both variables can appear in the final model, complicating model interpretation. On the other hand with a non-zero `newvar.penalty`, the forward pass will be reluctant to add a new variable — it will rather try to use a variable already in the model, if that does not affect RSq too much. The resulting final model may be easier to interpret, if you are lucky. There will often be a small performance hit (a worse GCV).
`fast.k`	Maximum number of parent terms considered at each step of the forward pass. (This speeds up the forward pass. See the Fast MARS paper section 3.0.) Default is `20`. A value of `0` is treated specially (as being equivalent to infinity), meaning no Fast MARS. Typical values, apart from `0`, are `20`, `10`, or `5`. In general, with a lower `fast.k` (say `5`), `earth` is faster; with a higher `fast.k`, or with `fast.k` disabled (set to `0`), `earth` builds a better model. However, because of random variation this general rule often doesn't apply.
`fast.beta`	Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is `1`. A value of `0` sometimes gives better results.
`linpreds`	Index vector specifying which predictors should enter linearly, as in `lm`. The default is `FALSE`, meaning predictors enter in the standard MARS fashion, i.e., in hinge functions. The linpreds argument does not specify that a predictor must enter the model; only that if it enters, it enters linearly. See “The linpreds argument” in the vignette. See also the `Auto.linpreds` argument below (which describes how `earth` will automatically treat a predictor as linear under certain conditions). Details: A predictor's index in `linpreds` is the column number in the input matrix `x` (after factors have been expanded). `linpreds=TRUE` makes all predictors enter linearly (the `TRUE` gets recycled). `linpreds` may be a character vector e.g. `linpreds=c("wind", "vis")`. Note: `grep` is used for matching. Thus `"wind"` will match all variables that have `"wind"` in their names. Use `"^wind$"` to match only the variable named `"wind"`.
`allowed`	Function specifying which predictors can interact and how. Default is NULL, meaning all standard MARS terms are allowed. During the forward pass, `earth` calls the `allowed` function before considering a term for inclusion; the term can go into the model only if the `allowed` function returns `TRUE`. See “The allowed argument” in the vignette. The following arguments are for the pruning pass.
`pmethod`	Pruning method. One of: `backward none exhaustive forward seqrep cv`. Default is `"backward"`. Specify `pmethod="cv"` to use cross-validation to select the number of terms. This selects the number of terms that gives the maximum mean out-of-fold RSq on the fold models. Requires the `nfold` argument. Use `"none"` to retain all the terms created by the forward pass. If `y` has multiple columns, then only `"backward"` or `"none"` is allowed. Pruning can take a while if `"exhaustive"` is chosen and the model is big (more than about 30 terms). The current version of the `leaps` package used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows use the Task Manager or from the command line use `taskkill`).
`nprune`	Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to enforce an upper bound on the model size (that is less than `nk`), or to reduce exhaustive search time with `pmethod="exhaustive"`. The following arguments are for cross validation.
`nfold`	Number of cross-validation folds. Default is `0`, no cross validation. If greater than `1`, `earth` first builds a standard model as usual with all the data. It then builds `nfold` cross-validated models, measuring R-Squared on the out-of-fold (left out) data each time. The final cross validation R-Squared (`CVRSq`) is the mean of these out-of-fold R-Squareds. The above process of building `nfold` models is repeated `ncross` times (by default, once). Use `trace=.5` to trace cross-validation. Further statistics are calculated if `keepxy=TRUE` or if a binomial or poisson model (specified with the `glm` argument). See “Cross validation” in the vignette.
`ncross`	Only applies if `nfold>1`. Number of cross-validations. Each cross-validation has `nfold` folds. Default `1`.
`stratify`	Only applies if `nfold>1`. Default is `TRUE`. Stratify the cross-validation samples so that an approximately equal number of cases with a non-zero response occur in each cross validation subset. So if the response `y` is logical, the `TRUE`s will be spread evenly across folds. And if the response is a multilevel factor, there will be an approximately equal number of each factor level in each fold (because a multilevel factor response gets expanded to columns of zeros and ones, see “Factors” in the vignette). We say “approximately equal” because the number of occurrences of a factor level may not be exactly divisible by the number of folds. The following arguments are for variance models.
`varmod.method`	Construct a variance model. For details, see `varmod` and the vignette “Variance models in earth”. Use `trace=.3` to trace construction of the variance model. This argument requires `nfold` and `ncross`. (We suggest at least `ncross=30` here to properly calculate the variance of the errors — although you can use a smaller value, say `3`, for debugging.) The `varmod.method` argument should be one of `"none"` Default. Don't build a variance model. `"const"` Assume homoscedastic errors. `"lm"` Use `lm` to estimate standard deviation as a function of the predicted response. `"rlm"` Use `rlm`. `"earth"` Use `earth`. `"gam"` Use `gam`. This will use either `gam` or the `mgcv` package, whichever is loaded. `"power"` Estimate standard deviation as `intercept + coef * predicted.response^exponent`, where `intercept`, `coef`, and `exponent` will be estimated by `nls`. This is equivalent to `varmod.method="lm"` except that `exponent` is automatically estimated instead of being held at the value set by the `varmod.exponent` argument. `"power0"` Same as `"power"` but no intercept (offset) term. `"x.lm"`, `"x.rlm"`, `"x.earth"`, `"x.gam"` Like the similarly named options above, but estimate standard deviation by regressing on the predictors `x` (instead of the predicted response). A current implementation restriction is that `"x.gam"` allows only models with one predictor (`x` must have only one column).
`varmod.exponent`	Power transform applied to the rhs before regressing the absolute residuals with the specified `varmod.method`. Default is `1`. For example, with `varmod.method="lm"`, if you expect the standard deviance to increase linearly with the mean response, use `varmod.exponent=1`. If you expect the standard deviance to increase with the square root of the mean response, use `varmod.exponent=.5` (where negative response values will be treated as `0`, and you will get an error message if more than 20% of them are negative).
`varmod.conv`	Convergence criterion for the Iteratively Reweighted Least Squares used when creating the variance model. Iterations stop when the mean value of the coefficients of the residual model change by less than `varmod.conv` percent. Default is `1` percent. Negative values force the specified number of iterations, e.g. `varmod.conv=-2` means iterate twice. Positive values are ignored for `varmod="const"` and also currently ignored for `varmod="earth"` (these are iterated just once, the same as using `varmod.conv=-1`).
`varmod.clamp`	The estimated standard deviation of the main model errors is forced to be at least a small positive value, which we call `min.sd`. This prevents negative or absurdly small estimated standard deviations. Clamping takes place in `predict.varmod`, which is called by `predict.earth` when estimating prediction intervals. The value of `min.sd` is determined when building the variance model as `min.sd = varmod.clamp * mean(sd(training.residuals))`. The default `varmod.clamp` is `0.1`.
`varmod.minspan`	Only applies when `varmod.method="earth"` or `"x.earth"`. This is the `minspan` used in the internal call to `earth` when creating the variance model (not the main `earth` model). Default is `-3`, i.e., three evenly spaced knots per predictor. Residuals tend to be very noisy, and allowing only this small number of knots helps prevent overfitting. The following arguments are for internal or advanced use.
`Object`	Earth object to be updated, for use by `update.earth`.
`Scale.y`	`Scale` the response internally in the forward pass. Scaling here means subtract the mean and divide by the standard deviation. For single-response models, the default is `Scale.y = TRUE`. Scaling is invisible to the user, up to numerical differences, but does provide better numeric stability. For multiple-response models, the default is `FALSE`. If `Scale.y` is set `TRUE`, each column of the response is independently scaled. This can prevent one response from “overwhelming” the others, and earth typically generates a different set of hinge functions.
`Adjust.endspan`	In interaction terms, `endspan` gets multiplied by this value. This reduces the possibility of an overfitted interaction term supported by just a few cases on the boundary of the predictor space (as sometimes seen in our simulation studies). The default is `2`. Use `Adjust.endspan=1` for compatibility with old versions of `earth`.
`Auto.linpreds`	Default is `TRUE`, which works as follows (see example): At any step in the forward pass, if earth discovers that the best knot for the best predictor is at the predictor minimum (in the training data), then earth adds the predictor to the model as a linear “basis function” (with no hinge). Compare the following basis functions (printed in bold) for an example such predictor `x`: `Auto.linpreds=TRUE` (default): `x` `Auto.linpreds=FALSE`: `max(x-99, 0)` where `99` is the minimum `x` in the training data. Using `Auto.linpreds=FALSE` always forces a knot, even when the knot is at the minimum value of the variable. This ensures that the basis functions are always expressed as hinge functions (and will always be non-negative). Note that `Auto.linpreds` affects only how the model behaves outside the training data. Thus `predict.earth` will make the same predictions from the training data, regardless of whether the earth model was built with `Auto.linpreds` set `TRUE` or `FALSE` (up to possible differences in the size of the model caused by different GCVs because of the different forms of the terms).
`Force.weights`	Default is `FALSE`. For testing the `weights` argument. Force use of the code for handling weights in the `earth` code, even if `weights=NULL` or all the weights are the same. This will not necessarily generate an identical model, primarily because the non-weighted code requires some tests for numerical stability that can sometimes affect knot selection.
`Use.beta.cache`	Default is `TRUE`. Using the “beta cache” takes a little more memory but is faster (by 20% and often much more for large models). The beta cache uses `nk * nk * ncol(x) * sizeof(double)` bytes. (The beta cache is an innovation in this implementation of MARS and does not appear in Friedman's papers. It is not related to the `fast.beta` argument. Certain regression coefficients in the forward pass can be saved and re-used, thus saving recalculation time.)
`Force.xtx.prune`	Default is `FALSE`. This argument pertains to subset evaluation in the pruning pass. By default, if `y` has a single column then `earth` calls the `leaps` routines; if `y` has multiple columns then `earth` calls `EvalSubsetsUsingXtx`. The `leaps` routines are numerically more stable but do not support multiple responses (`leaps` is based on the QR decomposition and `EvalSubsetsUsingXtx` is based on the inverse of X'X). Setting `Force.xtx.prune=TRUE` forces use of `EvalSubsetsUsingXtx`, even if `y` has a single column.
`Get.leverages`	Default is `TRUE` unless the model has more than 100 thousand cases. The leverages are the diagonal hat values for the linear regression of `y` on `bx`. (The leverages are needed only for certain model checks, for example when `plotres` is called with `versus=4`). Details: This argument was introduced to reduce peak memory usage. When `n >> p`, memory use peaks when `earth` is calculating the leverages.
`Exhaustive.tol`	Default `1e-10`. Applies only when `pmethod="exhaustive"`. If the reciprocal of the condition number of `bx` is less than `Exhaustive.tol`, `earth` forces `pmethod="backward"`. See “XHAUST returned error code -999” in the vignette.
`...`	Dots are passed on to `earth.fit`.

Value

An S3 model of class "earth". See earth.object for a complete description.

Author(s)

Stephen Milborrow, derived from mda::mars by Trevor Hastie and Robert Tibshirani.

The approach used for GLMs was motivated by work done by Jane Elith and John Leathwick (a representative paper is given below).

The evimp function uses ideas from Max Kuhn's caret package https://CRAN.R-project.org/package=caret.

Parts of Thomas Lumley's leaps package have been incorporated into earth, so earth can directly access Alan Miller's Fortran functions without going through hidden functions in the leaps package.

References

The Wikipedia article is recommended for an elementary introduction. The primary references are the Friedman papers, but readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. Faraway takes a hands-on approach, using the ozone data to compare mda::mars with other techniques. (If you use Faraway's examples with earth instead of mars, use $bx instead of $x, and check out the book's errata.) Friedman and Silverman is recommended background reading for the MARS paper. Earth's pruning pass uses code from the leaps package which is based on techniques in Miller.

Faraway (2005) Extending the Linear Model with R https://www.maths.bath.ac.uk/~jjf23

Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1–141 http://projecteuclid.org/euclid.aos/1176347963
doi:10.1214/aos/1176347963

Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 https://statistics.stanford.edu/research/fast-mars

Friedman and Silverman (1989) Flexible Parsimonious Smoothing and Additive Modeling Technometrics, Vol. 31, No. 1.

Hastie, Tibshirani, and Friedman (2009) The Elements of Statistical Learning (2nd ed.) https://hastie.su.domains/pub.htm

Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052 https://hastie.su.domains/pub.htm

Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression https://wp.csiro.au/alanmiller/index.html

Wikipedia article on MARS https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

Examples

earth.mod <- earth(Volume ~ ., data = trees)
plotmo(earth.mod)
summary(earth.mod, digits = 2, style = "pmax")

[Package earth version 5.3.3 Index]