R: Fit a Linear Model with a Breakpoint

'lm.br' {lm.br}

R Documentation

Fit a Linear Model with a Breakpoint

Description

Exact significance tests for a changepoint in linear or multiple linear regression. Confidence intervals and confidence regions with exact coverage probabilities for the changepoint.

Usage

lm.br(formula, type ="LL", data, subset, weights, inverse =FALSE,
      var.known =FALSE, na.action, contrasts, offset, ...)

Arguments

`formula`	a formula expression of the form `response ~ predictors`, the same as for regression models; see `formula`
`type`	"LL", "LT" or "TL" which stand for line-line, line-threshold or threshold-line, defined below
`data`	an optional data-frame that assigns values in `formula`
`subset`	expression saying which subset of the data to use
`weights`	vector or matrix
`inverse`	if TRUE then 'weights' specifies the inverse of the weights vector or matrix, as for a covariance matrix
`var.known`	is the variance known?
`na.action`	a function to filter missing data
`contrasts`	an optional list; see 'contrasts.arg' in `model.matrix`
`offset`	a constant vector to be subtracted from the responses vector
`...`	other arguments to `lm.fit` or `lm.wfit`

Details

A broken-line model consists of two straight lines joined at a changepoint. Three versions are

LL   y = alpha + B * min(x - theta, 0) + Bp * max(x - theta, 0) + e

LT   y = alpha + B * min(x - theta, 0) + e

TL   y = alpha + Bp * max(x - theta, 0) + e

where e ~ Normal( 0, var * inv(weights) ). The LT and TL versions omit 'alpha' if the formula is without intercept, such as 'y~x+0'. Parameters 'theta', 'alpha', 'B', 'Bp', 'var' are unknown, but 'weights' is known.

The same models apply for a multiple-regression formula such as 'y ~ x1 + x2 + ... + xn' where 'alpha' becomes the coefficient of the "1"-vector and 'theta' the changepoint for the coefficient of the first predictor term, 'x1'.

The test for the presence of a changepoint is by a postulate value outside the range of 'x'-values. Thus, in the LL model 'sl( min(x1) - 1 )' would give the exact significance level of the null hypothesis "single line" versus the alternate hypothesis "broken line."

Exact inferences about the changepoint 'theta' or '(theta,alpha)' are based on the distribution of its likelihood-ratio statistic, conditional on sufficient statistics for the other parameters. This method is called conditional likelihood-ratio (CLR) for short.

Value

'lm.br' returns a list that includes a C++ object with accessor functions. Functions sl, ci and cr get significance levels, confidence intervals, and confidence regions for the changepoint's x-coordinate or (x,y)-coordinates. Other functions are mle to get maximum likelihood estimates and sety to set new y-values. The returned object also lists 'coefficients', 'fitted.values' and 'residuals', the same as for an 'lm' output list.

Note

Data can include more than one 'y' value for a repeat 'x' value. If variance is known, then 'var' = 1 and 'weights' is the inverse of the variances vector or variance-covariance matrix.

References

Knowles, M., Siegmund, D. and Zhang, H.P. (1991) Confidence regions in semilinear regression, _Biometrika_, *78*, 15-31.

Siegmund, D. and Zhang, H.P. (1994), Confidence regions in broken line regression, in "Change-point Problems", _IMS Lecture Notes – Monograph Series_, *23*, eds. E. Carlstein, H. Muller and D. Siegmund, Hayward, CA: Institute of Mathematical Statistics, 292-316.

Examples

#  Smith & Cook (1980), "Straight Lines with a Change-point: A Bayesian
#  Analysis of some Renal Transplant Data", Appl Stat, *29*, 180-189,  
#  reciprocal of blood creatinine L/micromol  vs  day after transplant.
creatinine <- c(37.3, 47.1, 51.5, 67.6, 75.9, 73.3, 69.4, 61.5, 31.8, 19.4)
day <- 1:10
sc <- lm.br( creatinine ~ day )
sc $ mle()
sc $ ci()
sc $ sl( day[1] - 1.5 )      # test for the presence of a changepoint
plot( sc$residuals )


#  A 'TL' example, data from figure 1 in Chiu et al. (2006), "Bent-cable
#  regression theory and applications", J Am Stat Assoc, *101*, 542-553,
#  log(salmon abundance) vs year.
salmon <- c( 2.50, 2.93, 2.94, 2.83, 2.43, 2.84, 3.06, 2.97, 2.94, 2.65,
  2.92, 2.71, 2.93, 2.60, 2.12, 2.08, 1.81, 2.45, 1.71, 0.55, 1.30 )
year <- 1980 : 2000
chiu <- lm.br( salmon ~ year, 'tl' )
chiu $ ci()


#  A multiple regression example, using an R dataset,
#  automobile miles-per-gallon  versus  weight and horsepower.
lm.br( mpg ~ wt + hp,  data = mtcars )


#  An example with variance known, for the Normal approximations of binomial
#  random variables using formula 2.28 of Cox and Snell (1989).
#    Ex. 3.4 of Freeman (2010) "Inference for binomial changepoint data" in
# _Advances in Data Analysis_, ed. C Skiadas, Boston: Birkhauser, 345-352.
trials <- c( 15, 82, 82, 77, 38, 81, 12, 97, 33, 75,
  85, 37, 44, 96, 76, 26, 91, 47, 41, 35 )
successes <- c( 8, 44, 47, 39, 24, 38, 3, 51, 16, 43,
  47, 27, 33, 64, 41, 18, 61, 32, 33, 24 )
log_odds <- log( (successes - 0.5)/(trials - successes - 0.5) )
variances <- (trials-1)/( successes*(trials-successes) )
group <- 1 : 20
lm.br( log_odds ~ group, 'TL', w= variances, inv= TRUE, var.known= TRUE )


#  An example that shows different confidence regions from inference by
#  conditional likelihood-ratio (CLR)  versus  approximate-F (AF).
y <- c( 1.6, 3.2, 6.3, 4.8, 4.3, 4.0, 3.5, 1.8 )
x <- 1:8
eg <- lm.br( y ~ x )
eg$cr( output='t' )
eg$cr( method = 'aF', output='t' )

[Package lm.br version 2.9.7 Index]