selgmented {segmented} | R Documentation |
Selecting the number of breakpoints in segmented regression
Description
This function selects (and estimates) the number of breakpoints of the segmented relationship according to the BIC/AIC criterion or sequential hypothesis testing.
Usage
selgmented(olm, seg.Z, Kmax=2, type = c("score", "bic", "davies", "aic"),
alpha = 0.05, control = seg.control(), refit = FALSE, stop.if = 5,
return.fit = TRUE, bonferroni = FALSE, msg = TRUE, plot.ic = FALSE, th = NULL,
G = 1, check.dslope = TRUE)
Arguments
olm |
A starting |
seg.Z |
A one-side formula for the segmented variable. Only one term can be included, and it can be omitted if |
Kmax |
The maximum number of breakpoints being tested. If |
type |
Which criterion should be used? Options |
alpha |
The fixed type I error probability when sequential hypothesis testing is carried out (i.e. |
control |
See |
refit |
If |
stop.if |
An integer. If, when trying models with an increasing (when |
return.fit |
If |
bonferroni |
If |
msg |
If |
plot.ic |
If |
th |
When a large number of breakpoints is being tested, it could happen that 2 estimated breakpoints are too close each other, and only one can be retained. Thus if the difference between two breakpoints is less or equal to |
G |
Number of sub-intervals to consider to search for the breakpoints when |
check.dslope |
Logical. Effective only if |
Details
The function uses properly the functions segmented
, pscore.test
or davies.test
to select the 'optimal' number of breakpoints 0,1,...,Kmax
. If type='bic'
or 'aic'
, the procedure stops if the last stop.if
fits have increasing values of the information criterion. Moreover, a breakpoint is removed if too close to other, actually if the difference between two consecutive estimates is less then th
. Finally, if check.dslope=TRUE
, breakpoints whose corresponding slope difference estimate is ‘small’ (i.e. -value larger then
alpha
or alpha/Kmax
) are also removed.
When the dataset is split into
groups, and the search is carried out separately within each group. This approach is fruitful when there are many breakpoints not evenly spaced in the covariate range and/or concentrated in some sub-intervals.
G=3
or 4
is recommended based on simulation evidence.
Note Kmax
is always tacitely reduced in order to have at least 1 residual df in the model with Kmax
changepoints. Namely, if , the maximal segmented model has
2*(Kmax + 1)
parameters, and therefore the largest Kmax
allowed is 8.
When type='score'
or 'davies'
, the function also returns the 'overall p-value' coming from combing the single p-values using the Fisher method. The pooled p-value, however, does not affect the final result which depends on the single p-values only.
Value
The returned object depends on argument return.fit
. If FALSE
, the returned object is a list with some information on the compared models (i.e. the BIC values), otherwise a classical 'segmented'
object (see segmented
for details) with the component selection.psi
including the A/BIC values and
- if refit=TRUE
, psi.no.refit
that represents the breakpoint values before the last fit (with boot restarting)
- if G>1
, cutvalues
including the cutoffs values used to split the data.
Note
If check.dslope=TRUE
, there is no guarantee that the final model has the lowest AIC/BIC. Namely the model with the best A/BIC could have ‘non-significant’ slope differences which will be removed (with the corresponding breakpoints) by the final model. Hence the possible plot (obtained via plot.ic=TRUE
) could be misleading. See Example 1 below.
Author(s)
Vito M. R. Muggeo
References
Muggeo V (2020) Selecting number of breakpoints in segmented regression: implementation in the R package segmented https://www.researchgate.net/publication/343737604
See Also
segmented
, pscore.test
, davies.test
Examples
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
os <-selgmented(out.lm) #selection (Kmax=2) via the Score test (default)
os <-selgmented(out.lm, type="bic", Kmax=3) #BIC-based selection
## Not run:
########################################
#Example 1: selecting a large number of breakpoints
b <- c(-1,rep(c(1.5,-1.5),l=15))
psi <- seq(.1,.9,l=15)
n <- 2000
x <- 1:n/n
X <- cbind(x, outer(x,psi,function(x,y)pmax(x-y,0)))
mu <- drop(tcrossprod(X,t(b)))
set.seed(113)
y<- mu + rnorm(n)*.022
par(mfrow=c(1,2))
#select number of breakpoints via the BIC (and plot it)
o<-selgmented(y, Kmax=20, type='bic', plot.ic=TRUE, check.dslope = FALSE)
plot(o, res=TRUE, col=2, lwd=3)
points(o)
# select via the BIC + check on the slope differences (default)
o1 <-selgmented(y, Kmax=20, type='bic', plot.ic=TRUE) #check.dslope = TRUE by default
#note the plot of BIC is misleading.. But the number of psi is correct
plot(o1, add=TRUE, col=3)
points(o1, col=3, pch=3)
##################################################
#Example 2: a large number of breakpoints not evenly spaced.
b <- c(-1,rep(c(2,-2),l=10))
psi <- seq(.5,.9,l=10)
n <- 2000
x <- 1:n/n
X <- cbind(x, outer(x,psi,function(x,y)pmax(x-y,0)))
mu <- drop(tcrossprod(X,t(b)))
y<- mu + rnorm(n)*.02
#run selgmented with G>1. G=3 or 4 recommended.
#note G=1 does not return the right number of breaks
o1 <-selgmented(y, type="bic", Kmax=20, G=4)
## End(Not run)