R: Iterative bias reduction smoothing

ibr {ibr}

R Documentation

Iterative bias reduction smoothing

Description

Performs iterative bias reduction using kernel smoothers, thin plate splines, Duchon splines or low rank splines. Missing values are not allowed.
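The idea behind iterative bias reduction can be sketched in a few lines of base R. This is a minimal illustration for a single covariate with a Gaussian kernel, not the package's implementation: `kernel_smoother` and `ibr_sketch` are hypothetical helpers, and the convention that `iter = 1` means the pilot fit is an assumption of this sketch.

```r
# Hypothetical helpers for illustration only; these are not functions of ibr.

# Gaussian-kernel (Nadaraya-Watson) smoothing matrix for one covariate.
kernel_smoother <- function(x, bandwidth) {
  K <- exp(-0.5 * (outer(x, x, "-") / bandwidth)^2)
  K / rowSums(K)                       # each row sums to one
}

# Start from an oversmoothed pilot fit, then repeatedly estimate the bias
# by smoothing the current residuals and add that estimate to the fit.
ibr_sketch <- function(x, y, bandwidth, iter) {
  S <- kernel_smoother(x, bandwidth)
  fit <- as.vector(S %*% y)            # pilot fit (iter = 1 in this sketch)
  for (k in seq_len(iter - 1)) {
    fit <- fit + as.vector(S %*% (y - fit))
  }
  fit
}

set.seed(1)
x <- sort(runif(80))
y <- sin(2 * pi * x) + rnorm(80, sd = 0.2)
pilot     <- ibr_sketch(x, y, bandwidth = 0.3, iter = 1)
corrected <- ibr_sketch(x, y, bandwidth = 0.3, iter = 50)

# Residual sum of squares of the pilot and of the bias-corrected fit
c(pilot = sum((y - pilot)^2), corrected = sum((y - corrected)^2))
```

With smoothing matrix S, the fit after k steps of this loop equals (I - (I - S)^k) y, which is why the number of iterations acts as the smoothing parameter.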

Usage

ibr(formula, data, subset, criterion="gcv", df=1.5, Kmin=1, Kmax=1e+06, smoother="k",
 kernel="g", rank=NULL, control.par=list(), cv.options=list())

Arguments

`formula`	An object of class `"formula"` (or one that can be coerced to that class): a symbolic description of the model to be fitted.
`data`	An optional data frame, list or environment (or object coercible by `as.data.frame` to a data frame) containing the variables in the model. If not found in `data`, the variables are taken from `environment(formula)`, typically the environment from which `ibr` is called.
`subset`	An optional vector specifying a subset of observations to be used in the fitting process.
`criterion`	A vector of character strings. If the number of iterations (`iter`) is missing or `NULL`, the number of iterations is chosen using either one criterion (the first element of `criterion`) or several (see component `criterion` of the argument list `control.par`). The available criteria are GCV (default, `"gcv"`), AIC (`"aic"`), corrected AIC (`"aicc"`), BIC (`"bic"`), gMDL (`"gmdl"`), MAP (`"map"`) and RMSE (`"rmse"`). The last two are designed for cross-validation.
`df`	A numeric vector of either length 1 or length equal to the number of columns of `x`. If `smoother="k"`, it indicates the desired effective degree of freedom (trace) of the smoothing matrix for each variable or for the initial smoother (see `control.par$dftotal`); `df` is recycled when its length is 1. If `smoother="tps"` or `smoother="ds"`, the minimum df of splines is multiplied by `df`. This argument is ignored if `bandwidth` is supplied (non-`NULL`).
`Kmin`	The minimum number of bias correction iterations of the search grid considered by the model selection procedure for selecting the optimal number of iterations.
`Kmax`	The maximum number of bias correction iterations of the search grid considered by the model selection procedure for selecting the optimal number of iterations.
`smoother`	Character string which selects the smoother: thin plate splines (`"tps"`), Duchon splines (`"ds"`, see Duchon, 1977) or kernel (`"k"`).
`kernel`	Character string which selects the kernel: Gaussian (`"g"`), Epanechnikov (`"e"`), uniform (`"u"`) or quartic (`"q"`). The default (Gaussian kernel) is strongly advised.
`rank`	Numeric value that controls the rank of low rank splines (denoted as `k` in the mgcv package; see also `choose.k` for further details, or `gam` for another smoothing approach with a reduced rank smoother).
`control.par`	A named list that controls optional parameters. The components are `bandwidth` (default `NULL`), `iter` (default `NULL`), `really.big` (default `FALSE`), `dftobwitmax` (default 1000), `exhaustive` (default `FALSE`), `m` (default `NULL`), `s` (default `NULL`), `dftotal` (default `FALSE`), `accuracy` (default 0.01), `ddlmaxi` (default 2n/3), `fraction` (default `c(100, 200, 500, 1000, 5000, 10^4, 5e+04, 1e+05, 5e+05, 1e+06)`), `scale` (default `FALSE`), `criterion` (default `"strict"`) and `aggregfun` (default 10^(floor(log10(x[2]))+2)).
	`bandwidth`: a vector of either length 1 or length equal to the number of columns of `x`. If `smoother="k"`, it indicates the bandwidth used for each variable; `bandwidth` is recycled when its length is 1. If `smoother="tps"`, it indicates the amount of penalty (coefficient lambda). The default (missing) indicates, for `smoother="k"`, that the bandwidth for each variable is chosen such that each univariate kernel smoother (one per explanatory variable) has `df` effective degrees of freedom and, for `smoother="tps"` or `smoother="ds"`, that lambda is chosen such that the df of the smoothing matrix is `df` times the minimum df.
	`iter`: the number of iterations. If `NULL` or missing, an optimal number of iterations is chosen from the search grid (integers from `Kmin` to `Kmax`) to minimize the `criterion`.
	`really.big`: a boolean. If `TRUE`, it overrides the limitation to 500 observations. Expect long computation times if `TRUE`.
	`dftobwitmax`: when the bandwidth is chosen by specifying the effective degree of freedom (see `df`), a search is done by `uniroot`. This component specifies the maximum number of iterations passed to `uniroot`.
	`exhaustive`: a boolean. If `TRUE`, an exhaustive search for the optimal number of iterations on the grid `Kmin:Kmax` is performed, and all criteria for all iterations in the same class (class one: GCV, AIC, corrected AIC, BIC, gMDL; class two: MAP, RMSE) are returned in component `allcriteria`. If `FALSE`, the minimum of the criterion is searched using `optimize` between `Kmin` and `Kmax`.
	`m`: the order of the derivatives in the penalty (for thin plate splines, the order of the spline). This integer must satisfy (2m+2s)/d>1, where d is the number of explanatory variables. For `smoother="tps"`, the default is the first integer m such that 2m/d>1. For `smoother="ds"`, the default is m=2 (pseudo-cubic splines).
	`s`: the power of the weighting function. For thin plate splines s is equal to 0. This real number must be strictly smaller than d/2 (where d is the number of explanatory variables) and must satisfy (2m+2s)/d>1. To get pseudo-cubic splines (the default), choose m=2 and s=(d-1)/2 (see Duchon, 1977).
	`dftotal`: a boolean which indicates whether the argument `df` is the target df of each univariate kernel, calculated for each explanatory variable (`FALSE`, the default), or of the overall (product) kernel, that is the base smoother (`TRUE`).
	`accuracy`: the tolerance used when searching for the bandwidths that lead to a chosen overall initial df.
	`ddlmaxi`: the maximum effective degree of freedom allowed for the iterated bias reduction smoother.
	`fraction`: the subdivision of the interval [`Kmin`, `Kmax`] used if a non-exhaustive search is performed (see also `iterchoiceA` or `iterchoiceS1`).
	`scale`: a boolean. If `TRUE`, `x` is scaled (using `scale`); defaults to `FALSE`.
	`criterion`: a character string, one of `strict`, `aggregation` or `recalc`; defaults to `strict`. `strict` selects the number of iterations according to the first element of argument `criterion`. `aggregation` selects the number of iterations by applying the function `control.par$aggregfun` to the numbers of iterations selected by all the criteria given in argument `criterion`. `recalc` first calculates the optimal number of iterations for the second element of argument `criterion`, then applies the function `control.par$aggregfun` (to add some number to it), which gives a new `Kmax`, and finally performs the optimal selection between `Kmin` and this new `Kmax` using the first element of argument `criterion`.
	`aggregfun`: the function applied when `control.par$criterion` is either `recalc` or `aggregation`.
`cv.options`	A named list which controls how cross-validation is done, with components `bwchange`, `ntest`, `ntrain`, `Kfold`, `type`, `seed`, `method` and `npermut`. `bwchange` is a boolean (default `FALSE`) which indicates whether the bandwidths have to be recomputed each time. `ntest` is the number of observations in the test set and `ntrain` is the number of observations in the training set; only one of these is needed, and the other can be `NULL` or missing. `Kfold` is a boolean or an integer. If `Kfold` is `TRUE`, the number of folds is deduced from `ntest` (or `ntrain`). `type` is a character string, one of `random`, `timeseries`, `consecutive` or `interleaved`, which gives the type of segments. `seed` controls the seed of the random generator. `method` is either `"inmemory"` or `"outmemory"`; `"inmemory"` moves some calculations outside the loop, which saves computation time but increases the required memory. `npermut` is the number of random draws. If `cv.options` is `list()`, then component `ntest` is set to `floor(nrow(x)/10)`, `type` is `random`, `npermut` is 20, `method` is `"inmemory"`, and the other components are `NULL`.
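Two mechanisms described in the arguments above can be sketched in base R for a single covariate: turning a target `df` into a bandwidth with `uniroot` (as mentioned for `dftobwitmax`), and an exhaustive search over `Kmin:Kmax` with GCV. This is a hedged sketch, not the package's code. `kernel_smoother`, `bandwidth_for_df` and `gcv_path` are hypothetical helpers, the search interval for the bandwidth is an arbitrary choice, and the GCV expression is the textbook form (mean squared residual divided by (1 - df/n)^2), which may differ from the exact expression used by `ibr`.

```r
# Hypothetical helpers for illustration only; these are not functions of ibr.
kernel_smoother <- function(x, bandwidth) {
  K <- exp(-0.5 * (outer(x, x, "-") / bandwidth)^2)
  K / rowSums(K)
}

# Bandwidth whose smoothing matrix has trace `df`, found with uniroot.
# The bracketing interval is an assumption of this sketch.
bandwidth_for_df <- function(x, df, maxiter = 1000) {
  trace_gap <- function(h) sum(diag(kernel_smoother(x, h))) - df
  uniroot(trace_gap, interval = diff(range(x)) * c(1e-3, 1e3),
          tol = 1e-8, maxiter = maxiter)$root
}

# Effective df and textbook GCV for every iteration count in Kmin:Kmax.
gcv_path <- function(S, y, Kmin = 1, Kmax = 100) {
  n <- length(y)
  R <- diag(n) - S
  Rk <- diag(n)
  iters <- Kmin:Kmax
  df <- gcv <- numeric(length(iters))
  for (k in seq_len(Kmax)) {
    Rk <- Rk %*% R                     # (I - S)^k maps y to the residuals
    if (k >= Kmin) {
      i <- k - Kmin + 1
      df[i] <- n - sum(diag(Rk))       # trace of I - (I - S)^k
      gcv[i] <- mean((Rk %*% y)^2) / (1 - df[i] / n)^2
    }
  }
  data.frame(iter = iters, df = df, gcv = gcv)
}

set.seed(1)
x <- sort(runif(80))
y <- sin(2 * pi * x) + rnorm(80, sd = 0.2)
h <- bandwidth_for_df(x, df = 1.5)     # pilot smoother with 1.5 df
path <- gcv_path(kernel_smoother(x, h), y, Kmin = 1, Kmax = 200)
path[which.min(path$gcv), ]            # iteration count minimizing GCV
```

The effective df grows with the number of iterations, so the grid search trades the bias of the oversmoothed pilot against the variance of the later iterates.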

Value

Returns an object of class ibr which is a list including:

`beta`	Vector of coefficients.
`residuals`	Vector of residuals.
`fitted`	Vector of fitted values.
`iter`	The number of iterations used.
`initialdf`	The initial effective degree of freedom of the pilot (or base) smoother.
`finaldf`	The effective degree of freedom of the iterated bias reduction smoother at the `iter` iterations.
`bandwidth`	Vector of bandwidths, one for each explanatory variable.
`call`	The matched call
`parcall`	A list containing several components: `p`, the number of explanatory variables; `m`, the order of the splines (if relevant); `s`, the power of weights; `scaled`, a boolean which is `TRUE` when explanatory variables are scaled; `mean`, the mean of explanatory variables if `scaled=TRUE`; `sd`, the standard deviation of explanatory variables if `scaled=TRUE`; `critmethod`, which indicates the method chosen for criteria (for example `strict`); `rank`, the rank of low rank splines if relevant; `criterion`, the chosen criterion; `smoother`, the chosen smoother; `kernel`, the chosen kernel; `smoothobject`, the smooth object returned by `smoothCon`; `exhaustive`, a boolean which indicates whether an exhaustive search was chosen.
`criteria`	Value of the chosen criterion at the given iteration. `NA` is returned when aggregation of criteria is chosen (see component `criterion` of list `control.par`). If the number of iterations `iter` is given by the user, `NULL` is returned.
`alliter`	Numeric vector giving all the optimal numbers of iterations selected by the chosen criteria.
`allcriteria`	Either a list containing all the criteria evaluated on the grid `Kmin:Kmax` (along with the effective degree of freedom of the smoother and the sigma squared on this grid) if an exhaustive search is chosen (see the value of function `iterchoiceAe` or `iterchoiceS1e`), or all the values of the criteria at the given optimal iteration if a non-exhaustive search is chosen (see also component `exhaustive` of list `control.par`).
`terms`	The 'terms' object used.
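The components listed above can be read directly from the returned list. The following sketch fits a kernel smoother to simulated data (the data-generating model is made up for illustration) and inspects a few components; it only runs when the ibr package is installed.

```r
# Simulated data; the regression function below is an arbitrary illustration.
if (requireNamespace("ibr", quietly = TRUE)) {
  library(ibr)
  set.seed(1)
  dat <- data.frame(x = runif(100), y = runif(100))
  dat$z <- sin(2 * pi * dat$x) + dat$y^2 + rnorm(100, sd = 0.2)

  # Kernel smoother, number of iterations selected by the default criterion
  fit <- ibr(z ~ x + y, data = dat, df = 1.5, smoother = "k")

  # Iteration count, df of the pilot and final smoothers, bandwidths
  str(unclass(fit)[c("iter", "initialdf", "finaldf", "bandwidth")])

  # Fitted values and residuals for the first observations
  print(head(cbind(fitted = fit$fitted, residuals = fit$residuals)))
}
```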

Author(s)

Pierre-Andre Cornillon, Nicolas Hengartner and Eric Matzner-Lober.

References

Cornillon, P.-A.; Hengartner, N.; Jegou, N. and Matzner-Lober, E. (2012) Iterative bias reduction: a comparative study. Statistics and Computing, 23, 777-791.

Cornillon, P.-A.; Hengartner, N. and Matzner-Lober, E. (2013) Recursive bias estimation for multivariate regression smoothers. ESAIM: Probability and Statistics, 18, 483-502.

Cornillon, P.-A.; Hengartner, N. and Matzner-Lober, E. (2017) Iterative Bias Reduction Multivariate Smoothing in R: The ibr Package. Journal of Statistical Software, 77, 1–26.

Wood, S.N. (2003) Thin plate regression splines. J. R. Statist. Soc. B, 65, 95-114.

Examples

f <- function(x, y) { .75*exp(-((9*x-2)^2 + (9*y-2)^2)/4) +
                      .75*exp(-((9*x+1)^2/49 + (9*y+1)^2/10)) +
                      .50*exp(-((9*x-7)^2 + (9*y-3)^2)/4) -
                      .20*exp(-((9*x-4)^2 + (9*y-7)^2)) }
# define a (fine) x-y grid and calculate the function values on the grid
ngrid <- 50; xf <- seq(0,1, length=ngrid+2)[-c(1,ngrid+2)]
yf <- xf ; zf <- outer(xf, yf, f)
grid <- cbind.data.frame(x=rep(xf, ngrid),y=rep(xf, rep(ngrid, ngrid)),z=as.vector(zf))
persp(xf, yf, zf, theta=130, phi=20, expand=0.45,main="True Function")
# generate a data set with function f and signal to noise ratio 5
noise <- .2 ; N <- 100 
xr <- seq(0.05,0.95,by=0.1) ; yr <- xr ; zr <- outer(xr,yr,f) ; set.seed(25)
std <- sqrt(noise*var(as.vector(zr))) ; noise <- rnorm(length(zr),0,std)
Z <- zr + matrix(noise,sqrt(N),sqrt(N))
# transpose the data to a column format 
xc <- rep(xr, sqrt(N)) ; yc <- rep(yr, rep(sqrt(N),sqrt(N)))
data <- cbind.data.frame(x=xc,y=yc,z=as.vector(Z))
# fit by thin plate splines (of order 2) ibr
res.ibr <- ibr(z~x+y,data=data,df=1.1,smoother="tps")
fit <- matrix(predict(res.ibr,grid),ngrid,ngrid)
persp(xf, yf, fit ,theta=130,phi=20,expand=0.45,main="Fit",zlab="fit")

## Not run: 
data(ozone, package = "ibr")
res.ibr <- ibr(Ozone~.,data=ozone,df=1.1)
summary(res.ibr)
predict(res.ibr)
## End(Not run)

[Package ibr version 2.0-4 Index]