sfpl.kNN.fit {fsemipar}

R Documentation

SFPLM regularised fit using kNN estimation

Description

This function fits a sparse semi-functional partial linear model (SFPLM). It employs a penalised least-squares regularisation procedure, integrated with nonparametric kNN estimation using Nadaraya-Watson weights.

The procedure uses an objective criterion (criterion) to select both the number of nearest neighbours (k.opt) and the regularisation parameter (lambda.opt).

Usage

sfpl.kNN.fit(x, z, y, semimetric = "deriv", q = NULL, knearest = NULL,
min.knn = 2, max.knn = NULL, step = NULL, range.grid = NULL, 
kind.of.kernel = "quad", nknot = NULL, lambda.min = NULL, lambda.min.h = NULL, 
lambda.min.l = NULL, factor.pn = 1, nlambda = 100, lambda.seq = NULL, 
vn = ncol(z), nfolds = 10, seed = 123, criterion = "GCV", penalty = "grSCAD", 
max.iter = 1000)

Arguments

`x`	Matrix containing the observations of the functional covariate (functional nonparametric component), collected by row.
`z`	Matrix containing the observations of the scalar covariates (linear component), collected by row.
`y`	Vector containing the scalar response.
`semimetric`	Semi-metric function. Only `"deriv"` and `"pca"` are implemented. By default `semimetric="deriv"`.
`q`	Order of the derivative (if `semimetric="deriv"`) or number of principal components (if `semimetric="pca"`). The default values are 0 and 2, respectively.
`knearest`	Vector of positive integers containing the sequence from which the number of nearest neighbours `k.opt` is selected. If `knearest=NULL`, then `knearest <- seq(from = min.knn, to = max.knn, by = step)`.
`min.knn`	A positive integer that represents the minimum value in the sequence for selecting the number of nearest neighbours `k.opt`. This value should be less than the sample size. The default is 2.
`max.knn`	A positive integer that represents the maximum value in the sequence for selecting the number of nearest neighbours `k.opt`. This value should be less than the sample size. The default is `max.knn <- n%/%5`.
`step`	A positive integer used to construct the sequence of k-nearest neighbours as follows: `min.knn, min.knn + step, min.knn + 2*step, min.knn + 3*step,...`. The default value for `step` is `step <- ceiling(n/100)`.
`range.grid`	Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate `x` are evaluated (i.e. the range of the discretisation). If `range.grid=NULL`, then `range.grid=c(1,p)` is considered, where `p` is the discretisation size of `x` (i.e. `ncol(x)`).
`kind.of.kernel`	The type of kernel function used. Currently, only the Epanechnikov kernel (`"quad"`) is available.
`nknot`	Positive integer indicating the number of interior knots for the B-spline expansion of the functional covariate. The default value is `(p - order.Bspline - 1)%/%2`.
`lambda.min`	The smallest value for lambda (i.e. the lower endpoint of the sequence in which `lambda.opt` is selected), as a fraction of `lambda.max`. The default is `lambda.min.l` if the sample size is larger than `factor.pn` times the number of linear covariates and `lambda.min.h` otherwise.
`lambda.min.h`	The lower endpoint of the sequence in which `lambda.opt` is selected if the sample size is smaller than `factor.pn` times the number of linear covariates. The default is 0.05.
`lambda.min.l`	The lower endpoint of the sequence in which `lambda.opt` is selected if the sample size is larger than `factor.pn` times the number of linear covariates. The default is 0.0001.
`factor.pn`	Positive integer used to set `lambda.min`. The default value is 1.
`nlambda`	Positive integer indicating the number of values in the sequence from which `lambda.opt` is selected. The default is 100.
`lambda.seq`	Sequence of values in which `lambda.opt` is selected. If `lambda.seq=NULL`, then the programme builds the sequence automatically using `lambda.min` and `nlambda`.
`vn`	Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is `vn=ncol(z)`, resulting in the individual penalisation of each scalar covariate.
`nfolds`	Number of cross-validation folds (used when `criterion="k-fold-CV"`). Default is 10.
`seed`	Seed for the random number generator, used to ensure reproducible results (applicable when `criterion="k-fold-CV"`). The default seed value is 123.
`criterion`	The criterion used to select the tuning and regularisation parameters: `k.opt` and `lambda.opt` (also `vn.opt` if needed). Options include `"GCV"`, `"BIC"`, `"AIC"`, or `"k-fold-CV"`. The default setting is `"GCV"`.
`penalty`	The penalty function applied in the penalised least-squares procedure. Currently, only `"grLasso"` and `"grSCAD"` are implemented. The default is `"grSCAD"`.
`max.iter`	Maximum number of iterations allowed across the entire path. The default value is 1000.
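
As a rough illustration of how the defaults described above combine, the sketch below rebuilds the default `knearest` grid and the default `lambda.min` in base R. The values of `n` and `pn` are hypothetical, and the strict inequality used for `lambda.min` is an assumption taken from the wording "larger than"; the package may handle the boundary case differently.

```r
# Hypothetical sample size and number of scalar covariates (not package values).
n  <- 160
pn <- 7

# Default kNN grid, following the descriptions of min.knn, max.knn and step.
min.knn  <- 2
max.knn  <- n %/% 5            # integer division
step     <- ceiling(n / 100)
knearest <- seq(from = min.knn, to = max.knn, by = step)

# Default lambda.min, following the rule stated for lambda.min.
# Assumption: "larger than" is read as a strict inequality.
factor.pn    <- 1
lambda.min.h <- 0.05
lambda.min.l <- 0.0001
lambda.min   <- if (n > factor.pn * pn) lambda.min.l else lambda.min.h

print(knearest)
print(lambda.min)
```

With many more observations than scalar covariates, the smaller endpoint `lambda.min.l` is used, so the lambda sequence extends closer to the unpenalised fit.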

Details

The sparse semi-functional partial linear model (SFPLM) is given by the expression:

Y_i = Z_{i1}\beta_{01} + \dots + Z_{ip_n}\beta_{0p_n} + m(X_i) + \varepsilon_i,\ \ \ i = 1, \dots, n,

where Y_i denotes a scalar response, Z_{i1}, \dots, Z_{ip_n} are real random covariates, and X_i is a functional random covariate valued in a semi-metric space \mathcal{H}. In this equation, \mathbf{\beta}_0 = (\beta_{01}, \dots, \beta_{0p_n})^{\top} and m(\cdot) represent a vector of unknown real parameters and an unknown smooth real-valued function, respectively. Additionally, \varepsilon_i is the random error.

In this function, the SFPLM is fitted using a penalised least-squares approach. The approach involves transforming the SFPLM into a linear model by extracting from Y_i and Z_{ij} (j = 1, \ldots, p_n) the effect of the functional covariate X_i using functional nonparametric regression (for details, see Ferraty and Vieu, 2006). This transformation is achieved using kNN estimation with Nadaraya-Watson weights.
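
The transformation described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the semi-metric is replaced by a plain L2 distance between discretised curves, each observation counts as its own nearest neighbour, and the local bandwidth is taken as the midpoint between the k-th and (k+1)-th smallest distances. The helper `knn_nw_weights` and all data are hypothetical.

```r
set.seed(1)
n <- 30; p <- 20

# Toy curves: one observation per row, one grid point per column.
grid <- seq(0, pi, length.out = p)
x <- t(sapply(seq_len(n), function(i) sin(grid * runif(1, 1, 2))))

# Assumed semi-metric: Euclidean distance between discretised curves.
D <- as.matrix(dist(x))

# Hypothetical helper: n x n matrix of kNN Nadaraya-Watson weights.
knn_nw_weights <- function(D, k) {
  n <- nrow(D)
  W <- matrix(0, n, n)
  for (i in seq_len(n)) {
    d  <- D[i, ]
    ds <- sort(d)
    # Local bandwidth between the k-th and (k+1)-th smallest distances,
    # so that exactly k observations receive a positive weight.
    h <- 0.5 * (ds[k] + ds[k + 1])
    u <- d / h
    K <- ifelse(u <= 1, 1 - u^2, 0)  # quadratic kernel supported on [0, 1]
    W[i, ] <- K / sum(K)             # normalise: each row sums to 1
  }
  W
}

W <- knn_nw_weights(D, k = 5)

# Remove the estimated effect of the curves from the response and covariates.
z <- matrix(rnorm(n * 3), n, 3)
y <- rnorm(n)
y_tilde <- drop(y - W %*% y)   # (I - W) Y
z_tilde <- z - W %*% z         # (I - W) Z
```

Because the bandwidth adapts to each curve, every row of `W` has the same number of non-zero weights, which is what distinguishes kNN smoothing from a fixed global bandwidth.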

An approximate linear model is then obtained:

\widetilde{\mathbf{Y}}\approx\widetilde{\mathbf{Z}}\mathbf{\beta}_0+\mathbf{\varepsilon},

and the penalised least-squares procedure is applied to this model by minimising

\mathcal{Q}\left(\mathbf{\beta}\right)=\frac{1}{2}\left(\widetilde{\mathbf{Y}}-\widetilde{\mathbf{Z}}\mathbf{\beta}\right)^{\top}\left(\widetilde{\mathbf{Y}}-\widetilde{\mathbf{Z}}\mathbf{\beta}\right)+n\sum_{j=1}^{p_n}\mathcal{P}_{\lambda_{j_n}}\left(|\beta_j|\right), \quad (1)

where \mathbf{\beta} = (\beta_1, \ldots, \beta_{p_n})^{\top}, \ \mathcal{P}_{\lambda_{j_n}}(\cdot) is a penalty function (specified in the argument penalty) and \lambda_{j_n} > 0 is a tuning parameter. To reduce the number of tuning parameters \lambda_{j_n} to be selected for each sample, we consider \lambda_{j_n} = \lambda \widehat{\sigma}_{\beta_{0,j,OLS}}, where \beta_{0,j,OLS} denotes the OLS estimate of \beta_{0j} and \widehat{\sigma}_{\beta_{0,j,OLS}} is its estimated standard deviation. Both \lambda and k (in the kNN estimation) are selected using the objective criterion specified in the argument criterion.
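
To make criterion (1) concrete, the following sketch evaluates it for a SCAD-type penalty on simulated data. It is an illustration only: the data stand in for the transformed variables, the SCAD constant `a = 3.7` is an assumption (the package may use a different value), and the names `scad`, `Q`, `y_t` and `z_t` are hypothetical.

```r
set.seed(3)
n <- 50; pn <- 4

# Simulated stand-ins for the transformed response and covariates.
z_t <- matrix(rnorm(n * pn), n, pn)
y_t <- drop(z_t %*% c(2, 0, 0, -1)) + rnorm(n)

# SCAD penalty in its usual piecewise form (constant a assumed to be 3.7).
scad <- function(t, lambda, a = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda, lambda * t,
    ifelse(t <= a * lambda,
      (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
      lambda^2 * (a + 1) / 2))
}

# One global lambda, rescaled by the estimated standard deviation
# of each OLS coefficient.
lambda   <- 0.1
ols      <- lm(y_t ~ z_t - 1)
se       <- unname(summary(ols)$coefficients[, "Std. Error"])
lambda_j <- lambda * se

# Criterion (1): half the residual sum of squares plus n times the penalty.
Q <- function(beta) {
  0.5 * sum((y_t - z_t %*% beta)^2) + n * sum(scad(beta, lambda_j))
}

Q(coef(ols))   # criterion at the OLS estimate
Q(rep(0, pn))  # criterion at the null vector
```

Rescaling by the standard deviations means only the single value `lambda` has to be tuned, rather than one parameter per covariate.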

Finally, after estimating \mathbf{\beta}_0 by minimising (1), we address the estimation of the nonlinear function m(\cdot). For this, we again employ the kNN procedure with Nadaraya-Watson weights to smooth the partial residuals Y_i - \mathbf{Z}_i^{\top}\widehat{\mathbf{\beta}}.
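
The final smoothing step can be sketched as below. For brevity the penalised estimator is replaced by an unpenalised least-squares fit on the transformed variables, and the semi-metric by an L2 distance; both are assumptions made for illustration, and all data and object names are hypothetical.

```r
set.seed(2)
n <- 40; p <- 15; k <- 6

# Toy curves driven by a scalar a_i; the nonlinear effect is m(X_i) = a_i^2.
grid <- seq(0, 1, length.out = p)
a <- runif(n, 1, 3)
x <- outer(a, grid, function(a, t) sin(a * pi * t))
z <- matrix(rnorm(n * 2), n, 2)
y <- drop(z %*% c(1.5, -1)) + a^2 + rnorm(n, sd = 0.1)

# kNN Nadaraya-Watson weights (L2 distance, quadratic kernel; assumptions).
D <- as.matrix(dist(x))
W <- t(apply(D, 1, function(d) {
  ds <- sort(d)
  h  <- 0.5 * (ds[k] + ds[k + 1])
  K  <- pmax(1 - (d / h)^2, 0)
  K / sum(K)
}))

# Stand-in for the penalised step: least squares on the transformed model.
y_tilde  <- drop(y - W %*% y)
z_tilde  <- z - W %*% z
beta_hat <- lm.fit(z_tilde, y_tilde)$coefficients

# Smooth the partial residuals to estimate m at the observed curves.
partial_res <- drop(y - z %*% beta_hat)
m_hat  <- drop(W %*% partial_res)
fitted <- drop(z %*% beta_hat) + m_hat
```

The same weight matrix is used twice: first to remove the functional effect before estimating the linear part, and then to recover the nonparametric component from the partial residuals.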

For further details on the estimation procedure of the sparse SFPLM, see Aneiros et al. (2015).

Remark: Setting lambda.seq to 0 yields the non-penalised estimation of the model, i.e. the OLS estimation. Using lambda.seq with a value \neq 0 is advisable when the presence of irrelevant variables is suspected.

Value

`call`	The matched call.
`fitted.values`	Estimated scalar response.
`residuals`	Differences between `y` and the `fitted.values`.
`beta.est`	Estimate of `\beta_0` when the optimal tuning parameters `lambda.opt`, `k.opt` and `vn.opt` are used.
`indexes.beta.nonnull`	Indexes of the non-zero `\hat{\beta}_{j}`.
`k.opt`	Selected number of nearest neighbours.
`lambda.opt`	Selected value of lambda.
`IC`	Value of the criterion function considered to select `lambda.opt`, `k.opt` and `vn.opt`.
`vn.opt`	Selected value of `vn`.
`...`

Author(s)

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

References

Aneiros, G., Ferraty, F., Vieu, P. (2015) Variable selection in partial linear regression with functional covariate. Statistics, 49, 1322–1347, doi:10.1080/02331888.2014.998675.

Examples

library(fsemipar)
data("Tecator")
y<-Tecator$fat
X<-Tecator$absor.spectra
z1<-Tecator$protein       
z2<-Tecator$moisture

#Quadratic, cubic and interaction effects of the scalar covariates.
z.com<-cbind(z1,z2,z1^2,z2^2,z1^3,z2^3,z1*z2)
train<-1:160

#SFPLM fit. 
ptm<-proc.time()
fit<-sfpl.kNN.fit(y=y[train],x=X[train,], z=z.com[train,],q=2, max.knn=20,
  lambda.min.l=0.01, criterion="BIC",
  range.grid=c(850,1050), nknot=20, max.iter=5000)
proc.time()-ptm

#Results
fit
names(fit)

[Package fsemipar version 1.1.1 Index]