sfpl.kNN.fit {fsemipar}        R Documentation

SFPLM regularised fit using kNN estimation

Description

This function fits a sparse semi-functional partial linear model (SFPLM). It employs a penalised least-squares regularisation procedure, integrated with nonparametric kNN estimation using Nadaraya-Watson weights.

The procedure uses an objective criterion (criterion) to select both the number of nearest neighbours (k.opt) and the regularisation parameter (lambda.opt).

Usage

sfpl.kNN.fit(x, z, y, semimetric = "deriv", q = NULL, knearest = NULL,
min.knn = 2, max.knn = NULL, step = NULL, range.grid = NULL, 
kind.of.kernel = "quad", nknot = NULL, lambda.min = NULL, lambda.min.h = NULL, 
lambda.min.l = NULL, factor.pn = 1, nlambda = 100, lambda.seq = NULL, 
vn = ncol(z), nfolds = 10, seed = 123, criterion = "GCV", penalty = "grSCAD", 
max.iter = 1000)

Arguments

x

Matrix containing the observations of the functional covariate (functional nonparametric component), collected by row.

z

Matrix containing the observations of the scalar covariates (linear component), collected by row.

y

Vector containing the scalar response.

semimetric

Semi-metric function. Only "deriv" and "pca" are implemented. By default semimetric="deriv".

q

Order of the derivative (if semimetric="deriv") or number of principal components (if semimetric="pca"). The default values are 0 and 2, respectively.

knearest

Vector of positive integers containing the candidate values from which the number of nearest neighbours k.opt is selected. If knearest=NULL, then knearest <- seq(from = min.knn, to = max.knn, by = step).

min.knn

A positive integer that represents the minimum value in the sequence for selecting the number of nearest neighbours k.opt. This value should be less than the sample size. The default is 2.

max.knn

A positive integer that represents the maximum value in the sequence for selecting the number of nearest neighbours k.opt. This value should be less than the sample size. The default is max.knn <- n%/%5.

step

A positive integer used to construct the sequence of k-nearest neighbours as follows: min.knn, min.knn + step, min.knn + 2*step, min.knn + 3*step,.... The default value for step is step<-ceiling(n/100).

range.grid

Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate x are evaluated (i.e. the range of the discretisation). If range.grid=NULL, then range.grid=c(1,p) is considered, where p is the discretisation size of x (i.e. ncol(x)).

kind.of.kernel

The type of kernel function used. Currently, only the Epanechnikov kernel ("quad") is available.

nknot

Positive integer indicating the number of interior knots for the B-spline expansion of the functional covariate. The default value is (p - order.Bspline - 1)%/%2.

lambda.min

The smallest value for lambda (i.e. the lower endpoint of the sequence in which lambda.opt is selected), as a fraction of lambda.max. The default is lambda.min.l if the sample size is larger than factor.pn times the number of linear covariates, and lambda.min.h otherwise.

lambda.min.h

The lower endpoint of the sequence in which lambda.opt is selected if the sample size is smaller than factor.pn times the number of linear covariates. The default is 0.05.

lambda.min.l

The lower endpoint of the sequence in which lambda.opt is selected if the sample size is larger than factor.pn times the number of linear covariates. The default is 0.0001.

factor.pn

Positive integer used to set lambda.min. The default value is 1.

nlambda

Positive integer indicating the number of values in the sequence from which lambda.opt is selected. The default is 100.

lambda.seq

Sequence of values in which lambda.opt is selected. If lambda.seq=NULL, then the programme builds the sequence automatically using lambda.min and nlambda.

vn

Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is vn=ncol(z), which penalises each scalar covariate individually.

nfolds

Number of cross-validation folds (used when criterion="k-fold-CV"). Default is 10.

seed

Seed for the random number generator, used to ensure reproducible results (applicable when criterion="k-fold-CV"). The default seed value is 123.

criterion

The criterion used to select the tuning and regularisation parameters k.opt and lambda.opt (and also vn.opt, if needed). Options are "GCV", "BIC", "AIC" and "k-fold-CV". The default setting is "GCV".

penalty

The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".

max.iter

Maximum number of iterations allowed across the entire path. The default value is 1000.
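
As a rough illustration of how several of the defaults above resolve, the following sketch recomputes them in base R for an illustrative sample size n and number of scalar covariates pn (both values are made up). The knearest grid and the lambda.min rule follow the argument descriptions directly. The log-spaced form of lambda.seq, the placeholder lambda.max and the equal-sized grouping for vn are assumptions made only for illustration and may differ from what the package does internally.

```r
n  <- 160   # illustrative sample size
pn <- 7     # illustrative number of scalar covariates, i.e. ncol(z)

# Documented defaults for the grid of candidate numbers of neighbours.
min.knn  <- 2
max.knn  <- n %/% 5            # integer division
step     <- ceiling(n / 100)
knearest <- seq(from = min.knn, to = max.knn, by = step)

# Documented rule for the default lambda.min.
factor.pn    <- 1
lambda.min.h <- 0.05
lambda.min.l <- 0.0001
lambda.min   <- if (n > factor.pn * pn) lambda.min.l else lambda.min.h

# ASSUMPTION: a log-spaced grid running from lambda.max down to
# lambda.min * lambda.max; lambda.max is data dependent, so 1 is a placeholder.
nlambda    <- 100
lambda.max <- 1
lambda.seq <- exp(seq(log(lambda.max), log(lambda.min * lambda.max),
                      length.out = nlambda))

# ASSUMPTION: vn groups of consecutive covariates of nearly equal size.
vn    <- 3
group <- ceiling(seq_len(pn) * vn / pn)

knearest
lambda.min
range(lambda.seq)
group
```

With vn equal to pn the same grouping rule assigns every covariate to its own group, which corresponds to the documented default of penalising each scalar covariate individually.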

Details

The sparse semi-functional partial linear model (SFPLM) is given by the expression:

Y_i = Z_{i1}\beta_{01} + \dots + Z_{ip_n}\beta_{0p_n} + m(X_i) + \varepsilon_i,\ \ \ i = 1, \dots, n,

where Y_i denotes a scalar response, Z_{i1}, \dots, Z_{ip_n} are real random covariates, and X_i is a functional random covariate valued in a semi-metric space \mathcal{H}. In this equation, \mathbf{\beta}_0 = (\beta_{01}, \dots, \beta_{0p_n})^{\top} and m(\cdot) represent a vector of unknown real parameters and an unknown smooth real-valued function, respectively. Additionally, \varepsilon_i is the random error.

In this function, the SFPLM is fitted using a penalised least-squares approach. The approach involves transforming the SFPLM into a linear model by extracting from Y_i and Z_{ij} (j = 1, \ldots, p_n) the effect of the functional covariate X_i using functional nonparametric regression (for details, see Ferraty and Vieu, 2006). This transformation is achieved using kNN estimation with Nadaraya-Watson weights.

An approximate linear model is then obtained:

\widetilde{\mathbf{Y}}\approx\widetilde{\mathbf{Z}}\mathbf{\beta}_0+\mathbf{\varepsilon},

and the penalised least-squares procedure is applied to this model by minimising

\mathcal{Q}\left(\mathbf{\beta}\right)=\frac{1}{2}\left(\widetilde{\mathbf{Y}}-\widetilde{\mathbf{Z}}\mathbf{\beta}\right)^{\top}\left(\widetilde{\mathbf{Y}}-\widetilde{\mathbf{Z}}\mathbf{\beta}\right)+n\sum_{j=1}^{p_n}\mathcal{P}_{\lambda_{j_n}}\left(|\beta_j|\right), \quad (1)

where \mathbf{\beta} = (\beta_1, \ldots, \beta_{p_n})^{\top}, \mathcal{P}_{\lambda_{j_n}}(\cdot) is a penalty function (specified in the argument penalty) and \lambda_{j_n} > 0 is a tuning parameter. To reduce the number of tuning parameters \lambda_{j_n} to be selected for each sample, we consider \lambda_{j_n} = \lambda \widehat{\sigma}_{\beta_{0,j,OLS}}, where \beta_{0,j,OLS} denotes the OLS estimate of \beta_{0j} and \widehat{\sigma}_{\beta_{0,j,OLS}} is its estimated standard deviation. Both \lambda and k (in the kNN estimation) are selected using the objective criterion specified in the argument criterion.
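
To make the criterion (1) concrete, the sketch below evaluates it in base R on simulated data. It is not the package's implementation: it uses the ordinary (non-grouped) SCAD penalty with shape parameter a = 3.7, a common choice that may differ from the value used internally, and the names scad_penalty, penalised_objective, y_tilde and z_tilde are hypothetical. The per-coefficient tuning parameters are taken as lambda times the OLS standard errors, as described above.

```r
# Ordinary SCAD penalty; a = 3.7 is an illustrative choice (assumption).
scad_penalty <- function(t, lambda, a = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda,
         lambda * t,
         ifelse(t <= a * lambda,
                (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
                lambda^2 * (a + 1) / 2))
}

# Criterion (1): half the residual sum of squares plus n times the summed penalties.
penalised_objective <- function(beta, y_tilde, z_tilde, lambda_j) {
  r <- y_tilde - drop(z_tilde %*% beta)
  n <- length(y_tilde)
  0.5 * sum(r^2) + n * sum(scad_penalty(beta, lambda_j))
}

# Simulated stand-ins for the transformed response and covariates.
set.seed(1)
n <- 40
z_tilde <- matrix(rnorm(n * 3), n, 3)
y_tilde <- drop(z_tilde %*% c(2, 0, 0) + rnorm(n, sd = 0.1))

# lambda_j = lambda * estimated standard error of the j-th OLS coefficient.
ols      <- lm(y_tilde ~ z_tilde - 1)
se       <- summary(ols)$coefficients[, "Std. Error"]
lambda   <- 0.5
lambda_j <- lambda * se

# Compare the criterion at the OLS estimate and at a sparse candidate.
penalised_objective(coef(ols), y_tilde, z_tilde, lambda_j)
penalised_objective(c(2, 0, 0), y_tilde, z_tilde, lambda_j)
```

Scaling lambda by the standard errors leaves a single scalar lambda to tune, so the search is over a grid of (k, lambda) pairs rather than over p_n separate penalties.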

Finally, after estimating \mathbf{\beta}_0 by minimising (1), we address the estimation of the nonlinear function m(\cdot). For this, we again employ the kNN procedure with Nadaraya-Watson weights to smooth the partial residuals Y_i - \mathbf{Z}_i^{\top}\widehat{\mathbf{\beta}}.
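
The three steps described above (removing the effect of X_i, estimating \mathbf{\beta}_0, and smoothing the partial residuals) can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's code: the Euclidean distance between discretised curves stands in for the "deriv" or "pca" semi-metric, the local bandwidth is taken as the distance to the k-th nearest other curve, no leave-one-out correction is applied, and \mathbf{\beta}_0 is estimated by unpenalised least squares. The helper names quad_kernel and knn_weights are hypothetical.

```r
set.seed(1)
n <- 60; p <- 30
grid <- seq(0, 1, length.out = p)
freq <- runif(n, 0.5, 2)
x <- outer(freq, grid, function(f, t) sin(2 * pi * f * t))  # n curves, one per row
z <- matrix(rnorm(n * 3), n, 3)
y <- drop(z %*% c(1.5, 0, -1)) + rowMeans(x^2) + rnorm(n, sd = 0.1)

# ASSUMPTION: Euclidean distance as a stand-in semi-metric.
d <- as.matrix(dist(x))

# Quadratic kernel supported on [0, 1).
quad_kernel <- function(u) ifelse(u >= 0 & u < 1, 1 - u^2, 0)

# Row-normalised kNN weights of Nadaraya-Watson type.
knn_weights <- function(d, k) {
  t(apply(d, 1, function(di) {
    h  <- sort(di)[k + 1]   # position 1 is the curve itself (distance 0)
    kw <- quad_kernel(di / h)
    kw / sum(kw)
  }))
}

W <- knn_weights(d, k = 10)

# Step 1: remove the effect of the functional covariate from y and z.
y_tilde <- y - drop(W %*% y)
z_tilde <- z - W %*% z

# Step 2: estimate beta; plain least squares here (the unpenalised case).
beta_hat <- lm.fit(z_tilde, y_tilde)$coefficients

# Step 3: smooth the partial residuals to estimate m at the observed curves.
m_hat  <- drop(W %*% (y - drop(z %*% beta_hat)))
fitted <- drop(z %*% beta_hat) + m_hat

beta_hat
head(cbind(y, fitted))
```

In the actual procedure, step 2 minimises the penalised criterion (1) and k is chosen by the selected criterion rather than fixed in advance.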

For further details on the estimation procedure of the sparse SFPLM, see Aneiros et al. (2015).

Remark: Setting lambda.seq to 0 yields the non-penalised estimation of the model, i.e. the OLS estimation. Using lambda.seq with a value \neq 0 is advisable when the presence of irrelevant variables is suspected.

Value

call

The matched call.

fitted.values

Estimated scalar response.

residuals

Differences between y and the fitted.values.

beta.est

Estimate of \beta_0 when the optimal tuning parameters lambda.opt, k.opt and vn.opt are used.

indexes.beta.nonnull

Indexes of the non-zero \hat{\beta}_j.

k.opt

Selected number of nearest neighbours.

lambda.opt

Selected value of lambda.

IC

Value of the criterion function considered to select lambda.opt, k.opt and vn.opt.

vn.opt

Selected value of vn.

...

Author(s)

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

References

Aneiros, G., Ferraty, F., Vieu, P. (2015) Variable selection in partial linear regression with functional covariate. Statistics, 49, 1322–1347, doi:10.1080/02331888.2014.998675.

See Also

See also predict.sfpl.kNN and plot.sfpl.kNN.

Alternative method sfpl.kernel.fit.

Examples

data("Tecator")
y<-Tecator$fat
X<-Tecator$absor.spectra
z1<-Tecator$protein       
z2<-Tecator$moisture

#Quadratic, cubic and interaction effects of the scalar covariates.
z.com<-cbind(z1,z2,z1^2,z2^2,z1^3,z2^3,z1*z2)
train<-1:160

#SFPLM fit (timed with proc.time()).
ptm <- proc.time()
fit <- sfpl.kNN.fit(y=y[train], x=X[train,], z=z.com[train,], q=2, max.knn=20,
  lambda.min.l=0.01, criterion="BIC",
  range.grid=c(850,1050), nknot=20, max.iter=5000)
proc.time() - ptm

#Results
fit
names(fit)

[Package fsemipar version 1.1.1 Index]