sfplsim.kNN.fit {fsemipar}    R Documentation

SFPLSIM regularised fit using kNN estimation

Description

This function fits a sparse semi-functional partial linear single-index model (SFPLSIM). It employs a penalised least-squares regularisation procedure, integrated with nonparametric kNN estimation using Nadaraya-Watson weights.

The function uses B-spline expansions to represent curves and eligible functional indexes. It also utilises an objective criterion (criterion) to select both the number of neighbours (k.opt) and the regularisation parameter (lambda.opt).

Usage

sfplsim.kNN.fit(x, z, y, seed.coeff = c(-1, 0, 1), order.Bspline = 3, 
nknot.theta = 3, knearest = NULL, min.knn = 2, max.knn = NULL, step = NULL,
range.grid = NULL, kind.of.kernel = "quad", nknot = NULL, lambda.min = NULL,
lambda.min.h = NULL, lambda.min.l = NULL, factor.pn = 1, nlambda = 100, 
lambda.seq = NULL, vn = ncol(z), nfolds = 10, seed = 123, criterion = "GCV",
penalty = "grSCAD", max.iter = 1000, n.core = NULL)

Arguments

x

Matrix containing the observations of the functional covariate (functional single-index component), collected by row.

z

Matrix containing the observations of the scalar covariates (linear component), collected by row.

y

Vector containing the scalar response.

seed.coeff

Vector of initial values used to build the set $\Theta_n$ (see section Details). The coefficients for the B-spline representation of each eligible functional index $\theta \in \Theta_n$ are obtained from seed.coeff. The default is c(-1,0,1).

order.Bspline

Positive integer giving the order of the B-spline basis functions. This is the number of coefficients in each piecewise polynomial segment. The default is 3.

nknot.theta

Positive integer indicating the number of regularly spaced interior knots in the B-spline expansion of $\theta_0$. The default is 3.

knearest

Vector of positive integers containing the sequence from which the number of nearest neighbours k.opt is selected. If knearest=NULL, then knearest <- seq(from = min.knn, to = max.knn, by = step).

min.knn

A positive integer that represents the minimum value in the sequence for selecting the number of nearest neighbours k.opt. This value should be less than the sample size. The default is 2.

max.knn

A positive integer that represents the maximum value in the sequence for selecting the number of nearest neighbours k.opt. This value should be less than the sample size. The default is max.knn <- n%/%5.

step

A positive integer used to construct the sequence of k-nearest neighbours as follows: min.knn, min.knn + step, min.knn + 2*step, min.knn + 3*step,.... The default value for step is step<-ceiling(n/100).
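As a rough illustration of the defaults described for knearest, min.knn, max.knn and step, the following base R sketch builds the candidate sequence for a hypothetical sample size n = 160 (the value of n is an assumption chosen for the example).

```r
# Hypothetical sample size, chosen only for illustration
n <- 160

min.knn <- 2                 # documented default
max.knn <- n %/% 5           # documented default: integer division of n by 5
step    <- ceiling(n / 100)  # documented default

# Candidate numbers of neighbours, as described for knearest = NULL
knearest <- seq(from = min.knn, to = max.knn, by = step)
knearest
```

With these values the grid runs from 2 to 32 in steps of 2, so k.opt would be chosen among 16 candidates.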

range.grid

Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate x are evaluated (i.e. the range of the discretisation). If range.grid=NULL, then range.grid=c(1,p) is considered, where p is the discretisation size of x (i.e. ncol(x)).

kind.of.kernel

The type of kernel function used. Currently, only the Epanechnikov kernel ("quad") is available.

nknot

Positive integer indicating the number of interior knots for the B-spline expansion of the functional covariate. The default value is (p - order.Bspline - 1)%/%2.
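The defaults for range.grid and nknot can be sketched in base R as follows. The discretisation size p = 100 is a hypothetical value, and the equally spaced grid is an assumption about how the endpoints in range.grid are used.

```r
# Hypothetical discretisation size (number of columns of x)
p <- 100
order.Bspline <- 3

# Default range when range.grid = NULL
range.grid <- c(1, p)

# Assumed equally spaced discretisation grid between the two endpoints
grid <- seq(range.grid[1], range.grid[2], length.out = p)

# Documented default number of interior knots for the covariate expansion
nknot <- (p - order.Bspline - 1) %/% 2
nknot
```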

lambda.min

The smallest value for lambda (i.e., the lower endpoint of the sequence in which lambda.opt is selected), as a fraction of lambda.max. The default is lambda.min.l if the sample size is larger than factor.pn times the number of linear covariates and lambda.min.h otherwise.

lambda.min.h

The lower endpoint of the sequence in which lambda.opt is selected if the sample size is smaller than factor.pn times the number of linear covariates. The default is 0.05.

lambda.min.l

The lower endpoint of the sequence in which lambda.opt is selected if the sample size is larger than factor.pn times the number of linear covariates. The default is 0.0001.

factor.pn

Positive integer used to set lambda.min. The default value is 1.
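The rule that links lambda.min, lambda.min.h, lambda.min.l and factor.pn can be written as a single comparison. The sketch below uses hypothetical values for the sample size n and the number of linear covariates pn; it only restates the rule given above and is not taken from the package source.

```r
# Hypothetical problem size
n  <- 160   # sample size
pn <- 7     # number of linear covariates, ncol(z)

# Documented defaults
factor.pn    <- 1
lambda.min.h <- 0.05
lambda.min.l <- 0.0001

# lambda.min.l when n is larger than factor.pn * pn, lambda.min.h otherwise
lambda.min <- if (n > factor.pn * pn) lambda.min.l else lambda.min.h
lambda.min
```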

nlambda

Positive integer indicating the number of values in the sequence from which lambda.opt is selected. The default is 100.

lambda.seq

Sequence of values in which lambda.opt is selected. If lambda.seq=NULL, then the programme builds the sequence automatically using lambda.min and nlambda.
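This page does not state how the automatic sequence is spaced. A common choice in penalised regression software is a grid that decreases from lambda.max to lambda.min * lambda.max on the log scale; the sketch below assumes that construction and a hypothetical value of lambda.max.

```r
# Assumed inputs: lambda.max is hypothetical, the others are documented defaults
lambda.max <- 1
lambda.min <- 0.0001   # fraction of lambda.max
nlambda    <- 100

# Assumed log-spaced, decreasing grid of candidate values for lambda
lambda.seq <- exp(seq(log(lambda.max),
                      log(lambda.min * lambda.max),
                      length.out = nlambda))
range(lambda.seq)
```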

vn

Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is vn=ncol(z), resulting in the individual penalisation of each scalar covariate.
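One way to picture the grouping controlled by vn is as a vector of group labels over the columns of z. The sketch below assumes that the number of covariates is an exact multiple of vn and that groups have equal size; how the package treats other cases is not stated here.

```r
# Hypothetical number of scalar covariates
pn <- 6

# vn = 3: three groups of two consecutive covariates (assumed equal-sized groups)
vn <- 3
group <- rep(seq_len(vn), each = pn / vn)
group

# Default vn = ncol(z): every covariate forms its own group
group.default <- seq_len(pn)
group.default
```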

nfolds

Number of cross-validation folds (used when criterion="k-fold-CV"). Default is 10.

seed

You may set the seed for the random number generator to ensure reproducible results (applicable when criterion="k-fold-CV" is used). The default seed value is 123.
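The role of seed and nfolds under criterion="k-fold-CV" can be illustrated with a generic fold assignment. This is a minimal sketch of the usual technique, assuming a random balanced partition; the package's own partitioning code may differ.

```r
# Hypothetical sample size; nfolds and seed are the documented defaults
n      <- 160
nfolds <- 10
seed   <- 123

# Fixing the seed makes the random fold assignment reproducible
set.seed(seed)
fold.id <- sample(rep(seq_len(nfolds), length.out = n))

# Number of observations per fold
table(fold.id)
```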

criterion

The criterion used to select the tuning and regularisation parameters: k.opt and lambda.opt (also vn.opt if needed). Options include "GCV", "BIC", "AIC", or "k-fold-CV". The default setting is "GCV".

penalty

The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".

max.iter

Maximum number of iterations allowed across the entire path. The default value is 1000.

n.core

Number of CPU cores designated for parallel execution. The default is n.core<-availableCores(omit=1).

Details

The sparse semi-functional partial linear single-index model (SFPLSIM) is given by the expression:

$$Y_i=Z_{i1}\beta_{01}+\dots+Z_{ip_n}\beta_{0p_n}+r(\left<\theta_0,X_i\right>)+\varepsilon_i,\quad i=1,\dots,n,$$

where $Y_i$ denotes a scalar response, $Z_{i1},\dots,Z_{ip_n}$ are real random covariates and $X_i$ is a functional random covariate valued in a separable Hilbert space $\mathcal{H}$ with inner product $\left\langle \cdot, \cdot \right\rangle$. In this equation, $\mathbf{\beta}_0=(\beta_{01},\dots,\beta_{0p_n})^{\top}$, $\theta_0\in\mathcal{H}$ and $r(\cdot)$ are a vector of unknown real parameters, an unknown functional direction and an unknown smooth real-valued function, respectively. In addition, $\varepsilon_i$ is the random error.
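To make the roles of the model components concrete, the following base R sketch simulates data from a model of this form. Every specific choice (sample size, curves, $\theta_0$, link function, coefficients, noise level) is hypothetical, and the inner product is approximated by an average over an equally spaced grid on [0, 1].

```r
set.seed(1)
n  <- 50    # sample size
p  <- 100   # discretisation size of the curves
pn <- 5     # number of scalar covariates

tgrid <- seq(0, 1, length.out = p)

# Functional covariates X_i, stored by row (hypothetical random curves)
a <- runif(n, 0, 2)
b <- runif(n, 0, 2)
X <- outer(a, sin(2 * pi * tgrid)) + outer(b, tgrid^2)

# Scalar covariates and a sparse coefficient vector beta_0
Z     <- matrix(rnorm(n * pn), n, pn)
beta0 <- c(2, -1, 0, 0, 0)

# Functional direction theta_0 and projections <theta_0, X_i>
theta0 <- sqrt(2) * sin(pi * tgrid)
index  <- as.vector(X %*% theta0) / p   # grid average approximating the integral

# Smooth link r and scalar response
r <- function(u) u^3
y <- as.vector(Z %*% beta0) + r(index) + rnorm(n, sd = 0.1)
```

Only the first two scalar covariates influence y here, which is the kind of sparsity the penalised procedure is meant to detect.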

The sparse SFPLSIM is fitted using the penalised least-squares approach. The first step is to transform the SFPLSIM into a linear model by extracting from $Y_i$ and $Z_{ij}$ ($j=1,\ldots,p_n$) the effect of the functional covariate $X_i$ using functional single-index regression. This transformation is achieved using nonparametric kNN estimation (for details, see the documentation of the function fsim.kNN.fit).

An approximate linear model is then obtained:

$$\widetilde{\mathbf{Y}}_{\theta_0}\approx\widetilde{\mathbf{Z}}_{\theta_0}\mathbf{\beta}_0+\mathbf{\varepsilon},$$

and the penalised least-squares procedure is applied to this model by minimising over the pair $(\mathbf{\beta},\theta)$

$$\mathcal{Q}\left(\mathbf{\beta},\theta\right)=\frac{1}{2}\left(\widetilde{\mathbf{Y}}_{\theta}-\widetilde{\mathbf{Z}}_{\theta}\mathbf{\beta}\right)^{\top}\left(\widetilde{\mathbf{Y}}_{\theta}-\widetilde{\mathbf{Z}}_{\theta}\mathbf{\beta}\right)+n\sum_{j=1}^{p_n}\mathcal{P}_{\lambda_{j_n}}\left(|\beta_j|\right), \quad (1)$$

where $\mathbf{\beta}=(\beta_1,\ldots,\beta_{p_n})^{\top}$, $\mathcal{P}_{\lambda_{j_n}}\left(\cdot\right)$ is a penalty function (specified in the argument penalty) and $\lambda_{j_n} > 0$ is a tuning parameter. To reduce the number of tuning parameters $\lambda_j$ to be selected for each sample, we consider $\lambda_j = \lambda \widehat{\sigma}_{\beta_{0,j,OLS}}$, where $\beta_{0,j,OLS}$ denotes the OLS estimate of $\beta_{0,j}$ and $\widehat{\sigma}_{\beta_{0,j,OLS}}$ is its estimated standard deviation. Both $\lambda$ and $k$ (in the kNN estimation) are selected using the objective criterion specified in the argument criterion.
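For a fixed $\theta$, criterion (1) is a residual sum of squares plus a penalty on each coefficient. The sketch below evaluates it in base R with the standard SCAD penalty, for the case in which every covariate is penalised individually. The shape constant a = 3.7, the toy data and the values in lambda.j are assumptions made for illustration; the package's group-penalty implementation is not reproduced here.

```r
# SCAD penalty in its usual piecewise form; a = 3.7 is an assumed shape constant
scad <- function(t, lambda, a = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda,
         lambda * t,
         ifelse(t <= a * lambda,
                (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
                lambda^2 * (a + 1) / 2))
}

# Criterion (1) for a fixed theta, given already-transformed data
Q <- function(beta, Y.tilde, Z.tilde, lambda.j) {
  n   <- length(Y.tilde)
  res <- Y.tilde - as.vector(Z.tilde %*% beta)
  0.5 * sum(res^2) + n * sum(scad(beta, lambda.j))
}

# Toy transformed data (hypothetical): only the first covariate is relevant
set.seed(1)
Z.tilde  <- matrix(rnorm(40), 20, 2)
Y.tilde  <- as.vector(Z.tilde %*% c(1, 0)) + rnorm(20, sd = 0.1)
lambda.j <- c(0.1, 0.1)

Q(c(1, 0), Y.tilde, Z.tilde, lambda.j)   # criterion at the true coefficients
Q(c(0, 0), Y.tilde, Z.tilde, lambda.j)   # criterion at the null vector
```

The penalty is flat beyond a * lambda, so large coefficients are not shrunk further, while small ones are pushed towards zero.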

In addition, the function uses a B-spline representation to construct a set $\Theta_n$ of eligible functional indexes $\theta$. The dimension of the B-spline basis is order.Bspline+nknot.theta and the set of eligible coefficients is obtained by calibrating (to ensure the identifiability of the model) the set of initial coefficients given in seed.coeff. The larger this set, the greater the size of $\Theta_n$. Due to the intensive computation required by our approach, a balance between the size of $\Theta_n$ and the performance of the estimator is necessary. For that, Ait-Saidi et al. (2008) suggested considering order.Bspline=3 and seed.coeff=c(-1,0,1). For details on the construction of $\Theta_n$, see Novo et al. (2019).
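The basis dimension and the size of the initial coefficient set can be checked with the splines package that ships with R. The sketch below uses a hypothetical grid on [0, 1] and the default arguments; it enumerates only the initial coefficient vectors built from seed.coeff, before the calibration step that ensures identifiability, so the actual $\Theta_n$ is smaller.

```r
library(splines)

order.Bspline <- 3
nknot.theta   <- 3
seed.coeff    <- c(-1, 0, 1)

# Hypothetical evaluation grid and regularly spaced interior knots
grid  <- seq(0, 1, length.out = 100)
knots <- seq(0, 1, length.out = nknot.theta + 2)[-c(1, nknot.theta + 2)]

# B-spline basis of order 3 (degree 2) with an intercept column
B <- bs(grid, knots = knots, degree = order.Bspline - 1, intercept = TRUE)
ncol(B)       # order.Bspline + nknot.theta

# All initial coefficient vectors with entries taken from seed.coeff
seeds <- expand.grid(rep(list(seed.coeff), ncol(B)))
nrow(seeds)   # length(seed.coeff)^ncol(B)
```

Because the number of candidates grows exponentially in the basis dimension, increasing nknot.theta or the length of seed.coeff quickly raises the computational cost.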

Finally, after estimating $\mathbf{\beta}_0$ and $\theta_0$ by minimising (1), we proceed to estimate the nonlinear function $r_{\theta_0}(\cdot)\equiv r\left(\left<\theta_0,\cdot\right>\right)$. For this purpose, we again apply the kNN procedure with Nadaraya-Watson weights to smooth the partial residuals $Y_i-\mathbf{Z}_i^{\top}\widehat{\mathbf{\beta}}$.
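The smoothing step can be sketched as a kNN estimator with Nadaraya-Watson weights acting on the projected values. In this minimal base R illustration, the quadratic kernel on [0, 1] and the local bandwidth (the midpoint between the k-th and (k+1)-th smallest distances) are assumptions, and the function names are hypothetical rather than part of the package.

```r
# Quadratic (Epanechnikov-type) kernel supported on [0, 1]
quad <- function(u) ifelse(u >= 0 & u <= 1, 1 - u^2, 0)

# kNN Nadaraya-Watson estimate at a target projection index0
#   index: projections of the sample curves on theta
#   resp : values to smooth, e.g. partial residuals
#   k    : number of neighbours (must be smaller than the sample size)
knn.nw <- function(index, resp, index0, k) {
  d  <- abs(index - index0)        # distances in the projected space
  ds <- sort(d)
  h  <- (ds[k] + ds[k + 1]) / 2    # assumed local bandwidth
  w  <- quad(d / h)                # kernel weights, zero beyond h
  sum(w * resp) / sum(w)
}

# Toy data (hypothetical)
index <- c(0, 1, 3, 6)
resp  <- c(1, 2, 3, 4)
knn.nw(index, resp, index0 = 0, k = 2)
```

The bandwidth adapts to the local density of the projections: it is small where observations are close together and large where they are sparse.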

For further details on the estimation procedure of the sparse SFPLSIM, see Novo et al. (2021).

Remark: It should be noted that if we set lambda.seq to $0$, we obtain the non-penalised estimation of the model, i.e. the OLS estimation. Using lambda.seq with a value $\neq 0$ is advisable when the presence of irrelevant variables is suspected.

Value

call

The matched call.

fitted.values

Estimated scalar response.

residuals

Differences between y and the fitted.values.

beta.est

$\hat{\mathbf{\beta}}$ (i.e. the estimate of $\mathbf{\beta}_0$ when the optimal tuning parameters lambda.opt, k.opt and vn.opt are used).

theta.est

Coefficients of $\hat{\theta}$ in the B-spline basis (when the optimal tuning parameters lambda.opt, k.opt and vn.opt are used): a vector of length order.Bspline+nknot.theta.

indexes.beta.nonnull

Indexes of the non-zero $\hat{\beta_{j}}$.

k.opt

Selected number of nearest neighbours.

lambda.opt

Selected value of the penalisation parameter $\lambda$.

IC

Value of the criterion function considered to select lambda.opt, k.opt and vn.opt.

Q.opt

Minimum value of the penalised criterion used to estimate $\mathbf{\beta}_0$ and $\theta_0$. That is, the value obtained using theta.est and beta.est.

Q

Vector of dimension equal to the cardinality of $\Theta_n$, containing the values of the penalised criterion for each functional index in $\Theta_n$.

m.opt

Index of $\hat{\theta}$ in the set $\Theta_n$.

lambda.min.opt.max.mopt

A grid of values in [lambda.min.opt.max.mopt[1], lambda.min.opt.max.mopt[3]] is considered when searching for lambda.opt (lambda.opt=lambda.min.opt.max.mopt[2]).

lambda.min.opt.max.m

A grid of values in [lambda.min.opt.max.m[m,1], lambda.min.opt.max.m[m,3]] is considered when searching for the optimal $\lambda$ (lambda.min.opt.max.m[m,2]) used by the optimal $\mathbf{\beta}$ for each $\theta$ in $\Theta_n$.

knn.min.opt.max.mopt

k.opt=knn.min.opt.max.mopt[2] (used by theta.est and beta.est) was sought between knn.min.opt.max.mopt[1] and knn.min.opt.max.mopt[3] (the step was not necessarily 1).

knn.min.opt.max.m

For each $\theta$ in $\Theta_n$, the optimal $k$ (knn.min.opt.max.m[m,2]) used by the optimal $\beta$ for this $\theta$ was sought between knn.min.opt.max.m[m,1] and knn.min.opt.max.m[m,3] (the step was not necessarily 1).

knearest

Sequence of eligible values for $k$ considered when searching for k.opt.

theta.seq.norm

The vector theta.seq.norm[j,] contains the coefficients in the B-spline basis of the jth functional index in $\Theta_n$.

vn.opt

Selected value of vn.

...

Author(s)

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

References

Ait-Saidi, A., Ferraty, F., Kassa, R., and Vieu, P., (2008) Cross-validated estimations in the single-functional index model. Statistics, 42(6), 475–494, doi:10.1080/02331880801980377.

Novo, S., Aneiros, G., and Vieu, P., (2019) Automatic and location-adaptive estimation in functional single-index regression. Journal of Nonparametric Statistics, 31(2), 364–392, doi:10.1080/10485252.2019.1567726.

Novo, S., Aneiros, G., and Vieu, P., (2021) Sparse semiparametric regression when predictors are mixture of functional and high-dimensional variables. TEST, 30, 481–504, doi:10.1007/s11749-020-00728-w.

Novo, S., Aneiros, G., and Vieu, P., (2021) A kNN procedure in semiparametric functional data analysis. Statistics and Probability Letters, 171, 109028, doi:10.1016/j.spl.2020.109028.

See Also

See also fsim.kNN.fit, predict.sfplsim.kNN and plot.sfplsim.kNN.

Alternative procedure sfplsim.kernel.fit.

Examples


library(fsemipar)

# Load the Tecator data and extract the response and the covariates.
data("Tecator")
y <- Tecator$fat
X <- Tecator$absor.spectra2
z1 <- Tecator$protein
z2 <- Tecator$moisture

# Quadratic, cubic and interaction effects of the scalar covariates.
z.com <- cbind(z1, z2, z1^2, z2^2, z1^3, z2^3, z1*z2)
train <- 1:160

# Sparse SFPLSIM fit. Convergence errors for some theta are obtained.
ptm <- proc.time()
fit <- sfplsim.kNN.fit(y = y[train], x = X[train, ], z = z.com[train, ],
    max.knn = 20, lambda.min.l = 0.01, factor.pn = 2, nknot.theta = 4,
    criterion = "BIC", range.grid = c(850, 1050),
    nknot = 20, max.iter = 5000)
proc.time() - ptm

# Results
fit
names(fit)


[Package fsemipar version 1.1.1 Index]