R: Impact point selection with FASSMR and kNN estimation

FASSMR.kNN.fit {fsemipar}

R Documentation

Impact point selection with FASSMR and kNN estimation

Description

This function implements the Fast Algorithm for Sparse Semiparametric Multi-functional Regression (FASSMR) with kNN estimation. This algorithm is specifically designed for estimating multi-functional partial linear single-index models, which incorporate multiple scalar variables and a functional covariate as predictors. These scalar variables are derived from the discretisation of a curve and have a linear effect, while the functional covariate exhibits a single-index effect.

FASSMR selects the impact points of the discretised curve and estimates the model. The algorithm employs a penalised least-squares regularisation procedure, integrated with kNN estimation using Nadaraya-Watson weights. It uses B-spline expansions to represent curves and eligible functional indexes. Additionally, it utilises an objective criterion (criterion) to determine the initial number of covariates in the reduced model (w.opt), the number of neighbours (k.opt), and the penalisation parameter (lambda.opt).
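
As a rough illustration of the kNN estimation step with Nadaraya-Watson weights, the following base-R sketch smooths a response against a scalar index. It is not the package's internal code: the helper names `quad.kernel` and `knn.nw`, the toy data, and the choice of the distance to the (k+1)-th nearest neighbour as local bandwidth are assumptions made for illustration.

```r
# Quadratic (Epanechnikov-type) kernel supported on [0, 1].
# The normalising constant is omitted because it cancels in the weights.
quad.kernel <- function(u) ifelse(u >= 0 & u <= 1, 1 - u^2, 0)

# kNN Nadaraya-Watson estimate of E[y | index = u0].
# u: observed index values, y: responses, k: number of neighbours (k < length(u)).
knn.nw <- function(u0, u, y, k) {
  d <- abs(u - u0)            # distances from the target point
  h <- sort(d)[k + 1]         # assumed local bandwidth: (k+1)-th smallest distance
  w <- quad.kernel(d / h)     # only points closer than h get positive weight
  sum(w * y) / sum(w)
}

# Toy data (hypothetical): five index values and their responses.
u <- c(0, 0.1, 0.2, 0.5, 1)
y <- c(1, 2, 4, 8, 16)
est <- knn.nw(u0 = 0.15, u = u, y = y, k = 2)
est
```

Because the bandwidth adapts to the local density of the index values, `k` plays the role that a fixed bandwidth plays in ordinary kernel estimation.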

Usage

FASSMR.kNN.fit(x, z, y, seed.coeff = c(-1, 0, 1), order.Bspline = 3, 
nknot.theta = 3,  knearest = NULL, min.knn = 2, max.knn = NULL, step = NULL,  
kind.of.kernel = "quad",range.grid = NULL, nknot = NULL, lambda.min = NULL, 
lambda.min.h = NULL, lambda.min.l = NULL, factor.pn = 1, nlambda = 100, 
vn = ncol(z), nfolds = 10, seed = 123, wn = c(10, 15, 20), criterion = "GCV", 
penalty = "grSCAD", max.iter = 1000, n.core = NULL)

Arguments

`x`	Matrix containing the observations of the functional covariate collected by row (functional single-index component).
`z`	Matrix containing the observations of the discretised functional covariate, collected by row (linear component).
`y`	Vector containing the scalar response.
`seed.coeff`	Vector of initial values used to build the set `\Theta_n` (see section `Details`). The coefficients for the B-spline representation of each eligible functional index `\theta \in \Theta_n` are obtained from `seed.coeff`. The default is `c(-1,0,1)`.
`order.Bspline`	Positive integer giving the order of the B-spline basis functions. This is the number of coefficients in each piecewise polynomial segment. The default is 3.
`nknot.theta`	Positive integer indicating the number of regularly spaced interior knots in the B-spline expansion of `\theta_0`. The default is 3.
`knearest`	Vector of positive integers containing the sequence in which the number of nearest neighbours `k.opt` is selected. If `knearest=NULL`, then `knearest <- seq(from = min.knn, to = max.knn, by = step)`.
`min.knn`	A positive integer that represents the minimum value in the sequence for selecting the number of nearest neighbours `k.opt`. This value should be less than the sample size. The default is 2.
`max.knn`	A positive integer that represents the maximum value in the sequence for selecting the number of nearest neighbours `k.opt`. This value should be less than the sample size. The default is `max.knn <- n%/%5`.
`step`	A positive integer used to construct the sequence of k-nearest neighbours as follows: `min.knn, min.knn + step, min.knn + 2step, min.knn + 3step,...`. The default value for `step` is `step<-ceiling(n/100)`.
`kind.of.kernel`	The type of kernel function used. Currently, only the Epanechnikov kernel (`"quad"`) is available.
`range.grid`	Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate `x` are evaluated (i.e. the range of the discretisation). If `range.grid=NULL`, then `range.grid=c(1,p)` is considered, where `p` is the discretisation size of `x` (i.e. `ncol(x)`).
`nknot`	Positive integer indicating the number of interior knots for the B-spline expansion of the functional covariate. The default value is `(p - order.Bspline - 1)%/%2`.
`lambda.min`	The smallest value for lambda (i.e., the lower endpoint of the sequence in which `lambda.opt` is selected), as a fraction of `lambda.max`. The default is `lambda.min.l` if the sample size is larger than `factor.pn` times the number of linear covariates and `lambda.min.h` otherwise.
`lambda.min.h`	The lower endpoint of the sequence in which `lambda.opt` is selected if the sample size is smaller than `factor.pn` times the number of linear covariates. The default is 0.05.
`lambda.min.l`	The lower endpoint of the sequence in which `lambda.opt` is selected if the sample size is larger than `factor.pn` times the number of linear covariates. The default is 0.0001.
`factor.pn`	Positive integer used to set `lambda.min`. The default value is 1.
`nlambda`	Positive integer indicating the number of values in the sequence from which `lambda.opt` is selected. The default is 100.
`vn`	Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is `vn=ncol(z)`, resulting in the individual penalisation of each scalar covariate.
`nfolds`	Positive integer indicating the number of cross-validation folds (used when `criterion="k-fold-CV"`). Default is 10.
`seed`	Seed for the random number generator, set to ensure reproducible results (applicable when `criterion="k-fold-CV"` is used). The default seed value is 123.
`wn`	A vector of positive integers indicating the eligible number of covariates in the reduced model. For more information, refer to the section `Details`. The default is `c(10,15,20)`.
`criterion`	The criterion used to select the tuning and regularisation parameters: `w.opt`, `k.opt` and `lambda.opt` (also `vn.opt` if needed). Options include `"GCV"`, `"BIC"`, `"AIC"`, or `"k-fold-CV"`. The default setting is `"GCV"`.
`penalty`	The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".
`max.iter`	Maximum number of iterations allowed across the entire path. The default value is 1000.
`n.core`	Number of CPU cores designated for parallel execution. The default is `n.core<-availableCores(omit=1)`.
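
The default search grids described above can be reproduced with a few lines of base R. This is only a sketch of the documented defaults, not code taken from the package; the sample size `n` and the number of linear covariates `n.lin` are hypothetical values chosen for illustration.

```r
n     <- 216   # hypothetical sample size
n.lin <- 20    # hypothetical number of linear covariates

# Sequence of candidate numbers of neighbours used when knearest = NULL
min.knn  <- 2
max.knn  <- n %/% 5
step     <- ceiling(n / 100)
knearest <- seq(from = min.knn, to = max.knn, by = step)

# Default lower endpoint of the lambda sequence (as a fraction of lambda.max)
factor.pn    <- 1
lambda.min.h <- 0.05
lambda.min.l <- 0.0001
lambda.min   <- if (n > factor.pn * n.lin) lambda.min.l else lambda.min.h

knearest
lambda.min
```

Passing an explicit `knearest` vector or `lambda.min` value overrides these defaults.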

Details

The multi-functional partial linear single-index model (MFPLSIM) is given by the expression

Y_i=\sum_{j=1}^{p_n}\beta_{0j}\zeta_i(t_j)+r\left(\left<\theta_0,X_i\right>\right)+\varepsilon_i,\ \ \ (i=1,\dots,n),

where:

Y_i is a real random response and X_i denotes a random element belonging to some separable Hilbert space \mathcal{H} with inner product denoted by \left\langle\cdot,\cdot\right\rangle. The second functional predictor \zeta_i is assumed to be a curve defined on some interval [a,b] which is observed at the points a\leq t_1<\dots<t_{p_n}\leq b.
\mathbf{\beta}_0=(\beta_{01},\dots,\beta_{0p_n})^{\top} is a vector of unknown real coefficients and r(\cdot) denotes a smooth unknown link function. In addition, \theta_0 is an unknown functional direction in \mathcal{H}.
\varepsilon_i denotes the random error.

In the MFPLSIM, we assume that only a few scalar variables from the set \{\zeta(t_1),\dots,\zeta(t_{p_n})\} form part of the model. Therefore, we must select the relevant variables in the linear component (the impact points of the curve \zeta on the response) and estimate the model.
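
To make the roles of the two functional predictors concrete, the sketch below simulates a small sample from a model of this form in base R. Everything in it is a hypothetical choice made for illustration (grid sizes, the curves, `theta0`, the link `r`, the two impact points and the noise level), and the inner product is approximated by a simple Riemann sum.

```r
set.seed(1)
n  <- 50    # sample size
p  <- 30    # discretisation size of the curves X_i
pn <- 40    # discretisation size of the curves zeta_i

t.x <- seq(0, 1, length.out = p)
t.z <- seq(0, 1, length.out = pn)

# Curves X_i (rows of x): random combinations of two smooth functions
a <- matrix(rnorm(2 * n), n, 2)
x <- a[, 1] %o% sin(2 * pi * t.x) + a[, 2] %o% cos(2 * pi * t.x)

# Curves zeta_i (rows of z): random quadratic trends
b <- matrix(rnorm(2 * n), n, 2)
z <- b[, 1] %o% t.z + b[, 2] %o% t.z^2

# Sparse linear part: only two impact points have non-zero coefficients
beta0 <- numeric(pn)
beta0[c(10, 30)] <- c(2, -1)

# Single-index part: r(<theta0, X_i>), inner product via a Riemann sum
theta0 <- sqrt(2) * sin(2 * pi * t.x)
index  <- as.vector(x %*% theta0) / p
r      <- function(u) u^3

y <- as.vector(z %*% beta0) + r(index) + rnorm(n, sd = 0.1)
```

The matrices `x` and `z` and the vector `y` have the layout expected by the arguments of the same names (curves collected by row).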

In this function, the MFPLSIM is fitted using the FASSMR algorithm. The main idea of this algorithm is to consider a reduced model with only a few linear covariates (which nevertheless cover the entire discretisation interval of \zeta), directly discarding the other linear covariates (since they are expected to contain very similar information about the response).

To explain the algorithm, we assume, without loss of generality, that the number p_n of linear covariates can be expressed as follows: p_n=q_nw_n with q_n and w_n integers. This consideration allows us to build a subset of the initial p_n linear covariates, containing only w_n equally spaced discretised observations of \zeta covering the entire interval [a,b]. This subset is the following:

\mathcal{R}_n^{\mathbf{1}}=\left\{\zeta\left(t_k^{\mathbf{1}}\right),\ \ k=1,\dots,w_n\right\},

where t_k^{\mathbf{1}}=t_{\left[(2k-1)q_n/2\right]} and \left[z\right] denotes the smallest integer not less than the real number z.
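
As a worked example of this index rule, the base-R lines below compute the positions of the covariates kept in the reduced model. The values of `pn` and `wn` are hypothetical and chosen so that p_n is a multiple of w_n.

```r
pn <- 100        # hypothetical number of linear covariates
wn <- 10         # hypothetical size of the reduced model
qn <- pn / wn    # an integer here, since pn = qn * wn

k <- 1:wn
# Position of t_k^1 within t_1, ..., t_pn: smallest integer not less than (2k - 1) * qn / 2
idx <- ceiling((2 * k - 1) * qn / 2)
idx
```

Each selected position sits in the middle of a block of `qn` consecutive discretisation points, so the `wn` covariates are equally spaced over the whole grid.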

We consider the following reduced model, which involves only the linear covariates belonging to \mathcal{R}_n^{\mathbf{1}}:

Y_i=\sum_{k=1}^{w_n}\beta_{0k}^{\mathbf{1}}\zeta_i(t_k^{\mathbf{1}})+r^{\mathbf{1}}\left(\left<\theta_0^{\mathbf{1}},X_i\right>\right)+\varepsilon_i^{\mathbf{1}}.

The program receives the eligible numbers of linear covariates for building the reduced model through the argument wn. Then, the penalised least-squares variable selection procedure, with kNN estimation, is applied to the reduced model. This is done using the function sfplsim.kNN.fit, which requires the remaining arguments (for details, see the documentation of the function sfplsim.kNN.fit). The estimates obtained are the outputs of the FASSMR algorithm. For further details on this algorithm, see Novo et al. (2021).

Remark: If the condition p_n=w_n q_n is not met (i.e. p_n/w_n is not an integer), the function considers variable values q_n=q_{n,k}, k=1,\dots,w_n. Specifically:

q_{n,k}= \left\{\begin{array}{ll} [p_n/w_n]+1 & k\in\{1,\dots,p_n-w_n[p_n/w_n]\},\\ {[p_n/w_n]} & k\in\{p_n-w_n[p_n/w_n]+1,\dots,w_n\}, \end{array} \right.

where [z] denotes the integer part of the real number z.
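
The block sizes in the remark can be checked numerically. The sketch below evaluates q_{n,k} in base R for hypothetical values of `pn` and `wn` where p_n is not a multiple of w_n; the variable names are illustrative only.

```r
pn <- 103   # hypothetical number of linear covariates
wn <- 10    # hypothetical size of the reduced model

q.floor <- pn %/% wn             # integer part of pn / wn
n.long  <- pn - wn * q.floor     # number of blocks that get one extra point

# q_{n,k}: the first n.long blocks have q.floor + 1 points, the rest q.floor
q <- c(rep(q.floor + 1, n.long), rep(q.floor, wn - n.long))
q
sum(q)
```

The block sizes always add up to `pn`, so every discretisation point belongs to exactly one block.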

The function supports parallel computation. To disable it, set n.core=1.

Value

`call`	The matched call.
`fitted.values`	Estimated scalar response.
`residuals`	Differences between `y` and the `fitted.values`.
`beta.est`	`\hat{\mathbf{\beta}}` (i.e. estimate of `\mathbf{\beta}_0` when the optimal tuning parameters `w.opt`, `lambda.opt`, `k.opt` and `vn.opt` are used).
`beta.red`	Estimate of `\beta_0^{\mathbf{1}}` in the reduced model when the optimal tuning parameters `w.opt`, `lambda.opt`, `k.opt` and `vn.opt` are used.
`theta.est`	Coefficients of `\hat{\theta}` in the B-spline basis (i.e. estimate of `\theta_0` when the optimal tuning parameters `w.opt`, `lambda.opt`, `k.opt` and `vn.opt` are used): a vector of length `order.Bspline+nknot.theta`.
`indexes.beta.nonnull`	Indexes of the non-zero `\hat{\beta_{j}}`.
`k.opt`	Selected number of nearest neighbours (when `w.opt` is considered).
`w.opt`	Selected size for `\mathcal{R}_n^{\mathbf{1}}`.
`lambda.opt`	Selected value for the penalisation parameter (when `w.opt` is considered).
`IC`	Value of the criterion function considered to select `w.opt`, `lambda.opt`, `k.opt` and `vn.opt`.
`vn.opt`	Selected value of `vn` (when `w.opt` is considered).
`beta.w`	Estimate of `\beta_0^{\mathbf{1}}` for each value of the sequence `wn` (i.e. for each number of covariates in the reduced model).
`theta.w`	Estimate of `\theta_0^{\mathbf{1}}` for each value of the sequence `wn` (i.e. its coefficients in the B-spline basis).
`IC.w`	Value of the criterion function for each value of the sequence `wn`.
`indexes.beta.nonnull.w`	Indexes of the non-zero linear coefficients for each value of the sequence `wn`.
`lambda.w`	Selected value of penalisation parameter for each value of the sequence `wn`.
`k.w`	Selected number of neighbours for each value of the sequence `wn`.
`index01`	Indexes of the covariates (in the entire set of `p_n`) used to build `\mathcal{R}_n^{\mathbf{1}}` for each value of the sequence `wn`.
`...`
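
`theta.est` is a coefficient vector rather than a curve. A sketch of how such coefficients could be turned into values of the estimated direction on a grid is given below, using the `splines` package shipped with R. It assumes regularly spaced interior knots and boundary knots at the endpoints of `range.grid`; the coefficient values and the grid are hypothetical, and the package's internal basis construction may differ.

```r
library(splines)

order.Bspline <- 3            # order of the basis (degree + 1)
nknot.theta   <- 3            # number of interior knots
range.grid    <- c(1, 100)    # hypothetical discretisation range

# Regularly spaced interior knots (endpoints excluded)
interior <- seq(range.grid[1], range.grid[2],
                length.out = nknot.theta + 2)[-c(1, nknot.theta + 2)]

grid <- seq(range.grid[1], range.grid[2], length.out = 50)
B <- bs(grid, knots = interior, degree = order.Bspline - 1,
        intercept = TRUE, Boundary.knots = range.grid)

# Hypothetical coefficient vector of length order.Bspline + nknot.theta
theta.coef <- c(-1, 0, 1, 0, -1, 1)
theta.grid <- as.vector(B %*% theta.coef)

ncol(B)
```

The number of basis functions equals `order.Bspline + nknot.theta`, which matches the documented length of `theta.est`.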

Author(s)

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

References

Novo, S., Vieu, P., and Aneiros, G. (2021). Fast and efficient algorithms for sparse semiparametric bi-functional regression. Australian and New Zealand Journal of Statistics, 63, 606–638, doi:10.1111/anzs.12355.

Examples


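library(fsemipar)
data(Sugar)

# Response and the two functional covariates
y <- Sugar$ash
x <- Sugar$wave.290
z <- Sugar$wave.240

# Outliers: observations with ash values above 25
index.y.25 <- y > 25
index.atip <- index.y.25
(1:268)[index.atip]

# Dataset to model (outliers removed)
x.sug <- x[!index.atip, ]
z.sug <- z[!index.atip, ]
y.sug <- y[!index.atip]

# Fit on the first 216 observations and time the call
train <- 1:216
ptm <- proc.time()
fit <- FASSMR.kNN.fit(x = x.sug[train, ], z = z.sug[train, ], y = y.sug[train],
        nknot.theta = 2, lambda.min.l = 0.03, max.knn = 20, nknot = 20,
        criterion = "BIC", max.iter = 5000)
proc.time() - ptm

# Print the fit and list the components of the returned object
fit
names(fit)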

[Package fsemipar version 1.1.1 Index]