PVS.kernel.fit {fsemipar}    R Documentation
Impact point selection with PVS and kernel estimation
Description
This function computes the partitioning variable selection (PVS) algorithm for the multi-functional partial linear model (MFPLM).
PVS is a two-stage procedure that selects the impact points of the discretised curve and estimates the model. The algorithm employs a penalised least-squares regularisation procedure combined with kernel estimation using Nadaraya-Watson weights.
Additionally, it uses an objective criterion (criterion) to select the number of covariates in the reduced model (w.opt), the bandwidth (h.opt) and the penalisation parameter (lambda.opt).
Usage
PVS.kernel.fit(x, z, y, train.1 = NULL, train.2 = NULL, semimetric = "deriv",
q = NULL, min.q.h = 0.05, max.q.h = 0.5, h.seq = NULL, num.h = 10,
range.grid = NULL, kind.of.kernel = "quad", nknot = NULL, lambda.min = NULL,
lambda.min.h = NULL, lambda.min.l = NULL, factor.pn = 1, nlambda = 100,
vn = ncol(z), nfolds = 10, seed = 123, wn = c(10, 15, 20), criterion = "GCV",
penalty = "grSCAD", max.iter = 1000)
Arguments
x
Matrix containing the observations of the functional covariate (functional nonparametric component), collected by row.
z
Matrix containing the observations of the functional covariate that is discretised (linear component), collected by row.
y
Vector containing the scalar response.
train.1
Positions of the data that are used as the training sample in the 1st step. The default is NULL; the first half of the sample, 1:ceiling(n/2), is then used.
train.2
Positions of the data that are used as the training sample in the 2nd step. The default is NULL; the second half of the sample, (ceiling(n/2)+1):n, is then used.
semimetric
Semi-metric function. Only "deriv" and "pca" are implemented. The default is "deriv".
q
Order of the derivative (if semimetric = "deriv") or number of principal components (if semimetric = "pca"). The default values are 0 and 2, respectively.
min.q.h
Minimum quantile order of the distances between curves, which are computed using the projection semi-metric. This value determines the lower endpoint of the range from which the bandwidth is selected. The default is 0.05.
max.q.h
Maximum quantile order of the distances between curves, which are computed using the projection semi-metric. This value determines the upper endpoint of the range from which the bandwidth is selected. The default is 0.5.
h.seq
Vector containing the sequence of bandwidths. The default is a sequence of num.h equispaced bandwidths over the range determined by min.q.h and max.q.h.
num.h
Positive integer indicating the number of bandwidths in the grid. The default is 10.
range.grid
Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate x are evaluated (i.e. the range of the discretisation). If range.grid = NULL, then range.grid = c(1, p) is considered, where p is the discretisation size of x (i.e. ncol(x)).
kind.of.kernel
The type of kernel function used. Currently, only the Epanechnikov kernel ("quad") is available.
nknot
Positive integer indicating the number of interior knots for the B-spline expansion of the functional covariate. The default is NULL; a value based on the discretisation size of x is then computed internally.
lambda.min
The smallest value for lambda (i.e. the lower endpoint of the sequence in which lambda.opt is selected), as a fraction of lambda.max. The default is lambda.min.l or lambda.min.h, depending on the sample size.
lambda.min.h
The lower endpoint of the sequence in which lambda.opt is selected if the sample size is smaller than factor.pn times the number of linear covariates. The default is 0.05.
lambda.min.l
The lower endpoint of the sequence in which lambda.opt is selected if the sample size is larger than factor.pn times the number of linear covariates. The default is 0.0001.
factor.pn
Positive integer used to set lambda.min. The default value is 1.
nlambda
Positive integer indicating the number of values in the sequence from which lambda.opt is selected. The default is 100.
vn
Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is vn = ncol(z), which leads to the individual penalisation of each scalar covariate.
nfolds
Number of cross-validation folds (used when criterion = "k-fold-CV"). The default is 10.
seed
You may set the seed for the random number generator to ensure reproducible results (applicable when criterion = "k-fold-CV" is used). The default seed value is 123.
wn
A vector of positive integers indicating the eligible number of covariates in the reduced model. For more information, refer to the section Details. The default is c(10, 15, 20).
criterion
The criterion used to select the tuning and regularisation parameters: w.opt, lambda.opt and h.opt (and vn.opt, if required). Options include "GCV", "BIC", "AIC" or "k-fold-CV". The default setting is "GCV".
penalty
The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".
max.iter
Maximum number of iterations allowed across the entire path. The default value is 1000.
Details
The multi-functional partial linear model (MFPLM) is given by the expression
Y_i=\sum_{j=1}^{p_n}\beta_{0j}\zeta_i(t_j)+m\left(X_i\right)+\varepsilon_i,\ \ \ (i=1,\dots,n),
where:

- Y_i is a real random response and X_i denotes a random element belonging to some semi-metric space \mathcal{H}. The second functional predictor \zeta_i is assumed to be a curve defined on some interval [a,b], observed at the points a\leq t_1<\dots<t_{p_n}\leq b.

- \mathbf{\beta}_0=(\beta_{01},\dots,\beta_{0p_n})^{\top} is a vector of unknown real coefficients and m(\cdot) represents a smooth unknown real-valued link function.

- \varepsilon_i denotes the random error.
In the MFPLM, it is assumed that only a few scalar variables from the set \{\zeta(t_1),\dots,\zeta(t_{p_n})\} are part of the model. Therefore, the relevant variables in the linear component (the impact points of the curve \zeta on the response) must be selected, and the model estimated.
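To make the model concrete, the following minimal simulation sketch generates data of this form; every numerical choice below (sample size, impact points, link function) is an illustrative assumption, not part of the package:

set.seed(1)
n <- 100; pn <- 100           # sample size and discretisation size of zeta
tt <- seq(0, 1, length = pn)  # observation points t_1 < ... < t_pn in [a,b] = [0,1]
# Discretised curves zeta_i(t_j): Brownian-type trajectories, collected by row
z <- t(apply(matrix(rnorm(n * pn, sd = 1 / sqrt(pn)), n, pn), 1, cumsum))
# Functional covariate X_i for the nonparametric component, also collected by row
x <- t(sapply(1:n, function(i) sin(2 * pi * tt * runif(1, 1, 3))))
# Sparse beta_0: only t_25 and t_75 are impact points
beta0 <- numeric(pn); beta0[c(25, 75)] <- c(2, -1.5)
m <- function(xi) exp(mean(xi))  # a smooth link m(.) acting on each curve X_i
y <- drop(z %*% beta0) + apply(x, 1, m) + rnorm(n, sd = 0.1)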
In this function, the MFPLM is fitted using the PVS procedure, a two-step algorithm. For this, we divide the sample into two independent subsamples (asymptotically of the same size, n_1\sim n_2\sim n/2). One subsample is used in the first stage of the method and the other in the second stage. The subsamples are defined as follows:

\mathcal{E}^{\mathbf{1}}=\{(\zeta_i,\mathcal{X}_i,Y_i),\quad i=1,\dots,n_1\},

\mathcal{E}^{\mathbf{2}}=\{(\zeta_i,\mathcal{X}_i,Y_i),\quad i=n_1+1,\dots,n_1+n_2=n\}.

Note that these two subsamples are specified to the program through the arguments train.1 and train.2 (a balanced split is sketched below).
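For instance, a default-style balanced split of the indices could be supplied as follows (a sketch; x denotes the matrix of curves passed to the function):

n <- nrow(x)                        # total sample size
train.1 <- 1:ceiling(n / 2)         # E^1: subsample used in the first stage
train.2 <- (ceiling(n / 2) + 1):n   # E^2: subsample used in the second stage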
The superscript \mathbf{s}, where \mathbf{s}=\mathbf{1},\mathbf{2}, indicates the stage of the method in which the sample, function, variable or parameter is involved.

To explain the algorithm, assume that the number p_n of linear covariates can be expressed as p_n=q_nw_n, with q_n and w_n being integers.
- First step. A reduced model is considered, discarding many linear covariates. The penalised least-squares procedure is applied to the reduced model using only the subsample \mathcal{E}^{\mathbf{1}}. Specifically:

Consider a subset of the initial p_n linear covariates containing only w_n equally spaced discretised observations of \zeta covering the interval [a,b]. This subset is the following:

\mathcal{R}_n^{\mathbf{1}}=\left\{\zeta\left(t_k^{\mathbf{1}}\right),\ \ k=1,\dots,w_n\right\},

where t_k^{\mathbf{1}}=t_{\left[(2k-1)q_n/2\right]} and \left[z\right] denotes the smallest integer not less than the real number z (see the index sketch after this list). The size (cardinality) of this subset is provided to the program through the argument wn, which contains the sequence of eligible sizes.

Consider the following reduced model, involving only the w_n linear covariates from \mathcal{R}_n^{\mathbf{1}}:

Y_i=\sum_{k=1}^{w_n}\beta_{0k}^{\mathbf{1}}\zeta_i(t_k^{\mathbf{1}})+m^{\mathbf{1}}\left(X_i\right)+\varepsilon_i^{\mathbf{1}}.

The penalised least-squares variable selection procedure, with kernel estimation, is applied to the reduced model using the function sfpl.kernel.fit, which requires the remaining arguments (for details, see the documentation of sfpl.kernel.fit). The estimates obtained in this way are the outputs of the first step of the algorithm.
- Second step. The variables selected in the first step, along with those in their neighbourhood, are included. Then the penalised least-squares procedure, combined with kernel estimation, is carried out again, considering only the subsample \mathcal{E}^{\mathbf{2}}. Specifically:

Consider a new set of variables:

\mathcal{R}_n^{\mathbf{2}}=\bigcup_{\left\{k,\widehat{\beta}_{0k}^{\mathbf{1}}\not=0\right\}}\left\{\zeta(t_{(k-1)q_n+1}),\dots,\zeta(t_{kq_n})\right\}.

Denoting r_n=\sharp(\mathcal{R}_n^{\mathbf{2}}), we can rename the variables in \mathcal{R}_n^{\mathbf{2}} as follows:

\mathcal{R}_n^{\mathbf{2}}=\left\{\zeta(t_1^{\mathbf{2}}),\dots,\zeta(t_{r_n}^{\mathbf{2}})\right\}.

Consider the following model, which involves only the linear covariates belonging to \mathcal{R}_n^{\mathbf{2}}:

Y_i=\sum_{k=1}^{r_n}\beta_{0k}^{\mathbf{2}}\zeta_i(t_k^{\mathbf{2}})+m^{\mathbf{2}}\left(X_i\right)+\varepsilon_i^{\mathbf{2}}.

The penalised least-squares variable selection procedure, with kernel estimation, is applied to this model using sfpl.kernel.fit.
The outputs of the second step are the estimates of the MFPLM. For further details on this algorithm, see Aneiros and Vieu (2015).
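The construction of the two index sets can be sketched in a few lines of R for the case p_n=q_nw_n; the sizes below, and the set of indices retained after the first step, are illustrative assumptions:

pn <- 100; wn <- 10
qn <- pn / wn                               # assumes pn = qn * wn exactly
# t_k^1 = t_[(2k-1) qn / 2], with [z] the smallest integer not less than z
idx1 <- ceiling((2 * (1:wn) - 1) * qn / 2)  # indexes defining R_n^1
# Suppose step 1 retained the 3rd and 7th reduced covariates (illustrative)
k.nonnull <- c(3, 7)
# R_n^2: for each retained k, the whole block of qn original covariates
idx2 <- sort(unique(unlist(lapply(k.nonnull,
        function(k) ((k - 1) * qn + 1):(k * qn)))))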
Remark: If the condition p_n=w_nq_n is not met (so p_n/w_n is not an integer), the function considers values q_n=q_{n,k} varying with k=1,\dots,w_n. Specifically:

q_{n,k}=\left\{\begin{array}{ll} [p_n/w_n]+1 & k\in\{1,\dots,p_n-w_n[p_n/w_n]\},\\ {[p_n/w_n]} & k\in\{p_n-w_n[p_n/w_n]+1,\dots,w_n\}, \end{array}\right.

where [z] denotes the integer part of the real number z.
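A short sketch of this rule (the sizes pn and wn are illustrative):

pn <- 103; wn <- 10
base <- floor(pn / wn)            # [pn/wn], the integer part
n.big <- pn - wn * base           # number of blocks of size base + 1
qnk <- c(rep(base + 1, n.big), rep(base, wn - n.big))
sum(qnk) == pn                    # the wn blocks exactly cover the pn covariates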
Value
call
The matched call.
fitted.values
Estimated scalar response.
residuals
Differences between y and the fitted.values.
beta.est
Estimate of \beta_0 obtained with the selected values of the tuning parameters.
indexes.beta.nonnull
Indexes of the non-zero estimated linear coefficients.
h.opt
Selected bandwidth (when w.opt is considered).
w.opt
Selected size for the reduced model (an element of wn).
lambda.opt
Selected value of the penalisation parameter lambda (when w.opt is considered).
IC
Value of the criterion function considered to select w.opt, lambda.opt, h.opt and vn.opt.
vn.opt
Selected value of vn (when w.opt is considered).
beta2
Estimate of \beta_0^{\mathbf{2}} for each value of the sequence wn.
indexes.beta.nonnull2
Indexes of the non-zero linear coefficients after the step 2 of the method for each value of the sequence wn.
h2
Selected bandwidth in the second step of the algorithm for each value of the sequence wn.
IC2
Optimal value of the criterion function in the second step for each value of the sequence wn.
lambda2
Selected value of the penalisation parameter in the second step for each value of the sequence wn.
index02
Indexes of the covariates (in the entire set of p_n) used to build \mathcal{R}_n^{\mathbf{2}} for each value of the sequence wn.
beta1
Estimate of \beta_0^{\mathbf{1}} for each value of the sequence wn.
h1
Selected bandwidth in the first step of the algorithm for each value of the sequence wn.
IC1
Optimal value of the criterion function in the first step for each value of the sequence wn.
lambda1
Selected value of the penalisation parameter in the first step for each value of the sequence wn.
index01
Indexes of the covariates (in the entire set of p_n) used to build \mathcal{R}_n^{\mathbf{1}} for each value of the sequence wn.
index1
Indexes of the non-zero linear coefficients after the step 1 of the method for each value of the sequence wn.
...
Author(s)
German Aneiros Perez german.aneiros@udc.es
Silvia Novo Diaz snovo@est-econ.uc3m.es
References
Aneiros, G., and Vieu, P. (2015) Partial linear modelling with multi-functional covariates. Computational Statistics, 30, 647–671, doi:10.1007/s00180-015-0568-8.
See Also
See also sfpl.kernel.fit, predict.PVS.kernel
and plot.PVS.kernel
.
Alternative method PVS.kNN.fit
.
Examples
data(Sugar)

y <- Sugar$ash
x <- Sugar$wave.290
z <- Sugar$wave.240

# Identify outlying responses
index.y.25 <- y > 25
index.atip <- index.y.25
(1:268)[index.atip]

# Dataset to model (outliers removed)
x.sug <- x[!index.atip, ]
z.sug <- z[!index.atip, ]
y.sug <- y[!index.atip]

train <- 1:216

ptm <- proc.time()
fit <- PVS.kernel.fit(x = x.sug[train, ], z = z.sug[train, ], y = y.sug[train],
                      train.1 = 1:108, train.2 = 109:216, lambda.min.h = 0.03,
                      lambda.min.l = 0.03, max.q.h = 0.35, nknot = 20,
                      criterion = "BIC", max.iter = 5000)
proc.time() - ptm

fit
names(fit)
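Once the model has been fitted, the components documented in the Value section can be inspected directly, for instance:

fit$w.opt                 # selected size of the reduced model
fit$lambda.opt            # selected penalisation parameter
fit$h.opt                 # selected bandwidth
fit$beta.est              # estimated linear coefficients
fit$indexes.beta.nonnull  # selected impact points of the curve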