PVS.fit {fsemipar}    R Documentation

Impact point selection with PVS

Description

This function implements the Partitioning Variable Selection (PVS) algorithm. This algorithm is specifically designed for estimating multivariate linear models in which the scalar covariates are derived from the discretisation of a curve.

PVS is a two-stage procedure that selects the impact points of the discretised curve and estimates the model. The algorithm employs a penalised least-squares regularisation procedure, together with an objective criterion (criterion) used to select both the initial number of covariates in the reduced model of the first stage (w.opt) and the penalisation parameter (lambda.opt).

Usage

PVS.fit(z, y, train.1 = NULL, train.2 = NULL, lambda.min = NULL, 
lambda.min.h = NULL, lambda.min.l = NULL, factor.pn = 1, nlambda = 100, 
vn = ncol(z), nfolds = 10, seed = 123, wn = c(10, 15, 20), range.grid = NULL, 
criterion = "GCV", penalty = "grSCAD", max.iter = 1000)

Arguments

z

Matrix containing the observations of the functional covariate collected by row (linear component).

y

Vector containing the scalar response.

train.1

Positions of the data that are used as the training sample in the 1st step. The default setting is train.1<-1:ceiling(n/2).

train.2

Positions of the data that are used as the training sample in the 2nd step. The default setting is train.2<-(ceiling(n/2)+1):n.

lambda.min

The smallest value for lambda (i.e., the lower endpoint of the sequence in which lambda.opt is selected), as a fraction of lambda.max. The default is lambda.min.l if the sample size is larger than factor.pn times the number of linear covariates, and lambda.min.h otherwise (see the sketch after this argument list).

lambda.min.h

The lower endpoint of the sequence in which lambda.opt is selected if the sample size is smaller than factor.pn times the number of linear covariates. The default is 0.05.

lambda.min.l

The lower endpoint of the sequence in which lambda.opt is selected if the sample size is larger than factor.pn times the number of linear covariates. The default is 0.0001.

factor.pn

Positive integer used to set lambda.min. The default value is 1.

nlambda

Number of values in the sequence from which lambda.opt is selected. The default is 100.

vn

Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is vn=ncol(z), resulting in the individual penalisation of each scalar covariate.

nfolds

Number of cross-validation folds (used when criterion="k-fold-CV"). Default is 10.

seed

Seed for the random number generator, to ensure reproducible results when criterion="k-fold-CV" is used. The default is 123.

wn

A vector of positive integers indicating the eligible number of covariates in the reduced model. For more information, refer to the section Details. The default is c(10,15,20).

range.grid

Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate are evaluated (i.e. the range of the discretisation). If range.grid=NULL, then range.grid=c(1,p) is considered, where p is the discretisation size (i.e. ncol(z)).

criterion

The criterion used to select the tuning and regularisation parameters: w.opt and lambda.opt (also vn.opt if needed). Options include "GCV", "BIC", "AIC" and "k-fold-CV". The default setting is "GCV".

penalty

The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".

max.iter

Maximum number of iterations allowed across the entire path. The default value is 1000.
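The interplay between lambda.min, lambda.min.h, lambda.min.l and factor.pn can be sketched in R as follows. This is an illustrative reconstruction of the default rule documented above, not the package's internal code; the covariate matrix z is hypothetical.

z <- matrix(rnorm(100 * 50), nrow = 100)  # hypothetical 100 x 50 covariate matrix
factor.pn <- 1; lambda.min.h <- 0.05; lambda.min.l <- 0.0001  # documented defaults
n <- nrow(z)   # sample size
pn <- ncol(z)  # number of linear covariates
lambda.min <- if (n > factor.pn * pn) lambda.min.l else lambda.min.h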

Details

The sparse linear model with covariates coming from the discretisation of a curve is given by the expression

Y_i=\sum_{j=1}^{p_n}\beta_{0j}\zeta_i(t_j)+\varepsilon_i,\ \ \ (i=1,\dots,n)

where Y_i is a real random response and \zeta_i is a random curve defined on the interval [a,b] and observed at the discretisation points a\le t_1<\dots<t_{p_n}\le b; \beta_{01},\dots,\beta_{0p_n} are unknown real coefficients and \varepsilon_i denotes the random error.

In this model, it is assumed that only a few scalar variables from the set \{\zeta(t_1),\dots,\zeta(t_{p_n})\} are part of the model. Therefore, the relevant variables (the impact points of the curve \zeta on the response) must be selected, and the model estimated.
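To make this sparse setting concrete, the following toy R simulation (purely illustrative; the curves, coefficients and noise level are arbitrary choices) generates discretised curves for which only two discretisation points affect the response:

set.seed(1)
n <- 100; pn <- 50
tt <- seq(0, 1, length.out = pn)    # discretisation points t_1,...,t_{p_n}
z <- t(replicate(n, sin(2 * pi * runif(1, 1, 3) * tt) + rnorm(pn, sd = 0.05)))
beta0 <- numeric(pn)
beta0[c(10, 35)] <- c(1.5, -2)      # t_10 and t_35 are the only impact points
y <- drop(z %*% beta0) + rnorm(n, sd = 0.1)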

In this function, this model is fitted using the PVS algorithm, a two-step procedure. The sample is divided into two independent subsamples, each asymptotically half the size of the original sample (n_1\sim n_2\sim n/2). One subsample is used in the first stage of the method, and the other in the second stage. The subsamples are defined as follows:

\mathcal{E}^{\mathbf{1}}=\{(\zeta_i,Y_i),\quad i=1,\dots,n_1\},

\mathcal{E}^{\mathbf{2}}=\{(\zeta_i,Y_i),\quad i=n_1+1,\dots,n_1+n_2=n\}.

Note that these two subsamples are specified to the program through the arguments train.1 and train.2. The superscript \mathbf{s}, where \mathbf{s}=\mathbf{1},\mathbf{2}, indicates the stage of the method in which the sample, function, variable or parameter is involved.

To explain the algorithm, we assume that the number p_n of linear covariates can be expressed as p_n=q_n w_n, with q_n and w_n being integers (for instance, p_n=100 discretisation points split into w_n=20 blocks of q_n=5 consecutive covariates).

  1. First step. A reduced model is considered, discarding many linear covariates. The penalised least-squares procedure is applied to the reduced model using only the subsample \mathcal{E}^{\mathbf{1}}. Specifically:

    • Consider a subset of the initial p_n linear covariates, containing only w_n equally spaced discretised observations of \zeta covering the interval [a,b]. This subset is the following:

      \mathcal{R}_n^{\mathbf{1}}=\left\{\zeta\left(t_k^{\mathbf{1}}\right),\ \ k=1,\dots,w_n\right\},

      where t_k^{\mathbf{1}}=t_{\left[(2k-1)q_n/2\right]} and \left[z\right] denotes the smallest integer not less than the real number z. The size (cardinality) of this subset is provided to the program in the argument wn (which contains a sequence of eligible sizes). An R sketch of this index construction is given after the algorithm description.

    • Consider the following reduced model, which involves only the w_n linear covariates from \mathcal{R}_n^{\mathbf{1}}:

      Y_i=\sum_{k=1}^{w_n}\beta_{0k}^{\mathbf{1}}\zeta_i(t_k^{\mathbf{1}})+\varepsilon_i^{\mathbf{1}}.

      The penalised least-squares variable selection procedure is applied to this reduced model using the function lm.pels.fit, which requires the remaining arguments (for details, see the documentation of lm.pels.fit). The estimates obtained are the outputs of the first step of the algorithm.

  2. Second step. The variables selected in the first step, along with the variables in their neighbourhood, are included. Then, the penalised least-squares procedure is carried out again, considering only the subsample \mathcal{E}^{\mathbf{2}}. Specifically:

    • Consider a new set of variables:

      \mathcal{R}_n^{\mathbf{2}}=\bigcup_{\left\{k,\widehat{\beta}_{0k}^{\mathbf{1}}\not=0\right\}}\left\{\zeta(t_{(k-1)q_n+1}),\dots,\zeta(t_{kq_n})\right\}.

      Denoting r_n=\sharp(\mathcal{R}_n^{\mathbf{2}}), we can rename the variables in \mathcal{R}_n^{\mathbf{2}} as follows:

      \mathcal{R}_n^{\mathbf{2}}=\left\{\zeta(t_1^{\mathbf{2}}),\dots,\zeta(t_{r_n}^{\mathbf{2}})\right\}.

    • Consider the following model, which involves only the linear covariates belonging to \mathcal{R}_n^{\mathbf{2}}:

      Y_i=\sum_{k=1}^{r_n}\beta_{0k}^{\mathbf{2}}\zeta_i(t_k^{\mathbf{2}})+\varepsilon_i^{\mathbf{2}}.

      The penalised least-squares variable selection procedure is applied to this model using lm.pels.fit.

The outputs of the second step are the estimates of the model. For further details on this algorithm, see Aneiros and Vieu (2014).
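The index constructions used in the two steps can be mirrored in R as in the sketch below. The values of p_n and w_n and the set of indexes selected in the first step are hypothetical, and the code illustrates the formulas above rather than reproducing the package's internal implementation.

pn <- 100; wn <- 20
qn <- pn / wn                             # assumes pn = qn * wn (see the Remark below)
k <- 1:wn
index01 <- ceiling((2 * k - 1) * qn / 2)  # points t_k^1 defining R_n^1
selected <- c(3, 7)                       # hypothetical non-zero indexes from step 1
index02 <- unlist(lapply(selected,        # covariates forming R_n^2
                         function(k) ((k - 1) * qn + 1):(k * qn)))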

Remark: If the condition p_n=w_n q_n is not met (i.e. p_n/w_n is not an integer), the function considers variable values q_n=q_{n,k}, k=1,\dots,w_n. Specifically:

q_{n,k}= \left\{\begin{array}{ll} [p_n/w_n]+1 & k\in\{1,\dots,p_n-w_n[p_n/w_n]\},\\ {[p_n/w_n]} & k\in\{p_n-w_n[p_n/w_n]+1,\dots,w_n\}, \end{array} \right.

where [z] denotes the integer part of the real number z.
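For example, the block sizes q_{n,k} can be computed as in the following sketch (hypothetical p_n and w_n; an illustration of the formula above, not the package's internal code):

pn <- 103; wn <- 20
base <- floor(pn / wn)     # [p_n/w_n], the integer part
extra <- pn - wn * base    # number of blocks of size base + 1
qnk <- c(rep(base + 1, extra), rep(base, wn - extra))
sum(qnk) == pn             # TRUE: the blocks partition all p_n covariates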

Value

call

The matched call.

fitted.values

Estimated scalar response.

residuals

Differences between y and the fitted.values.

beta.est

\hat{\mathbf{\beta}} (i.e., the estimate of \mathbf{\beta}_0 when the optimal tuning parameters w.opt and lambda.opt are used).

indexes.beta.nonnull

Indexes of the non-zero \hat{\beta}_j.

w.opt

Selected size for \mathcal{R}_n^{\mathbf{1}}.

lambda.opt

Selected value of the penalisation parameter \lambda (when w.opt is considered).

IC

Value of the criterion function considered to select w.opt and lambda.opt.

beta2

Estimate of \mathbf{\beta}_0^{\mathbf{2}} for each value of the sequence wn.

indexes.beta.nonnull2

Indexes of the non-zero linear coefficients after step 2 of the method, for each value of the sequence wn.

IC2

Optimal value of the criterion function in the second step for each value of the sequence wn.

lambda2

Selected value of the penalisation parameter in the second step for each value of the sequence wn.

index02

Indexes of the covariates (in the entire set of p_n) used to build \mathcal{R}_n^{\mathbf{2}} for each value of the sequence wn.

beta1

Estimate of \mathbf{\beta}_0^{\mathbf{1}} for each value of the sequence wn.

IC1

Optimal value of the criterion function in the first step for each value of the sequence wn.

lambda1

Selected value of the penalisation parameter in the first step for each value of the sequence wn.

index01

Indexes of the covariates (in the entire set of p_n) used to build \mathcal{R}_n^{\mathbf{1}} for each value of the sequence wn.

index1

Indexes of the non-zero linear coefficients after step 1 of the method, for each value of the sequence wn.

...

Author(s)

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

References

Aneiros, G. and Vieu, P. (2014) Variable selection in infinite-dimensional problems. Statistics & Probability Letters, 94, 12–20, doi:10.1016/j.spl.2014.06.025.

See Also

See also lm.pels.fit.

Examples

data(Sugar)

y<-Sugar$ash
z<-Sugar$wave.240

# Identify outliers (observations with a response value above 25)
index.y.25 <- y > 25
index.atip <- index.y.25
(1:268)[index.atip]   # positions of the outlying observations


# Dataset to model (outliers removed)
z.sug <- z[!index.atip,]
y.sug <- y[!index.atip]

train <- 1:216

# Fit the model, timing the call
ptm <- proc.time()
fit <- PVS.fit(z=z.sug[train,], y=y.sug[train], train.1=1:108, train.2=109:216,
        lambda.min.h=0.2, criterion="BIC", max.iter=5000)
proc.time() - ptm

fit 
names(fit)
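
# A hypothetical alternative call: the same fit, but selecting the tuning
# parameters by 5-fold cross-validation instead of BIC (nfolds and seed are
# only used when criterion="k-fold-CV"); the argument values are illustrative.
fit.cv <- PVS.fit(z=z.sug[train,], y=y.sug[train], train.1=1:108,
        train.2=109:216, lambda.min.h=0.2, criterion="k-fold-CV",
        nfolds=5, seed=123, max.iter=5000)
fit.cv$indexes.beta.nonnull   # impact points selected under k-fold-CV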
