R: Nonparametric density estimation with differential...

DE.FEM {fdaPDE}

R Documentation

Nonparametric density estimation with differential regularization

Description

This function implements a nonparametric density estimation method with differential regularization (given by the square root of the L2 norm of the laplacian of the density function), when points are located over a planar mesh. The computation relies only on the C++ implementation of the algorithm.

Usage

DE.FEM(data, FEMbasis, lambda, scaling=NULL, fvec=NULL, heatStep=0.1, heatIter=500,
       stepProposals=NULL,tol1=1e-4, tol2=0, print=FALSE, nfolds=NULL,
       nsimulations=500, step_method="Fixed_Step", direction_method="BFGS",
       preprocess_method="NoCrossValidation", search = "tree", inference = FALSE)

Arguments

`data`	A matrix of dimensions #observations-by-ndim. Data are locations: each row corresponds to one point, the first column corresponds to the `x`-coordinates, the second column corresponds to the `y`-coordinates and, if ndim=3, the third column corresponds to the `z`-coordinates.
`FEMbasis`	A `FEMbasis` object describing the Finite Element basis, as created by `create.FEM.basis`.
`lambda`	A scalar or vector of smoothing parameters. If it is a vector, the optimal smoothing parameter is chosen with a `k`-fold cross-validation procedure based on the L2 norm.
`scaling`	A positive factor needed to scale the smoothing parameter in the construction of confidence intervals. If the scaling is not specified, it is automatically set as the square root of the number of observations.
`fvec`	A vector of length #`nodes` of the mesh. It corresponds to the node values of the initial density function. If this is `NULL` the initial density is estimated thanks to a discretized heat diffusion process that starts from the empirical density of the data. Default is `NULL`. N.B. This vector cannot be the constant vector of zeros since the algorithm works with the log(f).
`heatStep`	A real specifying the time step for the discretized heat diffusion process. Default is `0.1`.
`heatIter`	An integer specifying the number of iterations to perform the discretized heat diffusion process. Default is `500`.
`stepProposals`	A scalar or a vector containing the step parameters useful for the descent algorithm. If there is a vector of parameters, the biggest one such that the functional decreases at each iteration is chosen. If it is `NULL` the following vector `c(0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 1e-7, 1e-8, 1e-9)` is proposed. Default is `NULL`. N.B. If the program does not receive a right parameter, it aborts the R session. Try a smaller parameter.
`tol1`	A scalar specifying the tolerance to use for the termination criterion based on the percentage difference between two consecutive iterations of the minimization algorithm of the loss function, the log-likelihood and the penalization. Default is 1e-5.
`tol2`	A scalar specifying the tolerance to use for the termination criterion based on the norm of the gradient of the functional to be minimized (the true minimum is such that this norm is zero). The default does not use this criterion. Default is 0.
`print`	A boolean that is `TRUE` if the user wants the value of the functional, of the loglikelihood and of the penalization term printed on console at each iteration of the descent algorithm. Default is `FALSE`. N.B. We suggest to let it `FALSE` if `preprocess_method` is 'RightCV' or 'SimplifiedCV'.
`nfolds`	An integer specifying the number of folds used in cross validation technique to find the best `lambda` parameter. If there is only one `lambda` it can be `NULL`. Default is `NULL`.
`nsimulations`	An integer specifying the number of iterations used in the optimization algorithms. Default value is 500.
`step_method`	String. This parameter specifies which step method use in the descent algorithm. If it is `Fixed_Step`, the step is constant during all the algorithm and it is chosen according to `stepProposals`; if it is `Backtracking_Method`, the step is computed at each iteration according to the backtracking method; finally if it is `Wolfe_Method`, the step is computed at each iteration according to the Wolfe method. Default is `Fixed_Step`.
`direction_method`	String. This parameter specifies which descent direction use in the descent algorithm. If it is `Gradient`, the direction is the one given by the gradient descent method (the opposite to the gradient of the functional); if instead it is `BFGS` the direction is the one given by the BFGS method (Broyden Fletcher Goldfarb and Shanno, a Quasi-Newton method). Default is `BFGS`. Other possible choices: Conjugate Gradient direction with Fletcher-Reeves formula (`ConjugateGradientFR`), Conjugate Gradient direction with Polak-Ribiére-Polyak formula (`ConjugateGradientPRP`), Conjugate Gradient direction with Hestenes-Stiefel formula (`ConjugateGradientHS`), Conjugate Gradient direction with Dai-Yuan formula (`ConjugateGradientDY`), Conjugate Gradient direction with Conjugate-Descent formula (`ConjugateGradientCD`), Conjugate Gradient direction with Liu-Storey formula (`ConjugateGradientLS`), L-BFGS direction with 5 correction vectors (`L-BFGS5`), L-BFGS direction with 10 correction vectors (`L-BFGS10`).
`preprocess_method`	String. This parameter specifies the k fold cross validation technique to use, if there is more than one smoothing parameter `lambda` (otherwise it should be `NULL`). If it is `RightCV` the usual k fold cross validation method is performed. If it is `SimplifiedCV` a simplified version is performed. In the latter case the number of smoothing parameters `lambda` must be equal to the number of folds `nfolds`. Default is `NULL`.
`search`	a flag to decide the search algorithm type (tree or naive or walking search algorithm). Default is `tree`.
`inference`	A boolean that is `TRUE` if the user wants to estimate confidence intervals. Default is `FALSE`.

Value

A list with the following variables:

`FEMbasis`	Given FEMbasis with tree information.
`g`	A vector of length #`nodes` that represents the value of the g-function estimated for each `node` of the mesh. The density is the exponential of this function.
`f_init`	A #`nodes`-by-#`lambda` parameters matrix. Each column contains the node values of the initial density used for the lambda given by the column.
`lambda`	A scalar representing the optimal smoothing parameter selected via k fold cross validation, if in the input there is a vector of parameters; the scalar given in input otherwise.
`data`	A matrix of dimensions #observations-by-ndim containing the data used in the algorithm. They are the same given in input if the domain is 2D pr 3D; they are the original data projected on the mesh if the domain is 2.5D.
`CV_err`	A vector of length `nfolds` containing the cross validation errors obtained in each fold, if `preprocess_method` is either `RightCV` or `SimplifiedCV`.

References

Ferraccioli, F., Arnone, E., Finos, L., Ramsay, J. O., Sangalli, L. M. (2021). Nonparametric density estimation over complicated domains. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2), 346-368.
Arnone, E., Ferraccioli, F., Pigolotti, C., Sangalli, L.M. (2021), A roughness penalty approach to estimate densities over two-dimensional manifolds, Computational Statistics and Data Analysis, to appear.

Examples

library(fdaPDE)

## Create a 2D mesh over a squared domain
Xbound <- seq(-3, 3, length.out = 10)
Ybound <- seq(-3, 3, length.out = 10)
grid_XY <- expand.grid(Xbound, Ybound)
Bounds <- grid_XY[(grid_XY$Var1 %in% c(-3, 3)) | (grid_XY$Var2 %in% c(-3, 3)), ]
mesh <- create.mesh.2D(nodes = Bounds, order = 1)
mesh <- refine.mesh.2D(mesh, maximum_area = 0.2)
FEMbasis <- create.FEM.basis(mesh)

## Generate data
n <- 50
set.seed(10)
data_x <- rnorm(n)
data_y <- rnorm(n)
data <- cbind(data_x, data_y)

plot(mesh)
points(data, col="red", pch=19, cex=0.5)

## Density Estimation
lambda = 0.1
sol <- DE.FEM(data = data, FEMbasis = FEMbasis, lambda = lambda, fvec=NULL, heatStep=0.1,
                  heatIter=500, nsimulations=300,step_method = "Fixed_Step", inference = TRUE)

## Visualization 
n = 100
X <- seq(-3, 3, length.out = n)
Y<- seq(-3, 3, length.out = n)
grid <- expand.grid(X, Y)

evaluation <- eval.FEM(FEM(FEMbasis, coeff = sol$g), locations = grid)
lower_bound_g <- eval.FEM(FEM(FEMbasis, coeff = sol$g_CI_L), locations = grid)
upper_bound_g <- eval.FEM(FEM(FEMbasis, coeff = sol$g_CI_U), locations = grid)
evaluation <- exp(evaluation)
lower_bound_g <- exp(lower_bound_g)
upper_bound_g <- exp(upper_bound_g)
eval <- matrix(evaluation, n, n)
eval_L <- matrix(lower_bound_g, n, n)
eval_U <- matrix(upper_bound_g, n, n)

image2D(x = X, y = Y, z = eval_L, col = heat.colors(100), xlab = "x", ylab = "y",
        contour = list(drawlabels = FALSE), main = "Estimated CI lower bound")
image2D(x = X, y = Y, z = eval, col = heat.colors(100), xlab = "x", ylab = "y", 
        contour = list(drawlabels = FALSE), main = "Estimated density")
image2D(x = X, y = Y, z = eval_U, col = heat.colors(100), xlab = "x", ylab = "y",
        contour = list(drawlabels = FALSE), main = "Estimated CI upper bound")

[Package fdaPDE version 1.1-19 Index]