NHnugg {PPbigdata} | R Documentation |
Natural Hermite Index for projected 1-dim/2-dim Big Data based on data nuggets
Description
This function calculates the Natural Hermite index, a Projection Pursuit index proposed by Cook et al. (1993), for 1-dim/2-dim projected big data based on data nuggets.
Usage
NHnugg(nuggproj, weight, scale,
bandwidth = NULL, gridn = 300,lims = NULL, gridnAd = TRUE)
Arguments
nuggproj |
Projected data nugget centers in 1-dim/2-dim space. Must be a data matrix (of class matrix, or data.frame) with two columns or a vector containing only entries of class numeric. |
weight |
Vector of the weight parameter for each data nugget. Its length should be the same as the number of data nuggets, i.e., nrow(nuggproj) for 2-dim projection/length(nuggproj) for 1-dim projection. Must be of class numeric or integer. |
scale |
Vector of the scale parameter for each data nugget. Its length should be the same as the number of data nuggets, i.e., nrow(nuggproj) for 2-dim projection/length(nuggproj) for 1-dim projection. Must be of class numeric or integer. |
bandwidth |
Bandwidth in each direction, which is combined with the data nugget scales to give the final bandwidth for kernel density estimation of the projected data nuggets. Defaults to the normal reference bandwidth, taking the weights into account. Can be a scalar or a length-2 numeric vector. For a 2-dim projection, a scalar value is applied in both directions. |
gridn |
Number of grid points in each direction used for kernel density estimation of projected data. Can be scalar or a length-2 integer vector. |
lims |
The limits of each direction used for kernel density estimation of projected data. Must be a length-4 numeric vector as (xlow, xupper, ylow, yupper) for 2-dim projected data, or a length-2 numeric vector as (xlow, xupper) for 1-dim projected data. If NULL, defaults to the range of each direction. |
gridnAd |
logical; if |
Details
This function calculates the Natural Hermite index, a Projection Pursuit index proposed by Cook et al. (1993), for 1-dim/2-dim projected big data based on data nuggets.
Data nuggets are a representative sample meant to summarize big data: they reduce a large dataset to a much smaller one by eliminating redundant points while preserving the peripheries of the dataset. Each data nugget is defined by a center (location), weight (importance), and scale (internal variability). Data nuggets for a large dataset can be created and refined with the functions create.DN and refine.DN in the package datanugget.
The Natural Hermite index is a Projection Pursuit (PP) index that measures the distance between the density of the projected data and the standard normal density. Larger index values indicate hidden structure in the multivariate data, such as clustering, outliers, or other non-linear structure. However, the index is computationally hard to calculate for big data, because it requires estimating the density of the projected big data. Duan et al. (2023) proposed a new PP index for big data that is based on the Natural Hermite index and data nuggets.
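For reference, the Natural Hermite index of Cook et al. (1993) is usually written as below. The symbols are notation introduced here for illustration and do not appear in the package: f is the density of the projected, spherized data, phi is the standard normal density in the projection space (1-dim or 2-dim), y_g are grid points, and Delta is the size of a grid cell. The second line is only a sketch of a summation-based approximation on a regular grid, with an estimate of f in place of f; the exact scheme used by NHnugg may differ.

```latex
I^{N} = \int \bigl( f(y) - \phi(y) \bigr)^{2} \, \phi(y) \, dy
\qquad
I^{N} \approx \sum_{g} \bigl( \hat{f}(y_g) - \phi(y_g) \bigr)^{2} \, \phi(y_g) \, \Delta
```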
In this function, the PP index value for 1-dim/2-dim projected big data is calculated from created and refined data nuggets. Data nuggets are first created and refined for the big data. For the Natural Hermite index, the data nugget centers need to be spherized, taking the nugget weights into account, before projection. The spherized data nugget centers are then projected to obtain the projected data nuggets. The density of the projected big data is first estimated by nuggKDE. Based on this estimate, the Natural Hermite index value is calculated by numerical integration, using summation.
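The steps in this paragraph can be sketched in base R for a 1-dim projection. This is a rough illustration under stated assumptions, not the package's implementation: wsph_sketch and nh_index_1d are hypothetical helpers, a plain weighted Gaussian kernel estimate stands in for nuggKDE, the per-nugget bandwidth is assumed to be sqrt(bandwidth^2 + scale^2), and the index is assumed to have the form of the integral of (f - phi)^2 * phi.

```r
# Hypothetical helpers for illustration; none of them belong to PPbigdata.

# Weighted spherizing: centre with the weighted mean, then rotate and rescale
# so that the weighted (ML) covariance of the result is the identity.
wsph_sketch <- function(x, weight) {
  x <- as.matrix(x)
  cw <- cov.wt(x, wt = weight, method = "ML")
  e <- eigen(cw$cov, symmetric = TRUE)
  sweep(x, 2, cw$center) %*% e$vectors %*% diag(1 / sqrt(e$values), ncol(x))
}

# Riemann-sum approximation of the integral of (f - phi)^2 * phi in 1-dim.
nh_index_1d <- function(centers, weight, scale, bandwidth,
                        gridn = 300, lims = range(centers)) {
  w <- weight / sum(weight)               # normalise nugget weights
  h <- sqrt(bandwidth^2 + scale^2)        # assumed per-nugget bandwidth
  grid <- seq(lims[1], lims[2], length.out = gridn)
  step <- grid[2] - grid[1]
  # weighted Gaussian kernel density estimate evaluated on the grid
  fhat <- vapply(grid, function(g) sum(w * dnorm(g, centers, h)), numeric(1))
  phi <- dnorm(grid)
  sum((fhat - phi)^2 * phi) * step
}

# Toy "nuggets": two clusters along the first axis, noise on the other two.
set.seed(1)
n <- 300
x <- cbind(c(rnorm(n / 2, -3), rnorm(n / 2, 3)), rnorm(n), rnorm(n))
weight <- sample(1:20, n, replace = TRUE)
scale <- rep(0.05, n)

z <- wsph_sketch(x, weight)

# After spherizing, the first column follows the direction of largest
# variance (the cluster direction here), the last one is close to noise.
idx_cluster <- nh_index_1d(drop(z %*% c(1, 0, 0)), weight, scale,
                           bandwidth = 0.3, lims = c(-4, 4))
idx_noise <- nh_index_1d(drop(z %*% c(0, 0, 1)), weight, scale,
                         bandwidth = 0.3, lims = c(-4, 4))
```

In this toy setting the projection along the cluster direction should give a clearly larger value than the noise direction, which is the behaviour a projection pursuit search exploits.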
Value
A numeric value giving the Natural Hermite index of the projected big data based on the data nuggets.
Author(s)
Yajie Duan, Javier Cabrera
References
Cook, D., Buja, A., & Cabrera, J. (1993). Projection pursuit indexes based on orthonormal function expansions. Journal of Computational and Graphical Statistics, 2(3), 225-250.
Cook, D., Buja, A., Cabrera, J., & Hurley, C. (1995). Grand tour and projection pursuit. Journal of Computational and Graphical Statistics, 4(3), 155-172.
Beavers, T. E., Cheng, G., Duan, Y., Cabrera, J., Lubomirski, M., Amaratunga, D., & Teigler, J. E. (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure. Journal of Computational and Graphical Statistics, (just-accepted), 1-21.
Duan, Y., Cabrera, J., & Emir, B. (2023). A New Projection Pursuit Index for Big Data. ArXiv:2312.06465. https://doi.org/10.48550/arXiv.2312.06465
See Also
Examples
require(datanugget)
require(rstiefel)
#4-dim small example
X = cbind.data.frame(rnorm(5*10^3),
                     rnorm(5*10^3,2,1),
                     rnorm(5*10^3,5,2),
                     rnorm(5*10^3))
#scaling the raw data first is recommended before generating data nuggets for Projection Pursuit
X = as.data.frame(scale(X))
#create data nuggets
my.DN = create.DN(x = X,
                  R = 500,
                  delete.percent = .1,
                  DN.num1 = 500,
                  DN.num2 = 250,
                  no.cores = 0,
                  make.pbs = FALSE)
#refine data nuggets
my.DN2 = refine.DN(x = X,
                   DN = my.DN,
                   EV.tol = .9,
                   min.nugget.size = 2,
                   max.splits = 5,
                   no.cores = 0,
                   make.pbs = FALSE)
#get nugget centers, weights, and scales
nugg = my.DN2$`Data Nuggets`[,2:(ncol(X)+1)]
weight = my.DN2$`Data Nuggets`$Weight
scale = my.DN2$`Data Nuggets`$Scale
#spherize the data nuggets with weights to calculate the PP index
nugg_wsph <- wsph(nugg,weight)$data_wsph
#generate a random orthonormal matrix as a projection matrix to 2-dim space
proj_2d = rustiefel(4, 2)
#project data nugget centers into 2-dim space by the random projection matrix
nuggproj_2d = as.matrix(nugg_wsph)%*%proj_2d
#plot the projected data nuggets
#lighter green represents larger weights
plotNugg(nuggproj_2d, weight)
#calculate the Natural Hermite index for the projected 2-dim big data
NHnugg(nuggproj_2d,weight,scale)