grandTourNugg {PPbigdata} | R Documentation |
1-dim/2-dim Grand Tour for Big Data based on Data Nuggets
Description
This function performs a 1-dim/2-dim grand tour path for big data based on constructed data nuggets. The grand tour finds projections at random.
Usage
grandTourNugg(nugg, weight, dim, qt = 0.8, ...)
Arguments
nugg |
Data nugget centers obtained from the raw data. Must be a data matrix (of class matrix or data.frame) with at least two columns. |
weight |
Vector of the weight parameter for each data nugget. Its length should be the same as the number of data nuggets, i.e., nrow(nugg). Must be of class numeric or integer. |
dim |
A numeric value giving the target dimensionality of the tour. Must be either 1 or 2. |
qt |
For projected plots of 2-dim tour, a scalar with value in |
... |
Other arguments sent to |
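The constraints stated above for nugg, weight, and dim can be checked before starting a tour. The following is a minimal sketch in base R; check_tour_inputs is a hypothetical helper written for illustration and is not part of PPbigdata. The qt argument is not checked here.

```r
# Hypothetical helper (not part of PPbigdata): checks the documented
# constraints on nugg, weight, and dim before a tour is started.
check_tour_inputs <- function(nugg, weight, dim) {
  stopifnot(
    is.matrix(nugg) || is.data.frame(nugg),  # data matrix or data.frame
    ncol(nugg) >= 2,                         # at least two columns
    is.numeric(weight),                      # TRUE for numeric and integer
    length(weight) == nrow(nugg),            # one weight per data nugget
    length(dim) == 1,
    dim %in% c(1, 2)                         # 1-dim or 2-dim tour only
  )
  invisible(TRUE)
}

# Toy inputs that satisfy the constraints
toy_nugg <- matrix(rnorm(20), nrow = 10, ncol = 2)
toy_weight <- rep(5L, 10)
check_tour_inputs(toy_nugg, toy_weight, dim = 2)
```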
Details
Data nuggets are a representative sample meant to summarize big data: a large dataset is reduced to a much smaller one by eliminating redundant points while preserving the peripheries of the dataset. Each data nugget is defined by a center (location), a weight (importance), and a scale (internal variability). Data nuggets for a large dataset can be created and refined with the functions create.DN and refine.DN in the package datanugget.
Based on the data nuggets from big data, a grand tour is performed to explore the multivariate data. It moves randomly through 1-dim/2-dim projections. This function builds on the grand tour functions in the package tourr; see grand_tour, animate, animate_xy, display_dist, and display_xy for details. For a 2-dim grand tour, the projected data nugget centers are plotted with colors based on their weights, where lighter green represents larger weights. For a 1-dim grand tour, a weighted density histogram of the projected data nugget centers is plotted, taking the data nugget weights into account; see wtd.hist for details. The loadings of each variable for the projections are also shown at each step.
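To make the projection step concrete, the sketch below draws a single random orthonormal basis and projects toy nugget centers onto it, using base R only. It is an illustration under stated assumptions, not the PPbigdata or tourr implementation: the data are simulated, random_basis is a hypothetical helper, the green color ramp is only one possible reading of "lighter green represents larger weights", and a real grand tour also interpolates smoothly between successive bases, which is omitted here.

```r
# Illustrative sketch only; names and data below are hypothetical.
set.seed(1)

# Simulated stand-ins for data nugget centers (n x p) and weights
n <- 200
p <- 4
centers <- matrix(rnorm(n * p), nrow = n, ncol = p)
w <- sample(1:50, n, replace = TRUE)

# Random orthonormal p x d basis: QR decomposition of a Gaussian matrix
random_basis <- function(p, d) {
  qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
}

# 2-dim case: project the centers and color them by weight
B2 <- random_basis(p, 2)
proj2 <- centers %*% B2                      # n x 2 projected centers
greens <- colorRampPalette(c("darkgreen", "lightgreen"))(100)
shade_idx <- ceiling(99 * (w - min(w)) / (max(w) - min(w))) + 1
cols <- greens[shade_idx]                    # larger weight -> lighter green
if (interactive()) plot(proj2, col = cols, pch = 16, cex = 0.5)

# 1-dim case: weighted density histogram of the projected centers
B1 <- random_basis(p, 1)
proj1 <- as.vector(centers %*% B1)
breaks <- seq(min(proj1), max(proj1), length.out = 21)
bin <- cut(proj1, breaks = breaks, include.lowest = TRUE)
bin_weight <- tapply(w, bin, sum)            # total weight per bin
bin_weight[is.na(bin_weight)] <- 0           # empty bins carry no weight
dens <- as.vector(bin_weight) / (sum(w) * diff(breaks))
```

The columns of B2 play the role of the variable loadings for one frame of the tour. Dividing each bin's total weight by the overall weight and the bin width makes the bar areas sum to one, so heavily weighted nuggets contribute proportionally more to the histogram.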
Value
A list containing the bases, index values, and other information during the tour.
Author(s)
Yajie Duan, Javier Cabrera
References
Cook, D., Buja, A., Cabrera, J., & Hurley, C. (1995). Grand tour and projection pursuit. Journal of Computational and Graphical Statistics, 4(3), 155-172.
Cook, D., Buja, A., & Cabrera, J. (1993). Projection pursuit indexes based on orthonormal function expansions. Journal of Computational and Graphical Statistics, 2(3), 225-250.
Wickham, H., Cook, D., Hofmann, H., & Buja, A. (2011). tourr: An R package for exploring multivariate data with projections. Journal of Statistical Software, 40, 1-18.
Beavers, T. E., Cheng, G., Duan, Y., Cabrera, J., Lubomirski, M., Amaratunga, D., & Teigler, J. E. (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure. Journal of Computational and Graphical Statistics, (just-accepted), 1-21.
Duan, Y., Cabrera, J., & Emir, B. (2023). A New Projection Pursuit Index for Big Data. ArXiv:2312.06465. https://doi.org/10.48550/arXiv.2312.06465
See Also
PPnugg, NHnugg, create.DN, refine.DN, guided_tour, animate
Examples
require(datanugget)
#4-dim small example with cluster structures in V3 and V4
X = cbind.data.frame(V1 = rnorm(5*10^3, mean = 5, sd = 2),
                     V2 = rnorm(5*10^3, mean = 5, sd = 1),
                     V3 = c(rnorm(3*10^3, sd = 0.3),
                            rnorm(2*10^3, mean = 2, sd = 0.3)),
                     V4 = c(rnorm(1*10^3, mean = -8, sd = 1),
                            rnorm(3*10^3, mean = 0, sd = 1),
                            rnorm(1*10^3, mean = 7, sd = 1.5)))
#scaling the raw data first is recommended before generating data nuggets for projection pursuit
X = as.data.frame(scale(X))
#create data nuggets
my.DN = create.DN(x = X,
                  R = 500,
                  delete.percent = .1,
                  DN.num1 = 500,
                  DN.num2 = 250,
                  no.cores = 0,
                  make.pbs = FALSE)
#refine data nuggets
my.DN2 = refine.DN(x = X,
                   DN = my.DN,
                   EV.tol = .9,
                   min.nugget.size = 2,
                   max.splits = 5,
                   no.cores = 0,
                   make.pbs = FALSE)
#get nugget centers, weights, and scales
nugg = my.DN2$`Data Nuggets`[,2:(ncol(X)+1)]
weight = my.DN2$`Data Nuggets`$Weight
scale = my.DN2$`Data Nuggets`$Scale
#2-dim grand tour based on data nuggets
grandTourNugg(nugg, weight, dim = 2, cex = 0.5)
#1-dim grand tour based on data nuggets
grandTourNugg(nugg, weight, dim = 1, density_max = 4.5)