grandTourNugg {PPbigdata}R Documentation

1-dim/2-dim Grand Tour for Big Data based on Data Nuggets

Description

This function performs a 1-dim/2-dim grand tour path for big data based on constructed data nuggets. The grand tour finds projections at random.

Usage

grandTourNugg(nugg, weight, dim, qt = 0.8,...)

Arguments

nugg

Data nugget centers obtained from raw data. Must be a data matrix (of class matrix, or data.frame) with at least two columns.

weight

Vector of the weight parameter for each data nugget. Its length should be the same as the number of data nuggets, i.e., nrow(nugg). Must be of class numeric or integer.

dim

A numerical value indicating the target dimensionality for the tour. It's either 1 or 2.

qt

For projected plots of 2-dim tour, a scalar with value in [0,1] indicating the probability used to obtain a sample quantile of the data nugget weights as the maximal value to transform weights to colors for the plot. Defaults to be 0.8. See plotNugg.

...

Other arguments sent to animate, animate_xy, display_dist, display_xy, wtd.hist.

Details

This function performs a 1-dim/2-dim grand tour path for big data based on constructed data nuggets. The grand tour finds projections randomly.

Data nuggets are a representative sample meant to summarize Big Data by reducing a large dataset to a much smaller dataset by eliminating redundant points while also preserving the peripheries of the dataset. Each data nugget is defined by a center (location), weight (importance), and scale (internal variability). Data nuggets for a large dataset could be created and refined by functions create.DN or refine.DN in the package datanugget.

Based on the data nuggets from big data, a grand tour is performed to explore the multivariate data. It walks randomly to discover 1-dim/2-dim projections. This function for data nuggets is based on functions about grand tour in the package tourr. See details in grand_tour, animate, animate_xy, display_dist, and display_xy. For 2-dim grand tour, the projected data nugget centers are plotted with colors based on their weights where lighter green represents larger weights. For 1-dim grand tour, a weighted density histgram of 1-dim projected data nugget centers is plotted considering the data nugget weights. See details in wtd.hist. The loadings of each variable for projections are also shown at each step.

Value

A list containing the bases, index values, and other information during the tour.

Author(s)

Yajie Duan, Javier Cabrera

References

Cook, D., Buja, A., Cabrera, J., & Hurley, C. (1995). Grand tour and projection pursuit. Journal of Computational and Graphical Statistics, 4(3), 155-172.

Cook, D., Buja, A., & Cabrera, J. (1993). Projection pursuit indexes based on orthonormal function expansions. Journal of Computational and Graphical Statistics, 2(3), 225-250.

Wickham, H., Cook, D., Hofmann, H., & Buja, A. (2011). tourr: An R package for exploring multivariate data with projections. Journal of Statistical Software, 40, 1-18.

Beavers, T. E., Cheng, G., Duan, Y., Cabrera, J., Lubomirski, M., Amaratunga, D., & Teigler, J. E. (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure. Journal of Computational and Graphical Statistics, (just-accepted), 1-21.

Duan, Y., Cabrera, J., & Emir, B. (2023). A New Projection Pursuit Index for Big Data. ArXiv:2312.06465. https://doi.org/10.48550/arXiv.2312.06465

See Also

PPnugg, NHnugg,create.DN, refine.DN, guided_tour, animate

Examples


  require(datanugget)

  #4-dim small example with cluster stuctures in V3 and V4
  X = cbind.data.frame(V1 = rnorm(5*10^3,mean = 5,sd = 2),
                       V2 = rnorm(5*10^3,mean = 5,sd = 1),
                       V3 = c(rnorm(3*10^3,sd = 0.3),
                              rnorm(2*10^3,mean = 2, sd = 0.3)),
                       V4 = c(rnorm(1*10^3,mean = -8, sd = 1),
                              rnorm(3*10^3,mean = 0,sd = 1),
                              rnorm(1*10^3,mean = 7, sd = 1.5)))

  #raw data is recommended to be scaled firstly to generate data nuggets for Projection Pursuit
  X = as.data.frame(scale(X))

  #create data nuggets
  my.DN = create.DN(x = X,
                    R = 500,
                    delete.percent = .1,
                    DN.num1 = 500,
                    DN.num2 = 250,
                    no.cores = 0,
                    make.pbs = FALSE)


  #refine data nuggets
  my.DN2 = refine.DN(x = X,
                     DN = my.DN,
                     EV.tol = .9,
                     min.nugget.size = 2,
                     max.splits = 5,
                     no.cores = 0,
                     make.pbs = FALSE)

  #get nugget centers, weights, and scales
  nugg = my.DN2$`Data Nuggets`[,2:(ncol(X)+1)]
  weight = my.DN2$`Data Nuggets`$Weight
  scale = my.DN2$`Data Nuggets`$Scale

  #2-dim grand tour based on data nuggets
  grandTourNugg(nugg,weight,dim = 2,cex = 0.5)

  #1-dim grand tour based on data nuggets
  grandTourNugg(nugg,weight,dim = 1,density_max = 4.5)


[Package PPbigdata version 1.0.0 Index]