supercompress {supercompress}        R Documentation
Supervised Data Compression via Clustering.
Description
supercompress is the supervised data compression method proposed in Joseph and Mak (2021). It is a nonparametric compression method that incorporates information from the response.
Usage
supercompress(n, x, y, lam = 0, standardize = TRUE)
Arguments
n: the number of compressed data points
x: features of the input big data
y: responses of the input big data
lam: robustness parameter, taking values between 0 (fully supervised) and 1 (fully unsupervised)
standardize: should the big data be standardized to have zero mean and unit variance?
Details
The supercompress algorithm finds the n compressed points by sequentially splitting the space into n Voronoi regions whose centers are the n compressed points. The splitting is done to minimize the total within-cluster sum of squares. The parameter lam controls the robustness of the splitting, with value 0 being fully supervised (objective based on the response y only) and value 1 being fully unsupervised (objective based on the features x only); the latter case reduces to k-means clustering. The Voronoi regions are identified by the fast nearest neighbor search implemented in the R package FNN.

Only continuous responses and features are supported at this time. By default, the big data are standardized to have zero mean and unit variance before processing. Please see Joseph and Mak (2021) for details.
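As a minimal sketch of one plausible reading of the objective described above (not the package's internal code; the exact weighting used by supercompress may differ), the lam-weighted within-cluster criterion can be written as follows, with Voronoi assignments obtained via FNN::get.knnx, the nearest-neighbor search mentioned above. The helper name wss is hypothetical.

# hypothetical helper: lam-weighted within-cluster sum of squares
# lam = 0 uses only the response y; lam = 1 uses only the features x
wss <- function(x, y, centers, lam) {
  # assign each big-data point to its nearest compressed point (Voronoi region)
  id <- FNN::get.knnx(centers, x, k = 1)$nn.index[, 1]
  ss <- 0
  for (j in unique(id)) {
    xs <- x[id == j, , drop = FALSE]
    ys <- y[id == j]
    ss_x <- sum(sweep(xs, 2, colMeans(xs))^2)  # feature within-cluster SS
    ss_y <- sum((ys - mean(ys))^2)             # response within-cluster SS
    ss <- ss + lam * ss_x + (1 - lam) * ss_y
  }
  ss
}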
Value
D: features of the compressed data points
ybar: responses of the compressed data points
cluster: a vector of integers indicating the assignment of each big-data point to its nearest compressed data point
l2: the total sum of squares
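For a fitted object sc (as in the Examples below), these components can be inspected directly, for example:

table(sc$cluster)  # number of big-data points assigned to each compressed point
sc$l2              # total sum of squares of the fit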
Author(s)
Chaofan Huang and V. Roshan Joseph
References
Joseph, V. R. and Mak, S. (2021). Supervised compression of big data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 14(3), 217-229.
Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. (2019). FNN: Fast Nearest Neighbor Search Algorithms and Applications. R package version 1.1.3.
Examples
#########################################################################
# One dimensional example
#########################################################################
# generate big data
set.seed(1)
N <- 3000
x <- seq(0,1,length=N)
f <- function(x) dnorm(x, mean = 0.4, sd = 0.01)
y <- f(x) + 0.1 * rnorm(N)
x <- matrix(x, ncol=1)
# visualize big data
plot(x,y,cex=.5,main="Big Data",cex.main=3,xlab="x",ylab="y",cex.lab=2, cex.axis=2)
# big data reduction via supercompress
n <- 30
sc <- supercompress(n,x,y,lam=0)
D <- sc$D # reduced data point input features
ybar <- sc$ybar # reduced data point response
points(cbind(D, ybar), pch=4,col=4,lwd=4, cex=1.5)
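# Optional comparison (not part of the original example): a fully unsupervised
# fit (lam = 1) reduces to k-means clustering on x and ignores the response.
sc1 <- supercompress(n, x, y, lam = 1)
points(cbind(sc1$D, sc1$ybar), pch = 1, col = 2, lwd = 2, cex = 1.5)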
#########################################################################
# Two dimensional Michalewicz function
#########################################################################
f = function(x) {
  p = length(x)
  x = pi*x
  val = -sum(sin(x)*(sin((1:p)*x^2/pi))^(20))
  return(val)
}
# generate big data
p=2
N=10000*p
set.seed(1)
x=NULL
for(i in 1:p) x=cbind(x,runif(N))
y=apply(x,1,f)+.0001*rnorm(N)
true=apply(x,1,f)
# groundtruth
N.plot=250
p1=seq(0,1,length=N.plot)
p2=seq(0,1,length=N.plot)
fc=matrix(apply(expand.grid(p1,p2),1,f),nrow = N.plot, ncol= N.plot)
# big data reduction via supercompress
n <- 100
sc <- supercompress(n,x,y,lam=1/(1+p))
D <- sc$D # reduced data point input features
ybar <- sc$ybar # reduced data point response
image(p1,p2,fc,col=cm.colors(5),xlab=expression(x[1]),ylab=expression(x[2]),
main="robust-supervised",cex.main=3,cex.lab=2, cex.axis=2)
points(D,pch=16,col=4,cex=2)
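# Optional comparison (not part of the original example): a fully supervised
# fit (lam = 0) tends to place compressed points where the response changes
# rapidly; overlay it on the same image for contrast.
sc0 <- supercompress(n, x, y, lam = 0)
points(sc0$D, pch = 1, col = 2, cex = 2)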