ca,cabase,calm,caglm,caprcomp,cakm,cameans,caquantile,caagg,caknn {partools} | R Documentation |
Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones
Description
Easy parallelization of most statistical computations.
Usage
ca(cls,z,ovf,estf,estcovf=NULL,findmean=TRUE,scramble=FALSE)
cabase(cls,ovf,estf,estcovf=NULL,findmean=TRUE,cacall=FALSE,z=NULL,scramble=FALSE)
calm(cls,lmargs)
caglm(cls,glmargs)
caprcomp(cls,prcompargs, p)
cakm(cls,mtdf,ncenters,p)
cameans(cls,cols,na.rm=FALSE)
caquantile(cls,vec, probs = c(0.25, 0.5, 0.75),na.rm=FALSE)
caagg(cls,ynames,xnames,dataname,FUN)
caknn(cls, yname, k, xname='')
Arguments
cls |
A cluster run under the parallel package. |
z |
A data frame, matrix or vector, one observation per row/element. |
ovf |
Overall statistical function, say |
estf |
Function to extract the point estimate (typically
vector-valued) from the output of |
estcovf |
If provided, function to extract the estimated
covariance matrix of the output of |
.
findmean |
If TRUE, output the average of the estimates from the chunks; otherwise, output only the estimates themselves. |
lmargs |
Quoted string representing arguments to |
glmargs |
Quoted string representing arguments to |
prcompargs |
Quoted string representing arguments to
|
p |
Number of columns in data |
na.rm |
If TRUE, remove NA values from the analysis. |
mtdf |
Quoted name of a distributed matrix or data frame. |
ncenters |
Number of clusters to find. |
cacall |
If TRUE, indicates that |
scramble |
If this and |
cols |
A quoted string that evaluates to a data frame or matrix. |
vec |
A quoted string that evaluates to a vector. |
yname |
A quoted variable name, for the Y vector. |
k |
Number of nearest neighbors. |
xname |
A quoted variable name, for the X matrix/data frame. If
empty, it is assumed that |
ynames |
A vector of quoted variable names. |
xnames |
A vector of quoted variable names. |
dataname |
Quoted name of a data frame or matrix. |
probs |
As in the argument with the same name in
|
FUN |
Quoted name of a function. |
Details
Implements the “Software Alchemy” (SA) method for parallelizing statistical computations (N. Matloff, Parallel Computation for Data Science, Chapman and Hall, 2015, with further details in N. Matloff, Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones, Journal of Statistical Software, 2016). This can result in substantial speedups in computation, and can also ease limits on physical memory.
The method breaks the data into chunks and applies the given estimator to each chunk. The chunk results are then averaged and, optionally, an estimated covariance matrix of the average is computed.
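The chunk-and-average idea can be sketched serially in base R. This is only an illustration under stated assumptions, not the package's implementation: the data are simulated, the chunk count r and the names chunks, fits and full are hypothetical, and the covariance formula assumes the chunk estimates are independent. The names thts, tht and thtcov are chosen to mirror the components described under Value.

```r
# Illustrative sketch only: serial chunk averaging in base R,
# not the partools code.
set.seed(1)
n <- 10000
r <- 4                                   # number of chunks (hypothetical)
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + d$x2 + rnorm(n)

# break the data into r chunks of equal size
chunks <- split(d, rep(seq_len(r), length.out = n))

# apply the estimator (here lm) to each chunk
fits <- lapply(chunks, function(ch) lm(y ~ x1 + x2, data = ch))

# point estimates, one row per chunk
thts <- t(sapply(fits, coef))

# average the chunk estimates
tht <- colMeans(thts)

# covariance of an average of r independent estimates:
# sum of the chunk covariance matrices, divided by r^2
thtcov <- Reduce(`+`, lapply(fits, vcov)) / r^2

# compare with the fit on the full data
full <- lm(y ~ x1 + x2, data = d)
rbind(chunked = tht, full = coef(full))
```

In an actual parallel run each chunk would be fit on a different cluster node; the serial lapply here only stands in for that step.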
Except for ca, it is assumed that the chunking has already been done, say via distribsplit or readnscramble.
Note that in cabase, the data object is not specified explicitly in the argument list. This is done through the function ovf.
Key point: The SA estimator is statistically equivalent to the original, nonparallel one, in the sense that they have the SAME asymptotic statistical accuracy. Neither the non-SA nor the SA estimator is "better" than the other, and usually they will be quite close to each other anyway. Since we would use SA only with large data sets anyway (otherwise, parallel computation would not be needed for speed), the asymptotic aspect should not be an issue. In other words, with SA we achieve the same statistical accuracy while possibly attaining much faster computation.
It is vital to keep in mind that the memory space issue can be just as important as run time. Even if the problem is run on many cores, the run may fail if the total memory space needed exceeds that of the machine.
Wrapper functions, applying SA to the corresponding R function (or function elsewhere in this package):
- calm: Wrapper for lm.
- caglm: Wrapper for glm.
- caprcomp: Wrapper for prcomp.
- cakm: Wrapper for kmeans.
- cameans: Wrapper for colMeans.
- caquantile: Wrapper for quantile.
- caagg: Like distribagg, but finds the average value of FUN across the cluster nodes.
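For simple summaries such as means and quantiles, the averaging step can be illustrated in base R without a cluster. The sketch below is an assumption-laden illustration, not the code of cameans or caquantile: the data are simulated and chunk_id, sa_mean and sa_median are hypothetical names. With equal-size chunks, the average of the chunk means reproduces the overall mean, while the average of the chunk medians only approximates the overall median.

```r
# Illustrative sketch only: averaging chunk-level summaries in base R.
set.seed(2)
n <- 8000
r <- 4                                   # number of chunks (hypothetical)
x <- rexp(n)

# assign each observation to one of r equal-size chunks
chunk_id <- rep(seq_len(r), each = n / r)

# chunk-level summaries
chunk_means <- tapply(x, chunk_id, mean)
chunk_medians <- tapply(x, chunk_id, median)

# average across chunks
sa_mean <- mean(chunk_means)
sa_median <- mean(chunk_medians)

# compare with the full-data values
c(sa_mean = sa_mean, full_mean = mean(x))
c(sa_median = sa_median, full_median = median(x))
```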
A note on NA values: Some R functions such as lm, glm and prcomp have an na.action argument. The default is na.omit, which means that cases with at least one NA value will be discarded. (This is also settable via options().) However, na.omit seems to have no effect in prcomp unless that function's formula option is used. When in doubt, apply the function na.omit directly; e.g. na.omit(d) for a data frame d returns a data frame consisting of only the intact rows of d.
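The behavior described in this note can be checked on a small toy data frame. The values below are made up purely for illustration, and d_ok, fit and res are hypothetical names. The sketch assumes the default na.action setting has not been changed via options().

```r
# Illustrative sketch only: NA handling in lm, prcomp and na.omit.
d <- data.frame(x1 = c(1, 2, NA, 4, 5, 6),
                x2 = c(2.1, 3.9, 6.2, NA, 9.8, 12.1),
                y  = c(1.0, 2.2, 2.9, 4.1, 5.2, 5.8))

# keep only the intact rows (rows 3 and 4 contain an NA)
d_ok <- na.omit(d)
nrow(d_ok)

# lm discards incomplete cases by default
fit <- lm(y ~ x1 + x2, data = d)
nobs(fit)

# prcomp called without a formula does not drop the NA rows,
# so the call fails; try() captures the error
res <- try(prcomp(d), silent = TRUE)
inherits(res, "try-error")

# applying na.omit first avoids the problem
prcomp(d_ok)$sdev
```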
The method assumes that the base estimator is asymptotically normal, and assumes i.i.d. data. If your data set had been stored in some sorted order, it must be randomized first, say using the scramble option in distribsplit or by calling readnscramble, depending on whether your data is already in memory or still in a file.
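Why sorted data must be randomized can be seen in a small base-R simulation. This is a sketch under assumptions, not the scrambling code used by distribsplit or readnscramble: the data are simulated, the estimator is the sample standard deviation, and all variable names are hypothetical. With sorted data each chunk covers only a narrow range of values, so the averaged chunk estimate falls well below the full-data value; after shuffling, the two agree closely.

```r
# Illustrative sketch only: effect of sorted vs. shuffled data on
# chunk-averaged estimates.
set.seed(3)
n <- 6000
r <- 3                                   # number of chunks (hypothetical)
x_sorted <- sort(rnorm(n))               # data stored in sorted order
chunk_id <- rep(seq_len(r), each = n / r)

# chunk standard deviations on the sorted data
sd_sorted <- tapply(x_sorted, chunk_id, sd)

# randomize the order, then chunk again
x_shuffled <- x_sorted[sample(n)]
sd_shuffled <- tapply(x_shuffled, chunk_id, sd)

# chunk-averaged estimates versus the full-data value
c(sorted = mean(sd_sorted),
  shuffled = mean(sd_shuffled),
  full = sd(x_sorted))
```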
Value
R list with these components:
- thts, the results of applying the requested estimator to the chunks; the estimator from chunk i is in row i
- tht, the chunk-averaged overall estimator, if requested
- thtcov, the estimated covariance matrix of tht, if available
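Since the method assumes asymptotic normality, tht and thtcov can be combined into approximate confidence intervals. The sketch below is hypothetical: sa_ci is not a partools function, and the list out is a stand-in built from an ordinary lm fit on the built-in cars data so that the code runs without a cluster. A real call would pass the list returned by ca or one of the wrappers.

```r
# Illustrative sketch only: normal-approximation confidence intervals
# from a list with components tht and thtcov.
# sa_ci is a hypothetical helper, not part of partools.
sa_ci <- function(out, level = 0.95) {
  se <- sqrt(diag(out$thtcov))           # standard errors
  zq <- qnorm(1 - (1 - level) / 2)       # normal quantile
  cbind(estimate = out$tht,
        lower = out$tht - zq * se,
        upper = out$tht + zq * se)
}

# stand-in for SA output, built from a single nonparallel fit
fit <- lm(dist ~ speed, data = cars)
out <- list(tht = coef(fit), thtcov = vcov(fit))

ci <- sa_ci(out)
ci
```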
The wrapper functions return the following list elements:
- calm, caglm: estimated regression coefficients and their estimated covariance matrix
- caprcomp: sdev (square roots of the eigenvalues) and rotation, as with prcomp; thts is returned as well.
- cakm: centers and size, as with kmeans; thts is returned as well.
The wrappers that return thts are useful for algorithms that may expose some instability in the original (i.e. non-SA) algorithm. With prcomp, for instance, the eigenvectors corresponding to the smaller eigenvalues may have high variances in the nonparallel version, which will be reflected in large differences from chunk to chunk in SA, visible in thts. Note that this reflects a fundamental problem with the algorithm on the given data set, not a problem caused by Software Alchemy; on the contrary, an important advantage of the SA approach is that it exposes such problems.
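Chunk-to-chunk instability of this kind can be reproduced in base R with a deliberately constructed case. The sketch is an illustration, not the caprcomp code: the data are simulated so that the two smallest eigenvalues are equal, which makes the corresponding eigenvectors poorly determined, and all variable names are hypothetical. Absolute values of the loadings are compared because the sign of an eigenvector is arbitrary.

```r
# Illustrative sketch only: per-chunk prcomp results on simulated data
# with one large and two equal small eigenvalues.
set.seed(4)
n <- 4000
r <- 4                                   # number of chunks (hypothetical)
x <- cbind(rnorm(n, sd = 10), rnorm(n), rnorm(n))
chunk_id <- rep(seq_len(r), each = n / r)

# run prcomp separately on each chunk
pcs <- lapply(split(as.data.frame(x), chunk_id), prcomp)

# absolute loadings of the first and last components, one row per chunk
load_first <- t(sapply(pcs, function(pc) abs(pc$rotation[, 1])))
load_last <- t(sapply(pcs, function(pc) abs(pc$rotation[, 3])))

# range of each loading across chunks
spread_first <- apply(load_first, 2, function(v) diff(range(v)))
spread_last <- apply(load_last, 2, function(v) diff(range(v)))

spread_first    # first component: loadings agree across chunks
spread_last     # last component: loadings vary across chunks
```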
Author(s)
Norm Matloff
References
Matloff N (2016). "Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones." Journal of Statistical Software, 71(4), 1-15.
Examples
# load the packages and set up a 'parallel' cluster
library(parallel)
library(partools)
cls <- makeCluster(2)
setclsinfo(cls)
# generate simulated test data, as distributed data frame
n <- 10000
p <- 2
tmp <- matrix(rnorm((p+1)*n),nrow=n)
u <- tmp[,1:p] # "X" values
# add a "Y" col
u <- cbind(u,u %*% rep(1,p) + tmp[,p+1])
# now in u, cols 1,2 are the "X" variables, and col 3 is "Y",
# with regression coefs (0,1,1), and tmp[,p+1] as the error term
distribsplit(cls,"u") # form distributed d.f.
# apply the function
#### calm(cls,"u[,3] ~ u[,1]+u[,2]")$tht
calm(cls,"V3 ~ .,data=u")$tht
# check; results should be approximately the same
lm(u[,3] ~ u[,1]+u[,2])
# without the wrapper
ovf <- function(dummy=NULL) lm(V3 ~ .,data=z168)
ca(cls,u,ovf,estf=coef,estcovf=vcov)$tht
## Not run:
# Census data on programmers and engineers; include a quadratic term for
# age, due to nonmonotone relation to income
data(prgeng)
distribsplit(cls,"prgeng")
caout <- calm(cls,"wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng")
caout$tht
# compare to nonparallel
lm(wageinc ~ age+I(age^2)+sex+wkswrkd,data=prgeng)
# get standard errors of the beta-hats
sqrt(diag(caout$thtcov))
# find mean age for all combinations of the cit and sex variables
caagg(cls,"age",c("cit","sex"),"prgeng","mean")
# compare to nonparallel
aggregate(age ~ cit+sex,data=prgeng,mean)
data(newadult)
distribsplit(cls,"newadult")
caglm(cls," gt50 ~ ., family = binomial,data=newadult")$tht
caprcomp(cls,'newadult,scale=TRUE',5)$sdev
prcomp(newadult,scale=TRUE)$sdev
cameans(cls,"prgeng")
cameans(cls,"prgeng[,c('age','wageinc')]")
caquantile(cls,'prgeng$age')
pe <- prgeng[,c(1,3,8)]
distribsplit(cls,"pe")
z1 <- cakm(cls,'pe',3,3); z1$size; z1$centers
# check whether the algorithm is unstable across chunks
z1$thts # looks unstable
pe <- prgeng
pe$ms <- as.integer(pe$educ == 14)
pe$phd <- as.integer(pe$educ == 16)
pe <- pe[,c(1,7,8,9,12,13)]
distribsplit(cls,'pe',scramble=TRUE)
kout <- caknn(cls,'pe[,3]',50,'pe[,-3]')
## End(Not run)
stopCluster(cls)