biglasso-package {biglasso}  R Documentation
Extending Lasso Model Fitting to Big Data
Description
Extends lasso and elastic-net linear, logistic, and Cox regression models to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into available RAM. The package utilizes memory-mapped files to store the massive data on disk, reading it into memory only when necessary during model fitting. Moreover, advanced feature screening rules are implemented to accelerate model fitting. As a result, this package is much more memory- and computation-efficient and highly scalable compared to existing lasso-fitting packages such as glmnet and ncvreg, allowing powerful big data analysis even on an ordinary laptop.
Details
Package:  biglasso
Type:  Package
Version:  1.4-1
Date:  2021-01-29
License:  GPL-3
Penalized regression models, in particular the lasso, have been extensively applied to analyzing high-dimensional data sets. However, due to memory limits, existing R packages are not capable of fitting lasso models for ultrahigh-dimensional, multi-gigabyte data sets, which are increasingly seen in many areas such as genetics, biomedical imaging, genome sequencing, and high-frequency finance.
This package aims to fill the gap by extending lasso model fitting to Big Data in R. Version >= 1.2-3 represents a major redesign in which the source code was converted to C++ (previously in C), and new feature screening rules, as well as OpenMP parallel computing, were implemented. Some key features of biglasso are summarized below:
it utilizes memory-mapped files to store the massive data on disk, loading data into memory only when necessary during model fitting. Consequently, it is able to seamlessly handle data-larger-than-RAM cases.
it is built upon a pathwise coordinate descent algorithm with warm starts, active set cycling, and feature screening strategies, which has been proven to be one of the fastest lasso solvers.
it incorporates our newly developed hybrid and adaptive screening rules, which outperform state-of-the-art screening rules such as the sequential strong rule (SSR) and the sequential EDPP rule (SEDPP) with an additional 1.5x to 4x speedup.
the implementation is designed to be as memory-efficient as possible by eliminating extra copies of the data created by other R packages, making it at least 2x more memory-efficient than glmnet.
the underlying computation is implemented in C++, and parallel computing with OpenMP is also supported.
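The fitting workflow these features describe can be seen on a small scale. Below is a minimal sketch (not taken from the package's own examples) that fits a lasso path to simulated data small enough to live in memory, using as.big.matrix() to supply the required big.matrix input; the simulated dimensions and seed are illustrative choices:

```r
## Minimal sketch: fit a lasso path with biglasso on simulated data.
## Assumes the bigmemory and biglasso packages are installed.
library(bigmemory)
library(biglasso)

set.seed(1)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(2, 5)) + rnorm(n)  # 5 true nonzero coefficients

X.bm <- as.big.matrix(X)                  # biglasso requires a big.matrix
fit  <- biglasso(X.bm, y, family = "gaussian")
plot(fit)                                 # coefficient paths over lambda
```

For data that fits in RAM this is interchangeable with glmnet; the memory-mapped design only pays off as the matrix grows past available memory, as in the Examples below.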
For more information:
- Benchmarking results: https://github.com/pbreheny/biglasso
- Tutorial: https://pbreheny.github.io/biglasso/articles/biglasso.html
- Technical paper: https://arxiv.org/abs/1701.05936
Note
The input design matrix X must be a bigmemory::big.matrix() object. This can be created by the function as.big.matrix() in the R package bigmemory.
If the data (design matrix) is very large (e.g. 10 GB) and stored in an external file, which is often the case for big data, X can be created by calling the function setupX().
In this case, there are several restrictions on the data file:
the data file must be a well-formatted ASCII file, with each row corresponding to an observation and each column to a variable;
the data file must contain only one data type; the current version supports only the double type;
the data file must contain only numeric variables. If there are categorical variables, the user needs to create dummy variables for each categorical variable (by adding additional columns).
Future versions will try to address these restrictions.
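Since the file read by setupX() must be entirely numeric, categorical variables have to be expanded into dummy columns before the file is written. One way to do this (a sketch using base R, not from the package documentation; the data frame and file name are invented for illustration) is with model.matrix():

```r
## Sketch: expand a categorical variable into 0/1 dummy columns before
## writing an all-numeric design matrix to an ASCII file for setupX().
df <- data.frame(
  age   = c(21, 35, 48, 52),
  group = factor(c("a", "b", "c", "a"))
)

## model.matrix() creates one 0/1 column per non-reference factor level;
## drop the intercept column so only predictor columns remain.
X.num <- model.matrix(~ ., data = df)[, -1]

write.table(X.num, "x_numeric.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE)
```

The resulting tab-delimited file contains a single numeric type throughout, satisfying the restrictions listed above.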
Let n and p denote, respectively, the number of observations and the number of variables. It is worth noting that the package is more suitable for wide data (ultrahigh-dimensional, p >> n) than for long data (n >> p). This is because the model fitting algorithm takes advantage of the sparsity assumption of high-dimensional data. To give the user some idea, below are some benchmarking results of the total computing time (in seconds) for solving lasso-penalized linear regression along a sequence of 100 values of the tuning parameter. In all cases, the true model contains 20 nonzero coefficients equal to +/- 2. (Based on version 1.2-3; screening rule "SSR-BEDPP" is used.)
For the wide data case (p > n), n = 1,000:

p                 1,000    10,000   100,000   1,000,000
Size of X         9.5 MB   95 MB    950 MB    9.5 GB
Elapsed time (s)  0.11     0.83     8.47      85.50
For the long data case (n >> p), p = 1,000:

n                 1,000    10,000   100,000   1,000,000
Size of X         9.5 MB   95 MB    950 MB    9.5 GB
Elapsed time (s)  2.50     11.43    83.69     1090.62
Author(s)
Yaohui Zeng, Chuyi Wang, Tabitha Peter, and Patrick Breheny
References
Zeng, Y., and Breheny, P. (2017). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. https://arxiv.org/abs/1701.05936.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2), 245-266.
Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pp. 1070-1078.
Xiang, Z. J., and Ramadge, P. J. (2012). Fast lasso screening tests based on correlations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 2137-2140). IEEE.
Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pp. 1053-1061.
Examples
## Not run:
## Example of reading data from external big data file, fit lasso model,
## and run cross validation in parallel
# simulated design matrix, 1,000 observations, 500,000 variables, ~ 5 GB
# there are 10 true variables with nonzero coefficient 2
xfname <- 'x_e3_5e5.txt'
yfname <- 'y_e3_5e5.txt' # response vector
time <- system.time(
  X <- setupX(xfname, sep = '\t') # create backing files (.bin, .desc)
)
print(time) # ~ 7 minutes; this is a one-time operation
dim(X)
# the big.matrix can then be retrieved via its descriptor file (.desc) in any new R session
rm(X)
xdesc <- 'x_e3_5e5.desc'
X <- attach.big.matrix(xdesc)
dim(X)
y <- as.matrix(read.table(yfname, header = FALSE))
time.fit <- system.time(
  fit <- biglasso(X, y, family = 'gaussian', screen = 'Hybrid')
)
print(time.fit) # ~ 44 seconds for fitting a lasso model along the entire solution path
# cross-validation in parallel
seed <- 1234
time.cvfit <- system.time(
  cvfit <- cv.biglasso(X, y, family = 'gaussian', screen = 'Hybrid',
                       seed = seed, ncores = 4, nfolds = 10)
)
print(time.cvfit) # ~ 3 minutes for 10-fold cross-validation
plot(cvfit)
summary(cvfit)
## End(Not run)