mlazy {mvbutils}    R Documentation
Cacheing objects for lazy-load access
Description
mlazy and friends are designed for handling collections of biggish objects, where only a few of the objects are accessed during any period, and especially where the individual objects might change and the collection might grow or shrink. As with "lazy loading" of packages, and the gdata/ASOR packages, the idea is to avoid the time & memory overhead associated with loading in numerous huge R binary objects when not all will be needed. Unlike lazy loading and gdata, mlazy caches each mlazyed object in a separate file, so it also avoids the overhead that would be associated with changing/adding/deleting objects if all objects lived in the same big file. When a workspace is Saved, the code updates only those individual object files that need updating.
mlazy does not require any special structure for object collections; in particular, the data doesn't have to go into a package. mlazy is particularly useful for users of cd because each cd to/from a task causes a read/write of the binary image file (usually ".RData"), which can be very large if mlazy is not used. Read DETAILS next. Feedback is welcome.
Usage
mlazy( ..., what, envir=.GlobalEnv, save.now=TRUE)
# cache some objects
mtidy( ..., what, envir=.GlobalEnv)
# (cache and) purge the cache to disk, freeing memory
demlazy( ..., what, envir=.GlobalEnv)
# makes 'what' into normal uncached objects
mcachees( envir=.GlobalEnv)
# shows which objects in envir are cached
attach.mlazy( dir, pos=2, name=)
# load mcached workspace into new search environment,
# or create empty s.e. for cacheing
Arguments
... |
unquoted object names, overridden by |
what |
character vector of object names, all from the same environment. For |
envir |
environment or position on the search path, defaulting to the environment where |
save.now |
see DETAILS |
dir |
name of directory, relative to |
pos |
numeric position of environment on search path, 2 or more |
name |
name to give environment, defaulting to something like "data:current.task:dir". |
Value
These functions are used only for their side-effects, except for mcachees, which returns a character vector of object names.
More details
All this is geared to working with saved images (i.e. ".RData" or "all.rda" files) rather than creating all objects anew each session via source. If you use the latter approach, mlazy will probably be of little value.
The easiest way to set up cacheing is just to create your objects as normal, then call
mlazy( <<objname1>>, <<objname2>>, <<etc>>)
Save()
This will not seem to do much immediately: your object can be read and changed as normal, and is still taking up memory. The memory and time savings will come in your next R session in this workspace.
You should never see any differences (except in time & memory usage) between working with cached (AKA mlazyed) and normal uncached objects. [One minor exception is that cacheing a function may stuff up the automatic backup system, or at any rate the "backstop" version of it which runs when you cd. This is deliberate, for speeding up cd. But why would you cache a function anyway?]
mlazy itself doesn't save the workspace image (the ".RData" or "all.rda" file), which is where the references live; that's why you need to call Save periodically. save.image and save will not work properly, and nor will load (see Warning below). Save doesn't store cached objects directly in the ".RData" file, but instead stores the uncached objects as normal in .RData together with a special object called something like .mcache00 (guaranteed not to conflict with one of your own objects). When the .RData file is subsequently reloaded by cd, the presence of the .mcache00 object triggers the creation of "stub" objects that will load the real cached objects from disk when and only when each one is required; the .mcache00 object is then deleted. Cached objects are stored in, and loaded from, individual files called "obj*.rda" (where "*" is a number) in a subdirectory "mlazy".
mlazy and Save do not immediately free any memory, to avoid any unnecessary re-loading from disk if you access the objects again during the current session. To force a "memory purge" during an R session, you need to call mtidy. mtidy purges its arguments from the cache, replacing them by promises just as when loading the workspace; when a reference is next accessed, its cached version will be re-loaded from disk. mtidy can be useful if you are looping over objects and want to limit memory growth: you can mtidy each object as the last statement in the loop. By default, mtidy purges the cache of all objects that have previously been cached. mtidy also caches any formerly uncached arguments, so one call to mtidy can be used instead of mlazy( ...); mtidy( ...).
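The looping pattern can be sketched as below. The helper name summarise_and_purge and its purge argument are hypothetical, introduced only for illustration; the default purge assumes mvbutils is installed and calls mtidy with the what and envir arguments shown under Usage.

```r
# Hypothetical helper: summarise each named object, then purge it from memory
# so that at most one big object is resident at a time.
summarise_and_purge <- function(object_names, envir = .GlobalEnv,
                                purge = function(nm, envir)
                                  mvbutils::mtidy(what = nm, envir = envir)) {
  out <- numeric(length(object_names))
  names(out) <- object_names
  for (nm in object_names) {
    out[nm] <- mean(get(nm, envir = envir))  # touching the object loads it
    purge(nm, envir)                         # last statement in the loop
  }
  out
}
```

Passing the purge step as an argument is only a convenience for trying the loop without a cache; in ordinary use the default would be left alone.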
move understands cached objects, and will shuffle the files accordingly.
demlazy will delete the corresponding "obj*.rda" file(s), so that only an in-memory copy will then exist; don't forget to Save soon after.
Warning
The system function load does not understand cacheing. If you merely load an image file saved using Save, cached objects will not be there, but there will be an extra object called something like .mcache00. Hence, if you have cached objects in your ROOT task, they will not be visible when you start R until you load the mvbutils library (another fine reason to do that in your .First). The .First.lib function in mvbutils calls setup.mcache( .GlobalEnv) to automatically prepare any references in the ROOT task.
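To see what a plain load would give you, the image can be loaded into a scratch environment and listed with hidden names included. This is a base-R sketch only: the file is a stand-in built on the spot, and the contents given to .mcache00 here are hypothetical, not the real format.

```r
# Stand-in for an image written by Save: one ordinary object plus a hidden
# bookkeeping object (hypothetical contents).
img <- tempfile(fileext = ".RData")
local({
  small <- 1:3
  .mcache00 <- c(biggo = 1)
  save(small, .mcache00, file = img)
})

e <- new.env()
load(img, envir = e)
visible_names <- ls(e)                  # ordinary objects only
all_names <- ls(e, all.names = TRUE)    # includes the dot-prefixed object
unlink(img)
```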
Cacheing in other search environments
It is possible to cache in search environments other than the current top one (AKA the current workspace, AKA .GlobalEnv). This could be useful if, for example, you have a large number of simulated datasets that you might need to access, but you don't want them cluttering up .GlobalEnv. If you weren't worried about cacheing, you'd probably do this by calling attach( "<<filename>>"). The cacheing equivalent is attach.mlazy( "cachedir"). The argument is the name of a directory where the cached objects will be (or already are) stored; the directory will be created if necessary. If there is a ".RData" file in the directory, attach.mlazy will load it and set up any references properly; the ".RData" file will presumably contain mostly references to cached data objects, but can contain normal uncached objects too.
Once you have set up a cacheable search environment via attach.mlazy (typically in search position 2), you can cache objects into it using mlazy with the envir argument set (typically to 2). If the objects are originally somewhere else, they will be transferred to envir before cacheing. Whenever you want to save the cached objects, call Save.pos(2).
You will probably also want to modify or create the .First.task (see cd) of the current task so that it calls attach.mlazy("<<cache directory name>>"). Also, you should create a .Last.task (see cd) containing detach(2), otherwise cd(..) and cd(0/...) won't work.
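A minimal sketch of such a pair of hooks follows. The directory name "simcache" is hypothetical, and both functions accept and ignore any arguments because the arguments that cd actually passes are not stated here (see the cd documentation). The bodies assume mvbutils is attached when they run.

```r
# Called when the task is entered: attach the cache directory at position 2.
.First.task <- function(...) {
  attach.mlazy("simcache", pos = 2)   # "simcache" is a hypothetical directory
}

# Called when the task is left: remove the cache environment again, so that
# cd can move to another task.
.Last.task <- function(...) {
  detach(2)
}
```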
Options
By default, mlazy now saves & loads into an auto-created subdirectory called "mlazy". In the earliest releases, though, it saved "obj*.rda" files into the same directory as ".RData". It will now move any "obj*.rda" files that it finds alongside ".RData" into the "mlazy" subdirectory. You can (possibly) override this by setting options( mlazy.subdir=FALSE), but the default is likely more reliable.
By default, there is no way to figure out what object is contained in an "obj*.rda" file without forcibly loading that file or inspecting the .mcache00 object in the "parent" .RData file (not that you should ever need to know). However, if you set options( mlazy.index=TRUE) (recommended), then a file "obj.ind" will be maintained in the "mlazy" directory, showing (object name - value) pairs in plain text (tab-separated). For directories with very large numbers of objects, there may be some speed penalty. If you want to create an index file for an existing "mlazy" directory that lacks one, cd to the task and call mvbutils:::mupdate.mcache.index.if.opt(mlazy.index=TRUE).
See Save for how to set compression options, and save for what you can set them to; options(mvbutils.compression_level=1) may save some time, at the expense of disk space.
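The two settings named above can be made together, for instance at the start of a session. Where best to set them (for example in .First) is an assumption rather than something stated here.

```r
options(
  mlazy.index = TRUE,               # maintain "obj.ind" in the "mlazy" directory
  mvbutils.compression_level = 1    # quicker saves, at the expense of disk space
)
```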
Troubleshooting
In the unlikely event of needing to manually load a cached image file, use load.refdb; cd and attach.mlazy do this automatically.
In the unlikely event of lost/corrupted data, you can manually reload individual "obj*.rda" files using load; each "obj*.rda" file contains one object stored with its correct name. Before doing that, call demlazy( what=mcachees()) to avoid subsequent trouble. Once you have reloaded the objects, you can call mlazy again.
See Options for the easy way to check what object is stored in a particular "obj*.rda" file. If that feature is turned off on your system, the failsafe way is to load the file into a new environment, e.g. e <- new.env(); load( "obj99.rda", e); ls( e).
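The failsafe check can be extended to a whole directory of cache files. The sketch below uses base R only and builds a stand-in directory with one file so that it is self-contained; for a real task you would point cache_dir at the task's "mlazy" subdirectory instead.

```r
# Stand-in cache directory containing a single object file.
cache_dir <- tempfile("mlazy")
dir.create(cache_dir)
local({
  biggo <- matrix(0, 2, 2)
  save(biggo, file = file.path(cache_dir, "obj99.rda"))
})

# Load each "obj*.rda" into its own scratch environment and record the
# name(s) of whatever it holds; load() returns those names.
files <- list.files(cache_dir, pattern = "^obj[0-9]+\\.rda$", full.names = TRUE)
contents <- vapply(files, function(f) {
  e <- new.env()
  paste(load(f, envir = e), collapse = ", ")
}, character(1))
names(contents) <- basename(files)
contents

unlink(cache_dir, recursive = TRUE)
```

Note that this loads every file in turn, so it may be slow and memory-hungry for a large cache; each scratch environment is discarded as soon as its file has been inspected.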
To see how memory changes when you call mlazy and mtidy, call gc().
To check object sizes without actually loading the cached objects, use lsize. Many functions that iterate over all objects in the environment, such as eapply, will cause mlazy objects to be loaded.
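The effect of iterating can be seen with an ordinary base-R promise standing in for a cached object (this sketch does not use mvbutils; the names e and state are illustrative). Listing names leaves the promise alone, whereas eapply forces it.

```r
e <- new.env()
state <- new.env()
state$loaded <- FALSE

# The expression runs only when "x" is first needed.
delayedAssign("x", { state$loaded <- TRUE; runif(10) }, assign.env = e)

ls(e)                          # listing names does not force the promise
before <- state$loaded
invisible(eapply(e, length))   # visiting every object does force it
after <- state$loaded
```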
Housekeeping of "obj*.rda" files happens during Save; any obsolete files (i.e. corresponding to objects that have been removed) are deleted.
Inner workings
What happens: each workspace acquires an mcache attribute, which is a named numeric vector. The absolute values of the entries correspond to files (53 corresponds to a file "obj53.rda", etc.) and the names to objects. When an object myobj is mlazyed, the mcache is augmented by a new element named "myobj" with a new file number, and that file is saved to disk. Also, "myobj" is replaced with an active binding (see makeActiveBinding). The active binding is a function which retrieves or sets the object's data within the function's environment. If the function is called in change-value mode, then it also makes negative the file number in mcache. Hence it's possible to tell whether an object has been changed since last being saved.
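The sign-flipping idea can be imitated in a few lines of base R. This is a toy sketch, not the mvbutils code: the environments ws, book and data_env are illustrative names, and the bookkeeping vector is kept in an ordinary environment rather than in an attribute of the workspace.

```r
ws <- new.env()                 # stands in for the workspace
book <- new.env()               # bookkeeping (mvbutils uses an attribute)
book$mcache <- c(myobj = 53)    # object name -> file number
data_env <- new.env()           # private home of the real data
data_env$value <- 1:5

makeActiveBinding("myobj", function(x) {
  if (missing(x)) return(data_env$value)    # read mode
  data_env$value <- x                       # change-value mode
  # A negative file number marks "changed since last save".
  book$mcache["myobj"] <- -abs(book$mcache["myobj"])
  invisible(x)
}, ws)

ws$myobj                                    # reading leaves the sign alone
sign_before <- sign(book$mcache[["myobj"]])
ws$myobj <- 6:10                            # writing flips the sign
sign_after <- sign(book$mcache[["myobj"]])
```

A saving routine built on this would write out only the entries whose file number is negative and then restore the positive sign.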
When an object is first mlazyed, the object data is placed directly into the active binding function's environment so that the function can find/modify the data. When an object is mtidyed, or when a cached image is loaded from disk, the thing placed into the active binding function's environment is not the data itself, but instead a promise saying, in effect, "fetch me from disk when you need me". The promise gets forced when the object is accessed for reading or writing. This is how "lazy loading" of packages works, and also the gdata package. However, for mlazy there is the additional requirement of being able to determine whether an object has been modified; for efficiency, only modified objects should be written to disk when there is a Save.
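A "fetch me from disk when you need me" promise can be sketched in base R with delayedAssign. This is an illustration only, not how mvbutils itself is written; the file is a temporary stand-in and the counter in state exists just to show when the disk read happens.

```r
# Stand-in for one cached object file.
obj_file <- tempfile(fileext = ".rda")
local({
  biggo <- runif(1000)
  save(biggo, file = obj_file)
})

ws <- new.env()
state <- new.env()
state$loads <- 0

# The promise reads the file the first time "biggo" is touched.
delayedAssign("biggo", local({
  state$loads <- state$loads + 1   # count disk reads
  tmp <- new.env()
  load(obj_file, envir = tmp)
  tmp$biggo
}), assign.env = ws)

loads_before <- state$loads   # nothing read yet
n <- length(ws$biggo)         # first access forces the promise
m <- length(ws$biggo)         # later accesses reuse the loaded value
loads_after <- state$loads
unlink(obj_file)
```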
There is presumably some speed penalty from using a cache, but experience to date suggests that the penalty is small. Cached objects are saved in compressed format, which seems to take a little longer than an uncompressed save, but loading seems pretty quick compared to uncompressed files.
Author(s)
Mark Bravington
See Also
lsize, gc, package gdata, package ASOR
Examples
## Not run:
biggo <- matrix( runif( 1e6), 1000, 1000)
gc() # lots of memory
mlazy( biggo)
gc() # still lots of memory
mtidy( biggo)
gc() # better
biggo[1,1]
gc() # worse; it's been reloaded
## End(Not run)