We can use the chunk option
cache=TRUE to enable cache, and the option
cache.path can be used to set the cache directory. See the options page.
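As a minimal sketch of the two options above, a setup chunk near the top of a document might contain something like the following. The directory name 'cache/' is only an illustration; any path can be used.

```r
library(knitr)

# Turn on caching for all subsequent chunks and choose where the
# cache files are written ('cache/' is an illustrative directory name).
opts_chunk$set(cache = TRUE, cache.path = 'cache/')
```

Individual chunks can still override either option in their own chunk header.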
The cache feature is used extensively in many of my documents, e.g. you can find it in the knitr main manual or its graphics manual. Here are a few more examples:
- basic examples
- automatic dependencies
- Rnw source: knitr-dependson.Rnw
- with the chunk option autodep=TRUE and the function dep_auto(), knitr can figure out the dependencies among chunks automatically, which may save the manual effort of specifying them
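The automatic-dependency setup described in the last item could be sketched as follows. This assumes the code sits in an uncached setup chunk at the top of the document, and that dep_auto() is called with its default arguments; it is an illustration rather than the exact code of the linked example.

```r
library(knitr)

# Cache every chunk and let knitr infer which chunks depend on which,
# based on the variables each chunk creates and uses.
opts_chunk$set(cache = TRUE, autodep = TRUE)

# Build the dependency structure from the information saved in the
# cache directory (called here with its default arguments).
dep_auto()
```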
You have to read the section on cache in the main manual very carefully to understand when cache will be rebuilt and which chunks should not be cached.
Let me repeat the three factors that can affect cache (any change to them will invalidate the old cache):
- all chunk options except include; e.g. switching almost any other option to FALSE will break the old cache, but changing include will not
- R code in a chunk; a tiny change in the R code will lead to removal of old cache, even if it is a change of a space or a blank line
- the R option
It is extremely important to note that usually a chunk that has side-effects should not be cached. Although knitr tries to retain the side-effects from
print(), there are still other side-effects that are not preserved. Here are some cases in which you must not cache a chunk:
- setting R options (e.g. with pdf.options()) or any other options in knitr
- loading packages via library() in a cached chunk when these packages will be used by uncached chunks (it is entirely OK to load packages in a cached chunk and use them only in cached chunks, because knitr saves the list of packages for cached chunks, but uncached chunks are unable to know which packages were loaded in previous cached chunks)
Otherwise next time the chunk will be skipped and all the settings in it will be ignored. You have to use
cache=FALSE explicitly for these chunks.
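The cases above are typically collected into a single setup chunk that is never cached. The sketch below shows the body of such a chunk; it assumes the chunk header sets cache=FALSE (for example <<setup, cache=FALSE>>= in an Rnw document), and the particular option values are illustrative only.

```r
# Body of a setup chunk whose header sets cache=FALSE, so that these
# side-effects are applied again on every run.
library(knitr)                     # packages needed by uncached chunks
options(width = 70)                # an R option (illustrative value)
pdf.options(useDingbats = FALSE)   # a graphics device option (illustrative)
opts_chunk$set(cache = TRUE)       # knitr options for the remaining chunks
```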
Functions such as setGeneric() have the side effect of creating objects in the global environment even if the code is evaluated in a local environment. Before knitr v0.4, it was unable to cache these global objects (e.g. issue #138), but since v0.4 they can be cached too, because knitr checks for newly created objects in
globalenv() and saves them as well.
Although the list of packages used in cached chunks is saved, this is not a perfect way of caching package names: if you loaded a package but removed it later, knitr will be unable to know it (knitr is only able to capture newly loaded packages). You have to manually edit the
__packages file under the cache directory as described in #382.
Even more stuff for cache?
While it seems reasonable for the factors above to affect the cache, reproducible research may be even more rigorous, in the sense that the cache can be invalidated by other changes as well. One typical example is the version of software; it is not impossible for two different versions of R to give you different results. In this case, we may set
    opts_chunk$set(cache.extra = R.version.string)
    opts_chunk$set(cache.extra = R.version) # or even consider platform
so the cached results are only applicable to a specific version of R. When you upgrade R and recompile the document, all the results will be re-computed.
Similarly, you can put more variables into this option so that the cache is preserved only in given environments. Here is an ambitious example:
    ## cache is only valid with a specific version of R and session info
    ## cache will be kept for at most a month (re-compute the next month)
    opts_chunk$set(cache.extra = list(
      R.version, sessionInfo(), format(Sys.Date(), '%Y-%m')
    ))
The issue #238 shows another good use of this option: the cache is associated with the file modification time, i.e. when the data file is modified, the cache will be rebuilt automatically.
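The idea of tying the cache to a data file can be sketched as below. This is not necessarily the exact code from that issue: the file name data.csv is hypothetical, and file.mtime() is the base R function that returns a file's modification time.

```r
library(knitr)

# 'data.csv' is a hypothetical input file. Its modification time becomes
# part of the cache condition, so editing the file invalidates the cache.
opts_chunk$set(cache.extra = file.mtime('data.csv'))
```

The same expression can instead be given in the header of only those chunks that actually read the file.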
Note you can actually use any option name other than
cache.extra to introduce more objects into the cache condition, e.g. you can call it
cache.validation. The reason is that all chunk options are taken into account when validating the cache.
Associate cache directory with the input filename
Sometimes we may want to use different cache directories for different input files by default, and there is one solution in issue #234. However, I still recommend that you set this inside your source document to make it self-contained (use
opts_chunk$set(cache.path = ...)).
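One possible way to do this inside the document is sketched below. It is an assumption-laden illustration rather than the solution from the issue: it relies on knitr's current_input() to obtain the name of the file being knitted, and the 'cache/' prefix and the file name in the comment are made up.

```r
library(knitr)

# current_input() gives the name of the file being knitted;
# it returns NULL when called outside of knit().
input <- current_input()

if (!is.null(input)) {
  # Strip the directory and extension, e.g. 'report.Rnw' -> 'report'
  # (the file name is illustrative).
  stem <- tools::file_path_sans_ext(basename(input))

  # Use one cache directory per input file, e.g. 'cache/report/'.
  opts_chunk$set(cache.path = paste0('cache/', stem, '/'))
}
```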
More granular cache
Besides TRUE and FALSE for the chunk option cache, advanced users can also consider more granular cache by using the numeric values cache = 0, 1, 2, 3. Here 0 is equivalent to FALSE, and 3 is equivalent to TRUE. For cache = 1,
the results of the computation (from
evaluate::evaluate()) are loaded from
the cache, so the code is not evaluated again, but everything else is still
executed, such as the output hooks and saving recorded plots to files. For
cache = 2, it is very similar to
1, and the only difference is that the
recorded plots will not be resaved to files when the plot files already
exist, which might save some time when the plots are big. It is recommended to use
cache = 2 instead of
1, because there is no guarantee that
recorded plots in a previous R session can be safely resaved in another R
session, or using another version of R.
When cache is 1 or 2, only a few chunk options affect the cache; see
knitr:::cache2.opts for the option names.
Basically, the cache will not be invalidated if a chunk option that does not
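affect the code evaluation is changed. For example, switching a display-only option to FALSE, or setting
fig.cap = 'a new caption', leaves the cache intact; however, changing an option that the cache does depend on (e.g. giving it the new value
'bar/') means the cache
has to be rebuilt.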
See example #101 (and its output) for some examples.
In this way, we can separate the computing from document output rendering, and it can be useful to tweak the output without breaking the cache. See #396 and #536.
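A minimal sketch of using the granular levels described above: it sets cache = 2 globally and then prints the option names mentioned earlier. Note that knitr:::cache2.opts is an internal object, so what it contains may differ between knitr versions.

```r
library(knitr)

# Reuse cached results, but do not re-save plot files that already exist.
opts_chunk$set(cache = 2)

# List the option names referred to above. This is an internal object
# (hence the triple colon), so its contents may change between versions.
print(knitr:::cache2.opts)
```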
Reproducibility with RNG
Knitr also caches
.Random.seed and restores it before the evaluation of each chunk, to maintain reproducibility of chunks that involve random number generation (RNG). However, there is a problem here. Suppose chunks A and B have been cached; now if we insert a chunk C between A and B (all three chunks have RNG in them), in theory B should be updated because RNG modifies
.Random.seed as a side-effect, but in fact B will not be updated; in other words, the reproducibility of B is bogus.
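The side-effect mentioned above is easy to observe in plain R. The sketch below uses only base R (the seed value 1 is arbitrary) and compares the RNG state before and after drawing a single number.

```r
set.seed(1)                    # initialize the RNG (arbitrary seed)
seed_before <- .Random.seed    # state before any random draw

x <- rnorm(1)                  # any RNG call advances the state

seed_after <- .Random.seed     # state after the draw
seed_changed <- !identical(seed_before, seed_after)
print(seed_changed)            # TRUE: the state was modified
```

A chunk inserted between two cached chunks changes the state that the later chunk starts from in exactly this way.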
To guarantee reproducibility with RNG, we need to associate
.Random.seed with cache; whenever it is modified, the chunk must be updated. It is easy to do so by using an unevaluated R expression in the
cache.extra option, e.g.
opts_chunk$set(cache.extra = rand_seed)
See ?rand_seed (it is an unevaluated R expression). In this case, each chunk will first check if
.Random.seed has been changed since the last run; a different
.Random.seed will force the current chunk to rebuild cache.