Speed Ups

When working with very large datasets, you can make use of these tips and tricks to speed up operations on rbiom objects.

Skip Cloning

Functions that modify rbiom objects, like subset() and rarefy(), automatically clone the object before modifying it. This makes them behave as most R users would expect, but the extra copy costs time and memory on large datasets.

Rather than:

biom <- subset(biom, ...)
biom <- rarefy(biom)

Modify biom in place like this:

subset(biom, clone = FALSE, ...)
rarefy(biom, clone = FALSE)

# Or:
biom$metadata %<>% subset(...)
biom$counts %<>% rarefy_cols()
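
To see why the in-place calls above need no assignment, here is a minimal base R sketch that uses a plain environment as a stand-in for an rbiom object. It assumes rbiom objects have reference semantics, as the clone argument implies. The make_obj() and drop_low() functions and the counts field used here are hypothetical illustrations, not part of rbiom.

```r
# A plain environment stands in for an rbiom object
# (assumption: rbiom objects have reference semantics).
make_obj <- function(counts) {
  obj <- new.env()
  obj$counts <- counts
  obj
}

# Hypothetical modifier, not an rbiom function: drops values below `min_count`.
# With clone = TRUE it works on a copy; with clone = FALSE it edits `obj` itself.
drop_low <- function(obj, min_count, clone = TRUE) {
  if (clone) obj <- make_obj(obj$counts)  # the copy is the extra cost
  obj$counts <- obj$counts[obj$counts >= min_count]
  invisible(obj)
}

a <- make_obj(c(1, 5, 10))
b <- drop_low(a, 5)            # `a` is untouched; `b` holds the result
drop_low(a, 5, clone = FALSE)  # `a` itself is modified; no assignment needed
```

With clone = TRUE the caller must capture the return value, while with clone = FALSE the original object changes and the copy is never made.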

Drop Components

Sequences

Reference sequences for OTUs will be imported along with the rest of your dataset and stored in $sequences. However, rbiom doesn't currently use these sequences for anything (except writing them back out with write_biom() or write_fasta()).

You can delete them from your rbiom object with:

biom$sequences <- NULL

Tree

The phylogenetic reference tree for OTUs is only used for calculating UniFrac distances. If you aren't using UniFrac, the tree can be dropped from the rbiom object with:

biom$tree <- NULL

Alternatively, you can store the tree separately from the rbiom object and provide it to just the functions that use it. For example:

tree <- biom$tree
biom$tree <- NULL
dm <- bdiv_distmat(biom, 'unifrac', tree = tree)

Increase Caching

Caching is enabled by default, with a limit of 20 MB per R session.

For large datasets, increasing the cache size can help. The size is specified in bytes by an R option or environment variable.

options(rbiom.cache_size=200 * 1024 ^ 2) # 200 MB

# Or:
Sys.setenv(RBIOM_CACHE_SIZE=1024 ^ 3)    # 1 GB
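
Because the size is given in bytes, a small conversion helper can make the intent easier to read. This is only a sketch: mb_to_bytes() is a hypothetical helper, not part of rbiom.

```r
# Hypothetical helper (not part of rbiom): convert megabytes to bytes,
# using the same 1024^2 factor as the example above.
mb_to_bytes <- function(mb) mb * 1024^2

cache_bytes <- mb_to_bytes(200)          # same value as 200 * 1024 ^ 2
options(rbiom.cache_size = cache_bytes)

# Confirm the value the current session now holds for the option.
getOption("rbiom.cache_size")
```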

You can also specify a cache directory where results can be preserved from one R session to the next.

options(rbiom.cache_dir=tools::R_user_dir("rbiom", "cache"))

# Or:
Sys.setenv(RBIOM_CACHE_DIR="~/rbiom_cache")

Other quick notes about caching:

Setting the cache directory to "FALSE" will disable caching.
R options will override environment variables.
The key hash algorithm can be set with options(rbiom.cache_hash=rlang::hash).
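
The precedence rule in the notes above can be pictured with a short base R sketch. The get_setting() function is a hypothetical illustration of "option first, then environment variable" and is not rbiom's actual lookup code.

```r
# Hypothetical lookup, not rbiom's actual code: an R option, if set,
# takes precedence over the matching environment variable.
get_setting <- function(option_name, env_name, default = NULL) {
  opt <- getOption(option_name)
  if (!is.null(opt)) return(opt)
  env <- Sys.getenv(env_name, unset = NA)
  if (!is.na(env)) return(env)
  default
}

options(rbiom.cache_dir = NULL)                # start with the option unset
Sys.setenv(RBIOM_CACHE_DIR = "~/rbiom_cache")
from_env <- get_setting("rbiom.cache_dir", "RBIOM_CACHE_DIR")

# Per the notes above, the string "FALSE" disables caching.
options(rbiom.cache_dir = "FALSE")
from_option <- get_setting("rbiom.cache_dir", "RBIOM_CACHE_DIR")

# Clean up so the sketch leaves the session as it found it.
options(rbiom.cache_dir = NULL)
Sys.unsetenv("RBIOM_CACHE_DIR")
```

Here from_env picks up the environment variable because no option is set, while from_option returns the option's value even though the environment variable is still defined.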

Summary Layers

The figure-generating functions can display every data point. However, when you have thousands of data points, rendering every single one can be slow. Instead, use the layers parameter to select summary layers (such as bars, trend lines, or ellipses) in place of individual points.

adiv_boxplot(biom, layers = "bl")  # bar, linerange
adiv_corrplot(biom, layers = "tc") # trend, confidence
bdiv_ord_plot(biom, layers = "e")  # ellipse