Introduction
State of the Field
This analysis provides a comparative benchmark of R packages designed for calculating standard and phylogenetic metrics of alpha and beta diversity. The primary objective is to evaluate their computational efficiency, with a focus on processing speed and memory allocation. Packages that rely on these foundational libraries as dependencies have been omitted from this study to isolate the performance of the core implementations.
| R Package | Classic alpha/beta | Phylogenetic alpha/beta |
|---|---|---|
| abdiv 0.2.0 | Serial R | none |
| adiv 2.2.1 | Serial R | Serial R |
| ampvis2 2.8.11 | vegan | Serial R |
| ecodist 2.1.3 | Serial C/R | none |
| ecodive 2.2.6 | Parallel C | Parallel C |
| entropart 1.6-16 | Serial R | none |
| GUniFrac 1.9 | none | Serial C |
| labdsv 2.3-1 | Serial FORTRAN | none |
| OmicFlow 1.5.1 | Parallel C++ | Parallel C++ |
| parallelDist 0.2.7 | Parallel C++ | none |
| philentropy 0.10.0 | Parallel C++ | none |
| phyloregion 1.0.9 | vegan | Serial R |
| phyloseq 1.56.0 | vegan | Parallel R |
| picante 1.8.2 | vegan | Serial R |
| tabula 3.3.2 | Serial R | none |
| vegan 2.7-3 | Serial C | none |
Methodology
The bench R package was employed to quantify the
computational runtime and memory allocation for the diversity algorithms
within each of the 15 selected packages. All benchmarks were executed on
a host system with the following hardware and software
configuration:
CPU: 6-Core Intel i5-9600K @ 3.70GHz
RAM: 64.0 GB
OS: Windows 11 Pro (64-bit, Version 25H2, Build 26200.8246)
Furthermore, the bench::mark() function was utilized to
verify that the outputs from all benchmarked expressions were
numerically equivalent, ensuring the consistency and comparability of
the results.
Setup
Two standard datasets from the rbiom R package,
hmp50 and gems, were selected for this
evaluation. The hmp50 dataset, which includes 50 samples
and an associated phylogenetic tree, was used to benchmark the
computationally intensive phylogenetic metrics, such as UniFrac and
Faith’s PD. For the traditional diversity metrics, which are
significantly less demanding, the larger gems dataset,
comprising 1,006 samples, was employed.
To account for the heterogeneous input and output formats across the 15 R packages, necessary data transformations were performed. To ensure that the benchmarks exclusively measured the performance of the diversity calculations, these data conversion steps were executed outside of the timed code blocks whenever possible.
Click to reveal R code.
install.packages('pak')
# Tools and Datasets for Benchmarking Report
pak::pkg_install(pkg = c(
'bench', 'dplyr', 'ggplot2', 'ggrepel', 'rbiom', 'svglite' ))
# Diversity Metric Implementations
pak::pkg_install(pkg = c(
'abdiv', 'adiv', 'ecodist', 'ecodive', 'entropart', 'GUniFrac',
'kasperskytte/ampvis2', 'labdsv', 'OmicFlow', 'parallelDist',
'philentropy', 'phyloregion', 'phyloseq', 'picante', 'tabula', 'vegan' ))
# Software Versions
version$version.string
#> [1] "R version 4.6.0 (2026-04-24 ucrt)"
data.frame(ver = sapply(FUN = packageDescription, fields = 'Version', c(
'bench', 'dplyr', 'ggplot2', 'ggrepel', 'rbiom',
'abdiv', 'adiv', 'ecodist', 'ecodive', 'entropart', 'GUniFrac',
'ampvis2', 'labdsv', 'OmicFlow', 'parallelDist', 'philentropy',
'phyloregion', 'phyloseq', 'picante', 'tabula', 'vegan' )))
#> ver
#> bench 1.1.4
#> dplyr 1.2.1
#> ggplot2 4.0.3
#> ggrepel 0.9.8
#> rbiom 3.1.0
#> abdiv 0.2.0
#> adiv 2.2.1
#> ecodist 2.1.3
#> ecodive 2.2.6
#> entropart 1.6-16
#> GUniFrac 1.9
#> ampvis2 2.8.11
#> labdsv 2.3-1
#> OmicFlow 1.5.1
#> parallelDist 0.2.7
#> philentropy 0.10.0
#> phyloregion 1.0.9
#> phyloseq 1.56.0
#> picante 1.8.2
#> tabula 3.3.2
#> vegan 2.7-3
library(bench)
library(ggplot2)
library(ggrepel)
library(dplyr)
library(Matrix)
n_cpus <- ecodive::n_cpus()
print(n_cpus)
#> [1] 6
# for philentropy
RcppParallel::setThreadOptions(numThreads = n_cpus)
# abdiv only accepts two samples at a time
pairwise <- function (f, data, ...) {
pairs <- utils::combn(nrow(data), 2)
structure(
mapply(
FUN = function (i, j) f(data[i,], data[j,], ...),
i = pairs[1,], j = pairs[2,] ),
class = 'dist',
Labels = rownames(data),
Size = nrow(data),
Diag = FALSE,
Upper = FALSE )
}
# Remove any extraneous attributes from dist objects,
# allowing them to be compared with `all.equal()`.
cleanup <- function (x) {
for (i in setdiff(names(attributes(x)), c('class', 'Labels', 'Size', 'Diag', 'Upper')))
attr(x, i) <- NULL
return (x)
}
# HMP50 dataset has 50 samples
hmp50 <- rbiom::hmp50
hmp50$counts <- hmp50$counts[hmp50$tree$tip.label,]
hmp50_phy <- rbiom::convert_to_phyloseq(hmp50)
hmp50_mtx_t <- as.matrix(hmp50)
hmp50_mtx <- t(hmp50_mtx_t)
hmp50_mtx_pt <- apply(hmp50_mtx, 1L, function (x) x / sum(x))
hmp50_mtx_p <- t(hmp50_mtx_pt)
hmp50_dgC <- as(hmp50_mtx, 'dgCMatrix')
hmp50_dgC_t <- as(hmp50_mtx_t, 'dgCMatrix')
hmp50_dgC_pt <- as(hmp50_mtx_pt, 'dgCMatrix')
hmp50_tree <- hmp50$tree
# GEMS dataset has 1006 samples
gems <- rbiom::gems
gems_mtx_t <- as.matrix(gems)
gems_mtx <- t(gems_mtx_t)
gems_mtx_pt <- apply(gems_mtx, 1L, function (x) x / sum(x))
gems_mtx_p <- t(gems_mtx_pt)
gems_dgC_t <- as(gems_mtx_t, 'dgCMatrix')
gems_dgC_pt <- as(gems_mtx_pt, 'dgCMatrix')
gems_bool <- gems_mtx > 0
## Bray-Curtis Dissimilarity
bray_curtis_res <- bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::bray_curtis, gems_mtx)),
'ecodist' = cleanup(ecodist::bcdist(gems_mtx)),
'ecodive' = cleanup(ecodive::bray(gems_dgC_t, margin = 2)),
'labdsv' = cleanup(labdsv::dsvdis(gems_mtx, 'bray/curtis')),
'OmicFlow' = cleanup(OmicFlow::bray(gems_dgC_t, threads=n_cpus)),
'parallelDist' = cleanup(parallelDist::parallelDist(gems_mtx, 'bray')),
'philentropy' = cleanup(philentropy::distance(gems_mtx, 'sorensen', test.na = FALSE, use.row.names = TRUE, as.dist.obj = TRUE, mute.message = TRUE)),
'tabula' = cleanup(1 - pairwise(tabula::index_bray, gems_mtx)),
'vegan' = cleanup(vegan::vegdist(gems_mtx, 'bray')) )
print(bray_curtis_res[,1:9] %>% arrange(median))
#> # A tibble: 9 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 19.25ms 19.68ms 50.5 3.92MB 0 10 0 197.88ms
#> 2 philentropy 159.63ms 161.3ms 6.21 38.87MB 0 10 0 1.61s
#> 3 parallelDist 194.45ms 197.64ms 5.03 7.8MB 0 10 0 1.99s
#> 4 OmicFlow 228.11ms 231.81ms 4.17 28.97MB 0 10 0 2.4s
#> 5 ecodist 433.57ms 436.77ms 2.28 29.44MB 0 10 0 4.38s
#> 6 labdsv 830.26ms 848.67ms 1.08 68.12MB 0 10 0 9.22s
#> 7 vegan 1.97s 1.98s 0.503 16.45MB 0 10 0 19.88s
#> 8 abdiv 14.57s 15.2s 0.0655 20.43GB 0.472 10 72 2.54m
#> 9 tabula 27.66s 29.2s 0.0340 20.44GB 0.238 10 70 4.91m
## Jaccard Distance
jaccard_res <- bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::jaccard, gems_mtx)),
'ecodist' = cleanup(ecodist::distance(gems_mtx, 'jaccard')),
'ecodive' = cleanup(ecodive::jaccard(gems_dgC_t, margin = 2)),
'OmicFlow' = cleanup(OmicFlow::jaccard(gems_dgC_t, weighted = FALSE, threads=n_cpus)),
'parallelDist' = cleanup(parallelDist::parallelDist(gems_mtx, 'binary')),
'philentropy' = cleanup(philentropy::distance(gems_bool, 'jaccard', test.na = FALSE, use.row.names = TRUE, as.dist.obj = TRUE, mute.message = TRUE)),
'phyloregion' = cleanup(phyloregion::beta_diss(gems_mtx, 'jaccard')$beta.jac),
'stats' = cleanup(stats::dist(gems_mtx, 'binary')),
'vegan' = cleanup(vegan::vegdist(gems_mtx, 'jaccard', binary = TRUE)) )
print(jaccard_res[,1:9] %>% arrange(median))
#> # A tibble: 9 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 16.77ms 17.28ms 57.5 3.87MB 0 10 0 173.9ms
#> 2 phyloregion 136.46ms 137.73ms 7.24 194.64MB 0 10 0 1.38s
#> 3 philentropy 175.83ms 177.32ms 5.58 41.64MB 0 10 0 1.79s
#> 4 parallelDist 314.2ms 316.97ms 3.15 7.71MB 0 10 0 3.18s
#> 5 OmicFlow 339.03ms 342.38ms 2.92 29.29MB 0 10 0 3.42s
#> 6 vegan 1.87s 1.9s 0.519 63.68MB 0 10 0 19.25s
#> 7 stats 2.88s 2.89s 0.346 7.73MB 0 10 0 28.91s
#> 8 abdiv 38.3s 1.28m 0.0120 24.89GB 0.0227 10 19 13.93m
#> 9 ecodist 2.06m 2.19m 0.00746 47.89GB 0.0738 10 99 22.36m
## Manhattan Distance
manhattan_res <- bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::manhattan, gems_mtx)),
'ecodist' = cleanup(ecodist::distance(gems_mtx, 'manhattan')),
'ecodive' = cleanup(ecodive::manhattan(gems_dgC_t, norm = 'none', margin = 2)),
'OmicFlow' = cleanup(OmicFlow::manhattan(gems_dgC_t, threads=n_cpus)),
'parallelDist' = cleanup(parallelDist::parallelDist(gems_mtx, 'manhattan')),
'philentropy' = cleanup(philentropy::distance(gems_mtx, 'manhattan', test.na = FALSE, use.row.names = TRUE, as.dist.obj = TRUE, mute.message = TRUE)),
'stats' = cleanup(stats::dist(gems_mtx, 'manhattan')),
'vegan' = cleanup(vegan::vegdist(gems_mtx, 'manhattan')) )
print(manhattan_res[,1:9] %>% arrange(median))
#> # A tibble: 8 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 20.74ms 21.12ms 47.1 3.86MB 0 10 0 212.18ms
#> 2 philentropy 152.21ms 153.46ms 6.50 38.69MB 0 10 0 1.54s
#> 3 parallelDist 159.71ms 163.01ms 6.10 7.71MB 0 10 0 1.64s
#> 4 OmicFlow 218.88ms 221.43ms 4.52 28.97MB 0 10 0 2.21s
#> 5 vegan 1.7s 1.71s 0.585 13.5MB 0 10 0 17.11s
#> 6 stats 2.33s 2.34s 0.427 7.71MB 0 10 0 23.43s
#> 7 ecodist 33.88s 1.27m 0.0137 21.8GB 0.0603 10 44 12.17m
#> 8 abdiv 1.65m 2.21m 0.00751 17.52GB 0.00901 10 12 22.2m
## Euclidean Distance
euclidean_res <- bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::euclidean, gems_mtx)),
'ecodist' = cleanup(ecodist::distance(gems_mtx, 'euclidean')),
'ecodive' = cleanup(ecodive::euclidean(gems_dgC_t, norm = 'none', margin = 2)),
'parallelDist' = cleanup(parallelDist::parallelDist(gems_mtx, 'euclidean')),
'philentropy' = cleanup(philentropy::distance(gems_mtx, 'euclidean', test.na = FALSE, use.row.names = TRUE, as.dist.obj = TRUE, mute.message = TRUE)),
'stats' = cleanup(stats::dist(gems_mtx, 'euclidean')),
'vegan' = cleanup(vegan::vegdist(gems_mtx, 'euclidean')) )
print(euclidean_res[,1:9] %>% arrange(median))
#> # A tibble: 7 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 20.66ms 21ms 47.3 3.86MB 0 10 0 211.44ms
#> 2 philentropy 151.73ms 154.51ms 6.45 38.69MB 0 10 0 1.55s
#> 3 parallelDist 160.26ms 161.71ms 6.16 7.71MB 0 10 0 1.62s
#> 4 vegan 1.7s 1.71s 0.584 13.5MB 0 10 0 17.13s
#> 5 stats 2.35s 2.35s 0.423 7.71MB 0 10 0 23.62s
#> 6 abdiv 1.38m 1.5m 0.0106 17.52GB 0.0255 10 24 15.67m
#> 7 ecodist 1.19m 1.56m 0.00946 21.8GB 0.0397 10 42 17.62m
## Shannon Diversity Index
shannon_res <- bench::mark(
iterations = 10,
'abdiv' = apply(gems_mtx_p, 1L, abdiv::shannon),
'adiv' = adiv::speciesdiv(gems_mtx_p, 'Shannon')[,1],
'ecodive' = ecodive::shannon(gems_dgC_pt, margin = 2),
'entropart' = apply(gems_mtx_p, 1L, entropart::Shannon, CheckArguments = FALSE),
'OmicFlow' = OmicFlow::diversity(gems_dgC_t, metric = 'shannon'),
'philentropy' = apply(gems_mtx_p, 1L, philentropy::H, unit = 'log'),
'tabula' = apply(gems_mtx_p, 1L, tabula::index_shannon),
'vegan' = vegan::diversity(gems_mtx_p, 'shannon') )
print(shannon_res[,1:9] %>% arrange(median))
#> # A tibble: 8 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 OmicFlow 2.67ms 2.74ms 363. 1.57MB 0 10 0 27.55ms
#> 2 ecodive 5.76ms 5.95ms 166. 15.97KB 0 10 0 60.19ms
#> 3 tabula 25.12ms 25.44ms 39.2 28.17MB 0 10 0 254.82ms
#> 4 entropart 36.97ms 37.44ms 26.7 41.77MB 0 10 0 374.71ms
#> 5 adiv 38.84ms 39.64ms 25.2 125.31MB 0 10 0 396.53ms
#> 6 vegan 61.02ms 61.38ms 16.3 68.08MB 0 10 0 614.84ms
#> 7 abdiv 71.28ms 71.72ms 13.7 95.38MB 0 10 0 731.2ms
#> 8 philentropy 1.28s 1.28s 0.780 52.67MB 0.780 5 5 6.41s
## Gini-Simpson Index
simpson_res <- bench::mark(
iterations = 10,
'abdiv' = apply(gems_mtx_p, 1L, abdiv::simpson),
'adiv' = adiv::speciesdiv(gems_mtx_p, 'GiniSimpson')[,1],
'ecodive' = ecodive::simpson(gems_dgC_pt, margin = 2),
'entropart' = apply(gems_mtx_p, 1L, entropart::Simpson, CheckArguments = FALSE),
'OmicFlow' = OmicFlow::diversity(gems_dgC_t, metric = 'simpson'),
'tabula' = 1 - apply(gems_mtx_p, 1L, tabula::index_simpson),
'vegan' = vegan::diversity(gems_mtx_p, 'simpson') )
print(simpson_res[,1:9] %>% arrange(median))
#> # A tibble: 7 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 OmicFlow 817.4µs 861.75µs 1081. 1.11MB 0 10 0 9.25ms
#> 2 ecodive 5.93ms 6.26ms 144. 14.56KB 0 10 0 69.48ms
#> 3 abdiv 16.89ms 17.01ms 58.8 32.64MB 0 10 0 169.99ms
#> 4 tabula 21.66ms 21.98ms 45.5 27.57MB 0 10 0 219.87ms
#> 5 adiv 31.14ms 31.29ms 30.8 50.24MB 0 10 0 324.17ms
#> 6 vegan 37.29ms 37.73ms 26.4 62.15MB 0 10 0 378.1ms
#> 7 entropart 38.98ms 39.28ms 25.5 47.54MB 0 10 0 392.49ms
## Faith's Phylogenetic Diversity
faith_res <- bench::mark(
iterations = 10,
check = FALSE, # entropart has incorrect output on non-ultrametric tree
'abdiv' = apply(hmp50_mtx, 1L, abdiv::faith_pd, hmp50_tree),
'adiv' = apply(hmp50_mtx, 1L, \(x) adiv::EH(hmp50_tree, colnames(hmp50_mtx)[x > 0])),
'ecodive' = ecodive::faith(hmp50_dgC_t, hmp50_tree, margin = 2),
'entropart' = apply(hmp50_mtx, 1L, entropart::PDFD, hmp50_tree, CheckArguments = FALSE),
'phyloregion' = phyloregion::PD(hmp50_mtx, hmp50_tree),
'picante' = as.matrix(picante::pd(hmp50_mtx, hmp50_tree))[,'PD'] )
print(faith_res[,1:9] %>% arrange(median))
#> # A tibble: 6 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 2.62ms 2.83ms 346. 594.52KB 0 10 0 28.9ms
#> 2 phyloregion 13.92ms 14.43ms 68.0 1.54MB 0 10 0 147.1ms
#> 3 picante 91.96ms 92.7ms 10.7 69.57MB 0 10 0 934.6ms
#> 4 abdiv 1.19s 4.46s 0.251 651.56MB 0.0251 10 1 39.8s
#> 5 adiv 2.19s 5.34s 0.172 489.08MB 0.0172 10 1 58.3s
#> 6 entropart 43.55s 1.1m 0.0154 23.95GB 0.0508 10 33 10.8m
## Unweighted UniFrac
u_unifrac_res <- rbind(
local({
# cluster for phyloseq
cl <- parallel::makeCluster(n_cpus)
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl))
bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::unweighted_unifrac, hmp50_mtx, hmp50_tree)),
'ecodive' = cleanup(ecodive::unweighted_unifrac(hmp50_dgC_t, hmp50_tree, margin = 2)),
'GUniFrac' = cleanup(as.dist(GUniFrac::GUniFrac(hmp50_mtx, hmp50_tree, alpha=1, verbose=FALSE)[[1]][,,2])),
'OmicFlow' = cleanup(OmicFlow::unifrac(hmp50_dgC_pt, hmp50_tree, weighted=FALSE, threads=n_cpus)),
'phyloregion' = cleanup(phyloregion::unifrac(hmp50_dgC, hmp50_tree)),
'phyloseq' = cleanup(phyloseq::UniFrac(hmp50_phy, weighted=FALSE, normalized=FALSE, parallel=TRUE)),
'picante' = cleanup(picante::unifrac(hmp50_mtx, hmp50_tree)) )
}),
# ampvis2 conflicts with phyloseq cluster, so run separately
local({
bench::mark(
iterations = 10,
'ampvis2' = {
cleanup(ampvis2:::dist.unifrac(hmp50_mtx_t, hmp50_tree, weighted=FALSE, normalise=FALSE, num_threads=n_cpus))
doParallel::stopImplicitCluster() } )
})
)
print(u_unifrac_res[,1:9] %>% arrange(median))
#> # A tibble: 8 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 6.22ms 6.89ms 115. 697.21KB 0 10 0 86.61ms
#> 2 phyloregion 6.61ms 6.92ms 145. 150.43MB 0 10 0 69ms
#> 3 OmicFlow 6.69ms 6.98ms 141. 4.05MB 0 10 0 70.84ms
#> 4 GUniFrac 82.55ms 84.11ms 10.6 114.09MB 2.12 10 2 943.09ms
#> 5 phyloseq 311.95ms 317.52ms 3.02 50.87MB 0.302 10 1 3.31s
#> 6 picante 2.51s 2.58s 0.381 1.8GB 1.22 10 32 26.22s
#> 7 ampvis2 3.52s 3.61s 0.277 54.18MB 0.0308 9 1 32.44s
#> 8 abdiv 14.45s 14.79s 0.0655 20.06GB 2.46 10 375 2.54m
## Weighted UniFrac
w_unifrac_res <- rbind(
local({
# cluster for phyloseq
cl <- parallel::makeCluster(n_cpus)
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl))
bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::weighted_unifrac, hmp50_mtx, hmp50_tree)),
'ecodive' = cleanup(ecodive::weighted_unifrac(hmp50_dgC_t, hmp50_tree, margin=2)),
'OmicFlow' = cleanup(OmicFlow::unifrac(hmp50_dgC_pt, hmp50_tree, weighted=TRUE, normalized=FALSE, threads=n_cpus)),
'phyloseq' = cleanup(phyloseq::UniFrac(hmp50_phy, weighted=TRUE, normalized=FALSE, parallel=TRUE)) )
}),
# ampvis2 conflicts with phyloseq cluster, so run separately
local({
bench::mark(
iterations = 10,
'ampvis2' = {
cleanup(ampvis2:::dist.unifrac(hmp50_mtx_t, hmp50_tree, weighted=TRUE, normalise=FALSE, num_threads=n_cpus))
doParallel::stopImplicitCluster() } )
})
)
print(w_unifrac_res[,1:9] %>% arrange(median))
#> # A tibble: 5 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 OmicFlow 4.09ms 4.31ms 228. 98.7KB 0 10 0 43.83ms
#> 2 ecodive 6.02ms 6.25ms 160. 117.4KB 0 10 0 62.66ms
#> 3 phyloseq 302.25ms 305.79ms 3.24 49.4MB 0.324 10 1 3.09s
#> 4 ampvis2 3.47s 3.51s 0.278 49.2MB 0.0309 9 1 32.33s
#> 5 abdiv 14.33s 14.39s 0.0661 20GB 2.06 10 311 2.52m
## Weighted Normalized UniFrac
wn_unifrac_res <- rbind(
local({
# cluster for phyloseq
cl <- parallel::makeCluster(n_cpus)
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl))
bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::weighted_normalized_unifrac, hmp50_mtx, hmp50_tree)),
'ecodive' = cleanup(ecodive::normalized_unifrac(hmp50_dgC_t, hmp50_tree, margin=2)),
'GUniFrac' = cleanup(as.dist(GUniFrac::GUniFrac(hmp50_mtx, hmp50_tree, alpha=1, verbose=FALSE)[[1]][,,1])),
'OmicFlow' = cleanup(OmicFlow::unifrac(hmp50_dgC_pt, hmp50_tree, weighted=TRUE, normalized=TRUE, threads=n_cpus)),
'phyloseq' = cleanup(phyloseq::UniFrac(hmp50_phy, weighted=TRUE, normalized=TRUE, parallel=TRUE)) )
}),
# ampvis2 conflicts with phyloseq cluster, so run separately
local({
bench::mark(
iterations = 10,
'ampvis2' = {
cleanup(ampvis2:::dist.unifrac(hmp50_mtx_t, hmp50_tree, weighted=TRUE, normalise=TRUE, num_threads=n_cpus))
doParallel::stopImplicitCluster() } )
})
)
print(wn_unifrac_res[,1:9] %>% arrange(median))
#> # A tibble: 6 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 OmicFlow 4.56ms 5.2ms 185. 98.7KB 0 10 0 53.95ms
#> 2 ecodive 6.01ms 6.37ms 156. 117.4KB 0 10 0 63.97ms
#> 3 GUniFrac 80.62ms 83.63ms 11.2 92.1MB 1.12 10 1 895.17ms
#> 4 phyloseq 302.16ms 318.88ms 3.07 49.5MB 0.307 10 1 3.25s
#> 5 ampvis2 3.57s 3.69s 0.264 49.3MB 0.0293 9 1 34.14s
#> 6 abdiv 14.08s 14.37s 0.0662 20GB 1.38 10 208 2.52m
## Generalized UniFrac
g_unifrac_res <- rbind(
local({
# cluster for phyloseq
cl <- parallel::makeCluster(n_cpus)
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl))
bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::generalized_unifrac, hmp50_mtx, hmp50_tree, alpha=0.5)),
'ecodive' = cleanup(ecodive::generalized_unifrac(hmp50_dgC_t, hmp50_tree, margin=2, alpha=0.5)),
'GUniFrac' = cleanup(as.dist(GUniFrac::GUniFrac(hmp50_mtx, hmp50_tree, alpha=0.5, verbose=FALSE)[[1]][,,1])) )
})
)
print(g_unifrac_res[,1:9] %>% arrange(median))
#> # A tibble: 3 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 6.54ms 7.26ms 119. 129.5KB 0 10 0 83.72ms
#> 2 GUniFrac 79.56ms 80.82ms 11.6 92.1MB 1.16 10 1 864.08ms
#> 3 abdiv 14.24s 14.76s 0.0664 20.2GB 1.37 10 207 2.51m
## Variance Adjusted UniFrac
va_unifrac_res <- rbind(
local({
# cluster for phyloseq
cl <- parallel::makeCluster(n_cpus)
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl))
bench::mark(
iterations = 10,
'abdiv' = cleanup(pairwise(abdiv::variance_adjusted_unifrac, hmp50_mtx, hmp50_tree)),
'ecodive' = cleanup(ecodive::variance_adjusted_unifrac(hmp50_dgC_t, hmp50_tree, margin=2)) )
})
)
print(va_unifrac_res[,1:9] %>% arrange(median))
#> # A tibble: 2 × 9
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 ecodive 6.03ms 6.29ms 159. 117.4KB 0 10 0 62.85ms
#> 2 abdiv 16.12s 16.25s 0.0583 24.5GB 1.15 10 197 2.86m
## Data frame for Figure 1A
Fig1A_data <- bind_rows(
mutate(shannon_res, Metric = 'shannon'),
mutate(bray_curtis_res, Metric = 'bray'),
mutate(faith_res, Metric = 'faith') ) %>%
mutate(Package = as.character(expression)) %>%
select(Package, Metric, median, mem_alloc) %>%
arrange(Metric, median)
Fig1A_data$Title <- factor(
x = Fig1A_data$Metric,
levels = c("shannon", "bray", "faith"),
labels = c(
"**Shannon Diversity Index**<br>1006 samples",
"**Bray-Curtis Dissimilarity**<br>1006 samples",
"**Faith's Phylogenetic Diversity**<br>50 samples") )
## ggplot for Figure 1A
Fig1A <- ggplot(Fig1A_data, aes(x = median, y = mem_alloc)) +
facet_wrap('Title', nrow = 1, scales = 'free_x') +
geom_point() +
geom_label_repel(aes(
label = Package,
alpha = ifelse(Package == 'ecodive', 1, 0.6),
fontface = ifelse(Package == 'ecodive', 2, 1),
size = ifelse(Package == 'ecodive', 3.2, 2.6) ),
box.padding = .4,
point.padding = .5,
max.overlaps = Inf,
min.segment.length = 0 ) +
scale_alpha_identity(guide = 'none') +
scale_size_identity(guide = 'none') +
bench::scale_x_bench_time() +
scale_y_log10(labels = scales::label_bytes()) +
labs(
tag = 'A',
x = 'Median Calculation Time (log scale; n=10)',
y = 'Memory Allocated\n(log scale; n=1)' ) +
theme_bw(base_size = 12) +
theme(
strip.text = ggtext::element_markdown(hjust = 0, lineheight = 1.2),
axis.title.x = element_text(margin = margin(t = 10)))
## Data frame for Figure 1B
Fig1B_data <- bind_rows(
mutate(u_unifrac_res, `UniFrac Variant` = 'Unweighted'),
mutate(w_unifrac_res, `UniFrac Variant` = 'Weighted'),
mutate(wn_unifrac_res, `UniFrac Variant` = 'Weighted Normalized'),
mutate(g_unifrac_res, `UniFrac Variant` = 'Generalized'),
mutate(va_unifrac_res, `UniFrac Variant` = 'Variance Adjusted') ) %>%
mutate(Package = as.character(expression)) %>%
select(Package, `UniFrac Variant`, median, mem_alloc) %>%
arrange(Package)
Fig1B_data$Title <- "**UniFrac Family**<br>50 samples"
Fig1B_data$`UniFrac Variant` <- factor(
x = Fig1B_data$`UniFrac Variant`,
levels = c("Unweighted", "Weighted", "Weighted Normalized", "Generalized", "Variance Adjusted") )
## ggplot for Figure 1B
Fig1B <- ggplot(Fig1B_data, aes(x = median, y = mem_alloc)) +
facet_wrap('Title') +
geom_point(aes(shape = `UniFrac Variant`), size = 2) +
geom_label_repel(aes(
label = Package,
alpha = ifelse(Package == 'ecodive', 1, 0.6),
fontface = ifelse(Package == 'ecodive', 2, 1),
size = ifelse(Package == 'ecodive', 3.2, 2.6) ),
data = ~subset(., `UniFrac Variant` == 'Unweighted'),
direction = 'y',
box.padding = 0.4,
point.padding = 10,
min.segment.length = Inf ) +
scale_shape(solid = FALSE) +
scale_alpha_identity(guide = 'none') +
scale_size_identity(guide = 'none') +
bench::scale_x_bench_time(limits = c(.002, NA)) +
scale_y_log10(labels = scales::label_bytes()) +
labs(
tag = 'B',
x = 'Median Calculation Time (log scale; n=10)',
y = 'Memory Allocated\n(log scale; n=1)' ) +
theme_bw(base_size = 12) +
theme(
strip.text = ggtext::element_markdown(hjust = 0, lineheight = 1.2),
axis.title.x = element_text(margin = margin(t = 10)) )
## Combined Figure 1
Fig1 <- patchwork::wrap_plots(
Fig1A,
Fig1B,
patchwork::guide_area(),
design = "111\n223",
guides = 'collect' )Results
print(Fig1)How much faster and more memory efficient is ecodive?
plyr::ddply(Fig1A_data, as.name('Metric'), function (x) {
x[['median']] <- as.numeric(x[['median']])
x[['mem_alloc']] <- as.numeric(x[['mem_alloc']])
ecodive <- as.list(x[x[['Package']] == 'ecodive',])
x <- x[x[['Package']] != 'ecodive',]
data.frame(
speed = paste0(paste(collapse=' - ', round(range(x$median / ecodive$median))), 'x'),
memory = paste0(paste(collapse=' - ', round(range(x$mem_alloc / ecodive$mem_alloc))), 'x') )
})
#> Metric speed memory
#> 1 bray 8 - 1368x 2 - 5232x
#> 2 faith 1 - 14022x 119 - 41624x
#> 3 shannon 0 - 235x 96 - 7963x
plyr::ddply(Fig1B_data, as.name('UniFrac Variant'), function (x) {
x[['median']] <- as.numeric(x[['median']])
x[['mem_alloc']] <- as.numeric(x[['mem_alloc']])
ecodive <- as.list(x[x[['Package']] == 'ecodive',])
x <- x[x[['Package']] != 'ecodive',]
data.frame(
speed = paste0(paste(collapse=' - ', round(range(x$median / ecodive$median))), 'x'),
memory = paste0(paste(collapse=' - ', round(range(x$mem_alloc / ecodive$mem_alloc))), 'x') )
})
#> UniFrac Variant speed memory
#> 1 Unweighted 1 - 2147x 6 - 30163x
#> 2 Weighted 1 - 2302x 1 - 178809x
#> 3 Weighted Normalized 1 - 2257x 1 - 178890x
#> 4 Generalized 11 - 2033x 729 - 163456x
#> 5 Variance Adjusted 2584 - 2584x 218449 - 218449xRaw Data for Figure 1 Panel A
print(Fig1A_data[,1:4], n = Inf)
#> # A tibble: 23 × 4
#> Package Metric median mem_alloc
#> <chr> <chr> <bch:tm> <bch:byt>
#> 1 ecodive bray 19.34ms 4MB
#> 2 philentropy bray 159.94ms 38.87MB
#> 3 parallelDist bray 193.92ms 7.8MB
#> 4 OmicFlow bray 229.82ms 28.98MB
#> 5 ecodist bray 434.13ms 29.44MB
#> 6 labdsv bray 849.94ms 68.16MB
#> 7 vegan bray 1.97s 16.45MB
#> 8 abdiv bray 14.63s 20.44GB
#> 9 tabula bray 26.47s 20.44GB
#> 10 ecodive faith 2.05ms 603.3KB
#> 11 phyloregion faith 3.06ms 146.19MB
#> 12 picante faith 82.04ms 69.86MB
#> 13 abdiv faith 567.87ms 651.64MB
#> 14 adiv faith 1.94s 489.12MB
#> 15 entropart faith 28.76s 23.95GB
#> 16 OmicFlow shannon 2.32ms 1.57MB
#> 17 ecodive shannon 5.73ms 16.71KB
#> 18 tabula shannon 26.02ms 28.17MB
#> 19 entropart shannon 38.98ms 41.79MB
#> 20 adiv shannon 41.17ms 129.96MB
#> 21 vegan shannon 63.01ms 68.08MB
#> 22 abdiv shannon 73.23ms 95.38MB
#> 23 philentropy shannon 1.35s 52.67MBRaw Data for Figure 1 Panel B
print(Fig1B_data[,1:4], n = Inf)
#> # A tibble: 24 × 4
#> Package `UniFrac Variant` median mem_alloc
#> <chr> <fct> <bch:tm> <bch:byt>
#> 1 GUniFrac Unweighted 84.11ms 114.09MB
#> 2 GUniFrac Weighted Normalized 83.63ms 92.14MB
#> 3 GUniFrac Generalized 80.82ms 92.14MB
#> 4 OmicFlow Unweighted 6.98ms 4.05MB
#> 5 OmicFlow Weighted 4.31ms 98.66KB
#> 6 OmicFlow Weighted Normalized 5.2ms 98.66KB
#> 7 abdiv Unweighted 14.79s 20.06GB
#> 8 abdiv Weighted 14.39s 20.02GB
#> 9 abdiv Weighted Normalized 14.37s 20.03GB
#> 10 abdiv Generalized 14.76s 20.19GB
#> 11 abdiv Variance Adjusted 16.25s 24.46GB
#> 12 ampvis2 Unweighted 3.61s 54.18MB
#> 13 ampvis2 Weighted 3.51s 49.19MB
#> 14 ampvis2 Weighted Normalized 3.69s 49.34MB
#> 15 ecodive Unweighted 6.89ms 697.21KB
#> 16 ecodive Weighted 6.25ms 117.42KB
#> 17 ecodive Weighted Normalized 6.37ms 117.42KB
#> 18 ecodive Generalized 7.26ms 129.49KB
#> 19 ecodive Variance Adjusted 6.29ms 117.42KB
#> 20 phyloregion Unweighted 6.92ms 150.43MB
#> 21 phyloseq Unweighted 317.52ms 50.87MB
#> 22 phyloseq Weighted 305.79ms 49.39MB
#> 23 phyloseq Weighted Normalized 318.88ms 49.54MB
#> 24 picante Unweighted 2.58s 1.8GBNumber of Dependencies per Package
data.frame(deps = sapply(
FUN = function (pkg) { nrow(pak::pkg_deps(pkg)) - 1 },
X = c(
'abdiv', 'adiv', 'ecodist', 'ecodive', 'entropart', 'GUniFrac',
'kasperskytte/ampvis2', 'labdsv', 'OmicFlow', 'parallelDist',
'philentropy', 'phyloregion', 'phyloseq', 'picante', 'tabula', 'vegan' )))
#> deps
#> abdiv 5
#> adiv 97
#> ecodist 10
#> ecodive 0
#> entropart 82
#> GUniFrac 47
#> kasperskytte/ampvis2 75
#> labdsv 8
#> OmicFlow 90
#> parallelDist 2
#> philentropy 4
#> phyloregion 64
#> phyloseq 54
#> picante 11
#> tabula 2
#> vegan 7Metrics per Package
The following prompt was provided to Google Gemini 3’s “Thinking” model for each package:
I am analyzing the [PACKAGE_NAME] R package to compare the number of ecologically relevant alpha and beta diversity metrics it provides. The package's reference manual is attached.
Please act as an expert in ecological bioinformatics. Carefully read the manual and adhere to these rules:
1. DEFINITION OF A SUITABLE METRIC:
- ECOLOGICAL RELEVANCE: Only include metrics used to quantify community diversity. This includes measures of richness, evenness, dominance, turnover, or nestedness.
- SYMMETRY REQUIREMENT: Strictly ignore any asymmetric metrics (where $d(a, b) \neq d(b, a)$). All included beta diversity metrics must be symmetric.
- EXCLUSIONS: Do NOT count accessory functions, data manipulation, plotting functions, or general statistical correlations (e.g., Pearson/Spearman) unless they are explicitly implemented as a community diversity or dissimilarity index.
2. DISTINCTNESS RULE:
- ARGUMENTS AS METRICS: If a single function computes different named metrics based on a character argument (e.g., `method="bray"` vs `method="jaccard"`), each valid method counts as a SEPARATE metric.
- NAMED METRIC SELECTORS: If a numeric parameter (like q, alpha, or p) is used to dial in specific named indices (e.g., q=0 for Richness, q=1 for Shannon, q=2 for Simpson, or phylogenetic equivalents like Faith's PD, Allen's H, and Rao's Q), count each named case as a distinct metric. Do not count intermediate or arbitrary numeric values.
- THE BINARY TOGGLE: Treat the presence-absence version and abundance-based version of the same metric as ONE entity, unless they are provided as separate functions or separate named methods in a list.
- UNIFRAC EXCEPTION: Count unweighted, weighted, weighted normalized, generalized, and variance-adjusted UniFrac as separate metrics if explicitly supported.
3. CATEGORIZATION:
- ALPHA DIVERSITY: Metrics that quantify the diversity of a single sample or site (inventory diversity).
- BETA DIVERSITY: Metrics that quantify the difference or dissimilarity between a pair of samples (differentiation diversity).
To ensure accuracy, format your response as follows:
- **Candidate Extraction:** List all potential diversity functions and their associated methods/parameters found in the manual.
- **Evaluation:** Filter the candidates against the "Symmetry," "Ecological Relevance," and "Distinctness" rules.
- **Categorized Metric List:** Provide a table or list of all validated metrics with two columns: [Metric Name] and [Category (Alpha or Beta)].
- **Final Summary:**
- **Total Alpha Diversity Metrics:** [Integer]
- **Total Beta Diversity Metrics:** [Integer]
- **UniFrac Metrics Present:** [Yes/No]
| R Package | Alpha | Beta | Unifrac |
|---|---|---|---|
abdiv |
17 | 48 | \checkmark |
adiv |
42 | 58 | - |
ampvis2 |
6 | 25 | \checkmark |
ecodist |
0 | 9 | - |
ecodive |
14 | 36 | \checkmark |
entropart |
11 | 9 | - |
GUniFrac |
0 | 4 | \checkmark |
labdsv |
2 | 7 | - |
OmicFlow |
6 | 9 | \checkmark |
parallelDist |
0 | 26 | - |
philentropy |
1 | 43 | - |
phyloregion |
9 | 10 | \checkmark |
phyloseq |
7 | 43 | \checkmark |
picante |
14 | 9 | \checkmark |
tabula |
16 | 12 | - |
vegan |
17 | 52 | - |
