Parallel and Memory-Efficient Ecological Diversity Metrics • ecodive

ecodive is an R package for calculating ecological diversity metrics in a parallelized and memory-efficient manner. It is designed to handle large datasets, such as those common in microbiome research.

Why ecodive?

Analyzing ecological diversity is often a computational bottleneck, especially with large datasets. ecodive addresses this by providing:

High Performance: ecodive is written in C and parallelized using pthreads, making it dramatically faster than other R packages. Benchmarks show it can be up to 9,000x faster and use up to 200,000x less memory.
Zero Dependencies: The package has no external R dependencies, making it lightweight, stable, and easy to install. This also makes it an ideal and secure backend for other R packages.
Comprehensive Metrics: It implements a wide range of common alpha and beta diversity metrics, including classic and phylogenetic-aware methods like Faith’s PD and the complete UniFrac family.
Ease of Use: The API is simple and integrates seamlessly with popular bioinformatics packages like phyloseq and rbiom.

Installation

The latest stable version can be installed from CRAN.

install.packages('ecodive')

The development version is available on GitHub.

install.packages('pak')
pak::pak('cmmr/ecodive')

Usage

ecodive functions are straightforward to use. Here are a few examples.

With `phyloseq` or `rbiom` objects

The easiest way to use ecodive is with a phyloseq or rbiom object. These objects conveniently bundle the count data and phylogenetic tree.

library(ecodive)
data(esophagus, package = 'phyloseq')
data(hmp50,     package = 'rbiom')

# Calculate weighted UniFrac distance
w_unifrac <- weighted_unifrac(esophagus)
print(w_unifrac)
#>           B         C
#> C 0.1050480          
#> D 0.1401124 0.1422409

# Calculate Faith's Phylogenetic Diversity
faith_pd <- faith(hmp50)
print(faith_pd[1:4])
#>   HMP01   HMP02   HMP03   HMP04 
#> 6.22296 8.59432 8.93375 9.86597

With basic R objects

You can also provide the count data and phylogenetic tree as separate objects.

The ex_counts and ex_tree objects are included with ecodive.

## Example Data ----------------------
counts <- rarefy(ex_counts)
counts[,1:4]
#>        Streptococcus Bacteroides Corynebacterium Haemophilus
#> Saliva           162           2               0         180
#> Gums             309           2               0          34
#> Nose               6           0             171           0
#> Stool              1         341               1           1


## Alpha Diversity -------------------
shannon(counts)
#>     Saliva       Gums       Nose      Stool 
#> 0.74119910 0.35692121 1.10615349 0.07927797 

faith(counts, tree = ex_tree)
#> Saliva   Gums   Nose  Stool 
#>    180    155    101    202 


## Beta Diversity --------------------
bray(counts)
#>          Saliva      Gums      Nose
#> Gums  0.4260870                    
#> Nose  0.9797101 0.9826087          
#> Stool 0.9884058 0.9884058 0.9913043

weighted_unifrac(counts, tree = ex_tree)
#>          Saliva      Gums      Nose
#> Gums   36.97681                    
#> Nose   67.23768  55.97101          
#> Stool 109.77971 109.44058 110.00870

Available Methods

Use list_metrics() to browse the metrics available for calculating diversity.

# Alpha Diversity
list_metrics('alpha', 'id')
#>  [1] "ace"         "berger"      "brillouin"   "chao1"       "faith"      
#>  [6] "fisher"      "simpson"     "inv_simpson" "margalef"    "mcintosh"   
#> [11] "menhinick"   "observed"    "shannon"     "squares"    

# Beta Diversity
list_metrics('beta', 'id')
#>  [1] "aitchison"                 "bhattacharyya"             "bray"                     
#>  [4] "canberra"                  "chebyshev"                 "chord"                    
#>  [7] "clark"                     "sorensen"                  "divergence"               
#> [10] "euclidean"                 "generalized_unifrac"       "gower"                    
#> [13] "hamming"                   "hellinger"                 "horn"                     
#> [16] "jaccard"                   "jensen"                    "jsd"                      
#> [19] "lorentzian"                "manhattan"                 "matusita"                 
#> [22] "minkowski"                 "morisita"                  "motyka"                   
#> [25] "normalized_unifrac"        "ochiai"                    "psym_chisq"               
#> [28] "soergel"                   "squared_chisq"             "squared_chord"            
#> [31] "squared_euclidean"         "topsoe"                    "unweighted_unifrac"       
#> [34] "variance_adjusted_unifrac" "wave_hedges"               "weighted_unifrac"

Documentation

The online manual for ecodive is available at https://cmmr.github.io/ecodive/. It includes a getting started guide, articles on alpha/beta diversity, and reference pages for each function.

Testing

ecodive has a comprehensive test suite using the testthat package. All code is automatically tested via GitHub Actions upon each push to the repository to ensure functionality and stability.

For users who have cloned the source repository, the tests can be run locally with the following commands in an R session:

install.packages('testthat')
devtools::test()

Community guidelines

Support

Bug reports, feature requests, and general questions can be submitted at https://github.com/cmmr/ecodive/issues.

Contributing

Pull requests are welcome. Please ensure contributed code is covered by tests and documentation (add additional tests and documentation as needed) and passes all automated tests.

New functions must leverage C and pthreads to minimize memory and CPU time.

Please note that the ecodive project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

ecodive