
Input Matrix

We will use the ex_counts dataset included with ecodive. This feature table contains counts of bacterial genera across various samples.

library(ecodive)

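# Rarefy to an even sequencing depth (each sample below totals 345 counts)
counts <- rarefy(ex_counts)

# Transposed for display: genera in rows, samples in columns
t(counts)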
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  309    6     1
#> Bacteroides            2    2    0   341
#> Corynebacterium        0    0  171     1
#> Haemophilus          180   34    0     1
#> Propionibacterium      1    0   82     0
#> Staphylococcus         0    0   86     1

Alpha Diversity

Alpha diversity measures diversity within a single sample. In ecodive, metrics are grouped into four categories based on the aspect of diversity they quantify.

Richness Metrics

Richness metrics estimate the number of distinct features (e.g., genera) in a sample. The simplest metric, observed(), counts features with non-zero abundance.

# Equivalent to rowSums(counts > 0)
observed(counts)
#> Saliva   Gums   Nose  Stool 
#>      4      3      4      5 

The Chao1 estimator extends this by inferring the number of unobserved, low-abundance features based on the ratio of singletons (counts == 1) to doubletons (counts == 2).

# Infers 8 unobserved genera
chao1(c(1, 1, 1, 1, 2, 5, 5, 5))
#> [1] 16

# Infers fewer than 1 unobserved genus (1^2 / (2 * 4) = 0.125)
chao1(c(1, 2, 2, 2, 2, 5, 5, 5))
#> [1] 8.125

# Samples with no doubletons give Inf, or NaN if they also have no singletons
chao1(counts)
#> Saliva   Gums   Nose  Stool 
#>    4.5    3.0    NaN    Inf 
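
To make the singleton and doubleton arithmetic explicit, here is a minimal base R sketch of the classic Chao1 formula for a single sample. The helper name chao1_sketch is hypothetical and is not part of ecodive; it only mirrors the formula given in the Formulas section and may differ from the package's implementation.

```r
# Hypothetical helper (not ecodive's code): classic Chao1 for one sample.
chao1_sketch <- function(x) {
  n  <- sum(x > 0)   # observed features
  f1 <- sum(x == 1)  # singletons
  f2 <- sum(x == 2)  # doubletons
  n + f1^2 / (2 * f2)
}

chao1_sketch(c(1, 1, 1, 1, 2, 5, 5, 5))  # 8 + 4^2 / (2 * 1) = 16
chao1_sketch(c(1, 2, 2, 2, 2, 5, 5, 5))  # 8 + 1^2 / (2 * 4) = 8.125

# With zero doubletons the denominator is 0
chao1_sketch(c(6, 171, 82, 86))   # Nose: 0 / 0 gives NaN
chao1_sketch(c(1, 341, 1, 1, 1))  # Stool: 16 / 0 gives Inf
```

The last two calls show why the Nose and Stool samples cannot be estimated: the formula divides by the doubleton count.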

Diversity Metrics

Diversity metrics account for both richness and evenness (how equally abundances are distributed).

Simpson’s index is often used as a measure of evenness. In the form computed here (the Gini-Simpson index, 1 - \sum P_i^2), it is the probability that two randomly selected individuals belong to different species.

# High evenness (0.8) vs low evenness (~0.08)
simpson(c(20, 20, 20, 20, 20))
#> [1] 0.8
simpson(c(100, 1, 1, 1, 1))
#> [1] 0.07507396

# Stool < Gums < Saliva < Nose
sort(simpson(counts))
#>      Stool       Gums     Saliva       Nose 
#> 0.02302037 0.18806133 0.50725478 0.63539593 
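
The values above can be reproduced by hand. The following is a minimal base R sketch of the Gini-Simpson formula; gini_simpson is a hypothetical helper name used for illustration, not an ecodive function.

```r
# Hypothetical helper (not ecodive's code): Gini-Simpson index for one sample.
gini_simpson <- function(x) {
  p <- x / sum(x)  # proportional abundances
  1 - sum(p^2)     # chance that two random draws differ in feature
}

gini_simpson(c(20, 20, 20, 20, 20))  # 1 - 5 * 0.2^2 = 0.8
gini_simpson(c(100, 1, 1, 1, 1))     # 1 - 10004 / 10816, about 0.075
```

For k equally abundant features the index equals 1 - 1/k, so it approaches 1 only when a sample is both rich and even.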

The Shannon diversity index (entropy) is another common metric that weights both richness and evenness.

# High richness, high evenness: equals ln(100)
shannon(rep(100, 100))
#> [1] 4.60517

# Stool < Gums < Saliva < Nose
sort(shannon(counts))
#>      Stool       Gums     Saliva       Nose 
#> 0.07927797 0.35692121 0.74119910 1.10615349 
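
As a cross-check, here is a minimal base R sketch of the Shannon formula. The name shannon_sketch is hypothetical, and dropping zero counts before taking logs is an assumption of this sketch (it treats 0 * ln(0) as 0).

```r
# Hypothetical helper (not ecodive's code): Shannon entropy in nats.
shannon_sketch <- function(x) {
  p <- x[x > 0] / sum(x)  # drop zeros so log(0) never occurs
  -sum(p * log(p))
}

shannon_sketch(rep(100, 100))            # log(100), about 4.605
shannon_sketch(c(6, 0, 171, 0, 82, 86))  # Nose sample, about 1.106
```

A perfectly even sample of k features scores ln(k), which is why the index rises with richness as well as evenness.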

Dominance Metrics

Dominance metrics focus on the abundance of the most common species. The Berger-Parker index is the proportional abundance of the single most abundant feature.

# Stool is dominated by Bacteroides (341/345 counts -> ~0.99)
# Nose is more balanced; Corynebacterium is max (171/345 counts -> ~0.50)
sort(berger(counts))
#>      Nose    Saliva      Gums     Stool 
#> 0.4956522 0.5217391 0.8956522 0.9884058 

Phylogenetic Metrics

Phylogenetic metrics use a phylogenetic tree to incorporate evolutionary distance. Faith’s Phylogenetic Diversity (PD) calculates the total branch length spanned by the features present in a sample.

# ex_tree:
#
#       +----------44---------- Haemophilus
#   +-2-|
#   |   +----------------68---------------- Bacteroides  
#   |                      
#   |             +---18---- Streptococcus
#   |      +--12--|       
#   |      |      +--11-- Staphylococcus
#   +--11--|              
#          |      +-----24----- Corynebacterium
#          +--12--|
#                 +--13-- Propionibacterium


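# Branches 13 + 24 + 12 + 11 (the path back to the root is included)
faith(c(Propionibacterium = 1, Corynebacterium = 1), tree = ex_tree)
#> [1] 60

# Branches 13 + 12 + 11 + 44 + 2
faith(c(Propionibacterium = 1, Haemophilus = 1), tree = ex_tree)
#> [1] 82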

# Nose < Gums < Saliva < Stool
sort(faith(counts, tree = ex_tree))
#>   Nose   Gums Saliva  Stool 
#>    101    155    180    202 

Formulas

Given:

  • n : Number of features (e.g. species, OTUs, ASVs).
  • X_i : Integer count of the i-th feature.
  • X_T : Total of all counts (sequencing depth). X_T = \sum_{i=1}^{n} X_i
  • P_i : Proportional abundance of the i-th feature. P_i = X_i / X_T
  • F_1 : Number of singletons (X_i = 1).
  • F_2 : Number of doubletons (X_i = 2).

Metric                                    Formula
----------------------------------------  -------
Abundance-based Coverage Estimator (ACE)  See below.
Berger-Parker Index                       \max(P_i)
Brillouin Index                           \displaystyle \frac{\ln{[(\sum_{i = 1}^{n} X_i)!]} - \sum_{i = 1}^{n} \ln{(X_i!)}}{\sum_{i = 1}^{n} X_i}
Chao1                                     \displaystyle n + \frac{(F_1)^2}{2 F_2}
Faith’s Phylogenetic Diversity            See below.
Fisher’s Alpha (\alpha)                   \displaystyle \frac{n}{\alpha} = \ln{\left(1 + \frac{X_T}{\alpha}\right)}
                                          (\alpha is solved for iteratively)
Gini-Simpson Index                        1 - \sum_{i = 1}^{n} P_i^2
Inverse Simpson Index                     1 / \sum_{i = 1}^{n} P_i^2
Margalef’s Richness Index                 \displaystyle \frac{n - 1}{\ln{X_T}}
McIntosh Index                            \displaystyle \frac{X_T - \sqrt{\sum_{i = 1}^{n} (X_i)^2}}{X_T - \sqrt{X_T}}
Menhinick’s Richness Index                \displaystyle \frac{n}{\sqrt{X_T}}
Observed Features                         n
Shannon Diversity Index                   -\sum_{i = 1}^{n} P_i \times \ln(P_i)
Squares Richness Estimator                \displaystyle n + \frac{(F_1)^2 \sum_{i=1}^{n} (X_i)^2}{X_T^2 - nF_1}
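
Several of these formulas are not demonstrated earlier in the article. The base R sketch below evaluates them directly for the Saliva sample. All variable names are illustrative, and the use of uniroot() for Fisher's alpha is an assumption of this sketch, not necessarily the iteration scheme ecodive uses.

```r
# Saliva counts from the example table, with absent genera dropped
x   <- c(162, 2, 180, 1)
n   <- length(x)    # number of features
x_t <- sum(x)       # sequencing depth (345)
f1  <- sum(x == 1)  # singletons

margalef    <- (n - 1) / log(x_t)
menhinick   <- n / sqrt(x_t)
mcintosh    <- (x_t - sqrt(sum(x^2))) / (x_t - sqrt(x_t))
inv_simpson <- 1 / sum((x / x_t)^2)
squares     <- n + (f1^2 * sum(x^2)) / (x_t^2 - n * f1)

# Brillouin: lgamma(k + 1) equals log(k!) and does not overflow for large k
brillouin <- (lgamma(x_t + 1) - sum(lgamma(x + 1))) / x_t

# Fisher's alpha: root of alpha * log(1 + x_t / alpha) - n, which is the
# table's equation multiplied through by alpha (requires n < x_t)
fisher <- uniroot(
  function(a) a * log(1 + x_t / a) - n,
  interval = c(1e-6, 1e6), tol = 1e-10
)$root

round(c(margalef = margalef, menhinick = menhinick, mcintosh = mcintosh,
        inv_simpson = inv_simpson, squares = squares,
        brillouin = brillouin, fisher = fisher), 4)
```

Using lgamma() rather than factorial() matters here because 345! is too large for a double.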

Abundance-based Coverage Estimator (ACE)

Given:

  • r : Rare cutoff (features with \le r counts are considered rare).
  • F_i : Number of features with exactly i counts (F_1 is the number of singletons).
  • F_{rare} : Number of rare features.
  • F_{abund} : Number of abundant features (> r counts).
  • X_{rare} : Total counts belonging to rare features.
  • C_{ace} : Sample abundance coverage estimator.
  • \gamma_{ace}^2 : Estimated coefficient of variation.

C_{ace} = 1 - \frac{F_1}{X_{rare}}

\gamma_{ace}^2 = \max\left[\frac{F_{rare} \sum_{i=1}^{r}i(i-1)F_i}{C_{ace}X_{rare}(X_{rare} - 1)} - 1, 0\right]

D_{ace} = F_{abund} + \frac{F_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}}\gamma_{ace}^2
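
The three equations can be chained together as in the base R sketch below. The helper ace_sketch is hypothetical and is not ecodive's implementation. The default r = 10 is a commonly used cutoff and is an assumption here, and the sketch does not guard against the degenerate case where every rare feature is a singleton (C_{ace} = 0).

```r
# Hypothetical helper (not ecodive's code); r = 10 is an assumed cutoff.
ace_sketch <- function(x, r = 10) {
  x       <- x[x > 0]
  rare    <- x[x <= r]
  f_rare  <- length(rare)               # number of rare features
  f_abund <- sum(x > r)                 # number of abundant features
  x_rare  <- sum(rare)                  # total counts in rare features
  f_i     <- tabulate(rare, nbins = r)  # f_i[i] = features with exactly i counts
  i       <- seq_len(r)

  c_ace  <- 1 - f_i[1] / x_rare
  gamma2 <- max(
    f_rare * sum(i * (i - 1) * f_i) / (c_ace * x_rare * (x_rare - 1)) - 1,
    0
  )
  f_abund + f_rare / c_ace + f_i[1] / c_ace * gamma2
}

# Made-up counts: 8 rare features (23 counts in total) and 2 abundant ones
ace_sketch(c(1, 1, 1, 2, 2, 3, 5, 8, 40, 120))  # about 13.14
```

In this made-up example C_{ace} = 20/23 and \gamma_{ace}^2 = 688/440 - 1, so the estimate is 2 + 9.2 + 1.94, or roughly three features more than the ten observed.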

Faith’s Phylogenetic Diversity (Faith’s PD)

Given a tree with n branches (here n counts branches, not features), branch lengths L_i, and a binary vector A in which A_i is 1 if any descendant of branch i is present in the sample and 0 otherwise:

\sum_{i = 1}^{n} L_i A_i
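
The sum can be checked against the ex_tree diagram shown earlier. In the base R sketch below, the branch table is transcribed by hand from that diagram. This is an assumption of the sketch, since a real workflow would derive branches from the tree object, and faith_sketch is a hypothetical helper, not ecodive's implementation.

```r
# Each branch of ex_tree: its length and the tips that descend from it
# (transcribed by hand from the diagram; illustrative only).
branches <- list(
  list(len =  2, tips = c("Haemophilus", "Bacteroides")),
  list(len = 44, tips = "Haemophilus"),
  list(len = 68, tips = "Bacteroides"),
  list(len = 11, tips = c("Streptococcus", "Staphylococcus",
                          "Corynebacterium", "Propionibacterium")),
  list(len = 12, tips = c("Streptococcus", "Staphylococcus")),
  list(len = 18, tips = "Streptococcus"),
  list(len = 11, tips = "Staphylococcus"),
  list(len = 12, tips = c("Corynebacterium", "Propionibacterium")),
  list(len = 24, tips = "Corynebacterium"),
  list(len = 13, tips = "Propionibacterium")
)

faith_sketch <- function(present) {
  L <- vapply(branches, function(b) b$len, numeric(1))
  # A_i is TRUE when at least one descendant of branch i is present
  A <- vapply(branches, function(b) any(b$tips %in% present), logical(1))
  sum(L * A)
}

faith_sketch(c("Propionibacterium", "Corynebacterium"))  # 13 + 24 + 12 + 11 = 60
faith_sketch(c("Propionibacterium", "Haemophilus"))      # 13 + 12 + 11 + 44 + 2 = 82
```

Both results match the faith() calls shown in the Phylogenetic Metrics section, which confirms that branches leading back to the root are counted.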