Introduction
Beta diversity is a measure of how different two samples are. The
different metrics described in this vignette quantify that difference,
referred to as the “distance” or “dissimilarity” between a pair of
samples. The distance is typically 0 for identical samples
and 1 for completely different samples.
Input Data
We will use the ex_counts feature table included with
ecodive. It contains the number of observations of each bacterial genera
in each sample. In the text below, you can substitute the word ‘genera’
for the feature of interest in your own data.
library(ecodive)
counts <- rarefy(ex_counts)
t(counts)
#> Saliva Gums Nose Stool
#> Streptococcus 162 309 6 1
#> Bacteroides 2 2 0 341
#> Corynebacterium 0 0 171 1
#> Haemophilus 180 34 0 1
#> Propionibacterium 1 0 82 0
#> Staphylococcus 0 0 86 1Looking at the matrix above, you can see that saliva and gums are similar, while saliva and stool are quite different.
Weighted vs Unweighted Metrics
Before selecting a formula, it is important to distinguish between weighted and unweighted metrics.
- Weighted metrics take relative abundances into account.
- Unweighted metrics only consider presence/absence (binary data).
You can consult list_metrics() to see which category a
specific metric falls into or to list all available options
programmatically.
list_metrics('beta', 'id', weighted = FALSE)
#> [1] "sorensen" "hamming" "jaccard" "ochiai" "unweighted_unifrac"
list_metrics('beta', 'id', weighted = TRUE)
#> [1] "aitchison" "bhattacharyya" "bray"
#> [4] "canberra" "chebyshev" "chord"
#> [7] "clark" "divergence" "euclidean"
#> [10] "generalized_unifrac" "gower" "hellinger"
#> [13] "horn" "jensen" "jsd"
#> [16] "lorentzian" "manhattan" "matusita"
#> [19] "minkowski" "morisita" "motyka"
#> [22] "normalized_unifrac" "psym_chisq" "soergel"
#> [25] "squared_chisq" "squared_chord" "squared_euclidean"
#> [28] "topsoe" "variance_adjusted_unifrac" "wave_hedges"
#> [31] "weighted_unifrac"Formulas
The following tables detail the mathematical definitions for the
metrics available in ecodive.
Abundance-Based (Weighted)
Given:
- n : The number of features.
- X_i, Y_i : Absolute counts for the i-th feature in samples X and Y.
- X_T, Y_T : Total counts in each sample. X_T = \sum_{i=1}^{n} X_i
- P_i, Q_i : Proportional abundances of X_i and Y_i. P_i = X_i / X_T
- X_L, Y_L : Mean log of abundances. X_L = \frac{1}{n}\sum_{i=1}^{n} \ln{X_i}
- R_i : The range of the i-th feature across all samples (max - min).
Aitchison distance aitchison()
|
\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2} |
Bhattacharyya distance bhattacharyya()
|
-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}} |
Bray-Curtis dissimilarity bray()
|
\displaystyle \frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} (X_i + Y_i)} |
Canberra distance canberra()
|
\displaystyle \sum_{i=1}^{n} \frac{|X_i - Y_i|}{X_i + Y_i} |
Chebyshev distance chebyshev()
|
\max(|X_i - Y_i|) |
Chord distance chord()
|
\displaystyle \sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2} |
Clark’s divergence distance clark()
|
\displaystyle \sqrt{\sum_{i=1}^{n}\left(\frac{X_i - Y_i}{X_i + Y_i}\right)^{2}} |
Divergence divergence()
|
\displaystyle 2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2} |
Euclidean distance euclidean()
|
\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2} |
Gower distance gower()
|
\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{|X_i - Y_i|}{R_i} |
Hellinger distance hellinger()
|
\sqrt{\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}} |
Horn-Morisita dissimilarity horn()
|
\displaystyle 1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2} |
Jensen-Shannon distance jensen()
|
\displaystyle \sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]} |
Jensen-Shannon divergence (JSD) jsd()
|
\displaystyle \frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right] |
Lorentzian distance lorentzian()
|
\sum_{i=1}^{n}\ln{(1 + |X_i - Y_i|)} |
Manhattan distance manhattan()
|
\sum_{i=1}^{n} |X_i - Y_i| |
Matusita distance matusita()
|
\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2} |
Minkowski distance minkowski()
|
\sqrt[p]{\sum_{i=1}^{n} (X_i
- Y_i)^p} Where p is the geometry of the space. |
|
Morisita dissimilarity (integers only) morisita()
|
\displaystyle 1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\displaystyle \left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}} |
Motyka dissimilarity motyka()
|
\displaystyle \frac{\sum_{i=1}^{n} \max(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)} |
Probabilistic Symmetric \chi^2 distance psym_chisq()
|
\displaystyle 2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i} |
Soergel distance soergel()
|
\displaystyle \frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} \max(X_i, Y_i)} |
Squared \chi^2
distance squared_chisq()
|
\displaystyle \sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i} |
Squared Chord distance squared_chord()
|
\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2 |
Squared Euclidean distance squared_euclidean()
|
\sum_{i=1}^{n} (X_i - Y_i)^2 |
Topsoe distance topsoe()
|
\displaystyle \sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right) |
Wave Hedges distance wave_hedges()
|
\displaystyle \sum_{i=1}^{n}\frac{|X_i - Y_i|}{\max(X_i, Y_i)} |
Presence / Absence (Unweighted)
Given:
- A, B : Number of features in each sample.
- J : Number of features in common.
Dice-Sorensen dissimilarity sorensen()
|
\displaystyle \frac{2J}{(A + B)} |
Hamming distance hamming()
|
\displaystyle (A + B) - 2J |
Jaccard distance jaccard()
|
\displaystyle 1 - \frac{J}{(A + B - J)} |
Otsuka-Ochiai dissimilarity ochiai()
|
\displaystyle 1 - \frac{J}{\sqrt{AB}} |
Note:
sorensen()is equivalent tobray(norm = 'binary'), andjaccard()is equivalent tosoergel(norm = 'binary').
Phylogenetic
Given n branches with lengths L and a pair of samples’ binary (A and B) or proportional abundances (P and Q) on each of those branches.
Unweighted UniFrac unweighted_unifrac()
|
\displaystyle \frac{1}{n}\sum_{i=1}^{n} L_i|A_i - B_i| |
Weighted UniFrac weighted_unifrac()
|
\displaystyle \sum_{i=1}^{n} L_i|P_i - Q_i| |
Normalized Weighted UniFrac normalized_unifrac()
|
\displaystyle \frac{\sum_{i=1}^{n} L_i|P_i - Q_i|}{\sum_{i=1}^{n} L_i(P_i + Q_i)} |
Generalized UniFrac (GUniFrac) generalized_unifrac()
|
\displaystyle
\frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left|\displaystyle
\frac{P_i - Q_i}{P_i + Q_i}\right|}{\sum_{i=1}^{n} L_i(P_i +
Q_i)^{\alpha}} Where \alpha is a scalable weighting factor. |
Variance-Adjusted Weighted UniFrac
variance_adjusted_unifrac()
|
\displaystyle \frac{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{|P_i - Q_i|}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} } |
See https://cmmr.github.io/ecodive/articles/unifrac.html for detailed example UniFrac calculations.
Partial Calculation
The default value of pairs=NULL in ecodive’s beta
diversity functions results in the returned all-vs-all distance matrix
being completely filled in.
bray(counts)
#> Saliva Gums Nose
#> Gums 0.4260870
#> Nose 0.9797101 0.9826087
#> Stool 0.9884058 0.9884058 0.9913043If you are doing a reference-vs-all comparison, you can use the
pairs parameter to skip unwanted calculations and save some
CPU time. The larger the dataset, the more noticeable the improvement
will be.
bray(counts, pairs = 1:3)
#> Saliva Gums Nose
#> Gums 0.4260870
#> Nose 0.9797101 NA
#> Stool 0.9884058 NA NAThe pairs argument can be:
- A numeric vector, giving the positions in the result to calculate.
- A logical vector, indicating whether to calculate a position in the result.
- A
function(i,j)that returns whether rowsiandjshould be compared.
Therefore, all of the following are equivalent:
bray(counts, pairs = 1:3)
bray(counts, pairs = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
bray(counts, pairs = function (i, j) i == 1)The ordering of pairs follows the pairings produced by
combn().
# Column index pairings
combn(nrow(counts), 2)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 1 1 1 2 2 3
#> [2,] 2 3 4 3 4 4
# Sample name pairings
combn(rownames(counts), 2)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "Saliva" "Saliva" "Saliva" "Gums" "Gums" "Nose"
#> [2,] "Gums" "Nose" "Stool" "Nose" "Stool" "Stool"So, for instance, to use gums as the reference sample:
References
Levy, A., Shalom, B. R., & Chalamish, M. (2024). A guide to similarity measures. arXiv.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
