Beta Diversity Metrics
Usage
aitchison(counts, pseudocount = NULL, pairs = NULL, cpus = n_cpus())
bhattacharyya(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
bray(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
canberra(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
chebyshev(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
chord(counts, pairs = NULL, cpus = n_cpus())
clark(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
divergence(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
euclidean(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
gower(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
hellinger(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
horn(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
jensen(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
jsd(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
lorentzian(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
manhattan(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
matusita(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
minkowski(counts, rescale = TRUE, power = 1.5, pairs = NULL, cpus = n_cpus())
morisita(counts, pairs = NULL, cpus = n_cpus())
motyka(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
psym_chisq(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
soergel(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
squared_chisq(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
squared_chord(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
squared_euclidean(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
topsoe(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
wave_hedges(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())
hamming(counts, pairs = NULL, cpus = n_cpus())
jaccard(counts, pairs = NULL, cpus = n_cpus())
ochiai(counts, pairs = NULL, cpus = n_cpus())
sorensen(counts, pairs = NULL, cpus = n_cpus())
unweighted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())
weighted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())
normalized_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())
generalized_unifrac(
counts,
tree = NULL,
alpha = 0.5,
pairs = NULL,
cpus = n_cpus()
)
variance_adjusted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())
Arguments
- counts
A numeric matrix of count data where each column is a feature, and each row is a sample. Any object coercible with
as.matrix()
can be given here, as well asphyloseq
,rbiom
,SummarizedExperiment
, andTreeSummarizedExperiment
objects.- pseudocount
The value to add to all counts in
counts
to prevent takinglog(0)
for unobserved features. The default,NULL
, selects the smallest non-zero value incounts
.- pairs
Which combinations of samples should distances be calculated for? The default value (
NULL
) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate. See examples.- cpus
How many parallel processing threads should be used. The default,
n_cpus()
, will use all logical CPU cores.- rescale
Normalize each sample's counts so they sum to
1
. Default:TRUE
- power
Scaling factor for the magnitude of differences between communities (\(p\)). Default:
1.5
- tree
A
phylo
-class object representing the phylogenetic tree for the OTUs incounts
. The OTU identifiers given bycolnames(counts)
must be present intree
. Can be omitted if a tree is embedded with thecounts
object or asattr(counts, 'tree')
.- alpha
How much weight to give to relative abundances; a value between 0 and 1, inclusive. Setting
alpha=1
is equivalent tonormalized_unifrac()
.
Formulas
Given:
\(n\) : The number of features.
\(X_i\), \(Y_i\) : Absolute counts for the \(i\)-th feature in samples \(X\) and \(Y\).
\(X_T\), \(Y_T\) : Total counts in each sample. \(X_T = \sum_{i=1}^{n} X_i\)
\(P_i\), \(Q_i\) : Proportional abundances of \(X_i\) and \(Y_i\). \(P_i = X_i / X_T\)
\(X_L\), \(Y_L\) : Mean log of abundances. \(X_L = \frac{1}{n}\sum_{i=1}^{n} \ln{X_i}\)
\(R_i\) : The range of the \(i\)-th feature across all samples (max - min).
Aitchison distance aitchison() | \(\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}\) |
Bhattacharyya distance bhattacharyya() | \(-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}\) |
Bray-Curtis dissimilarity bray() | \(\displaystyle \frac{\sum_{i=1}^{n} |P_i - Q_i|}{\sum_{i=1}^{n} (P_i + Q_i)}\) |
Canberra distance canberra() | \(\displaystyle \sum_{i=1}^{n} \frac{|P_i - Q_i|}{P_i + Q_i}\) |
Chebyshev distance chebyshev() | \(\max(|P_i - Q_i|)\) |
Chord distance chord() | \(\displaystyle \sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}\) |
Clark's divergence distance clark() | \(\displaystyle \sqrt{\sum_{i=1}^{n}\left(\frac{P_i - Q_i}{P_i + Q_i}\right)^{2}}\) |
Divergence divergence() | \(\displaystyle 2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}\) |
Euclidean distance euclidean() | \(\sqrt{\sum_{i=1}^{n} (P_i - Q_i)^2}\) |
Gower distance gower() | \(\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{|P_i - Q_i|}{R_i}\) |
Hellinger distance hellinger() | \(\sqrt{\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}\) |
Horn-Morisita dissimilarity horn() | \(\displaystyle 1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}\) |
Jensen-Shannon distance jensen() | \(\displaystyle \sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}\) |
Jensen-Shannon divergence (JSD) jsd() | \(\displaystyle \frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]\) |
Lorentzian distance lorentzian() | \(\sum_{i=1}^{n}\ln{(1 + |P_i - Q_i|)}\) |
Manhattan distance manhattan() | \(\sum_{i=1}^{n} |P_i - Q_i|\) |
Matusita distance matusita() | \(\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}\) |
Minkowski distance minkowski() | \(\sqrt[p]{\sum_{i=1}^{n} (P_i - Q_i)^p}\) Where \(p\) is the geometry of the space. |
Morisita dissimilarity * Integers Only morisita() | \(\displaystyle 1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\displaystyle \left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}\) |
Motyka dissimilarity motyka() | \(\displaystyle \frac{\sum_{i=1}^{n} \max(P_i, Q_i)}{\sum_{i=1}^{n} (P_i + Q_i)}\) |
Probabilistic Symmetric \(\chi^2\) distance psym_chisq() | \(\displaystyle 2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\) |
Soergel distance soergel() | \(\displaystyle \frac{\sum_{i=1}^{n} |P_i - Q_i|}{\sum_{i=1}^{n} \max(P_i, Q_i)}\) |
Squared \(\chi^2\) distance squared_chisq() | \(\displaystyle \sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\) |
Squared Chord distance squared_chord() | \(\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2\) |
Squared Euclidean distance squared_euclidean() | \(\sum_{i=1}^{n} (P_i - Q_i)^2\) |
Topsoe distance topsoe() | \(\displaystyle \sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\) |
Wave Hedges distance wave_hedges() | \(\displaystyle \frac{\sum_{i=1}^{n} |P_i - Q_i|}{\sum_{i=1}^{n} \max(P_i, Q_i)}\) |
Presence / Absence
Given:
\(A\), \(B\) : Number of features in each sample.
\(J\) : Number of features in common.
Dice-Sorensen dissimilarity sorensen() | \(\displaystyle \frac{2J}{(A + B)}\) |
Hamming distance hamming() | \(\displaystyle (A + B) - 2J\) |
Jaccard distance jaccard() | \(\displaystyle 1 - \frac{J}{(A + B - J)]}\) |
Otsuka-Ochiai dissimilarity ochiai() | \(\displaystyle 1 - \frac{J}{\sqrt{AB}}\) |
Phylogenetic
Given \(n\) branches with lengths \(L\) and a pair of samples' binary (\(A\) and \(B\)) or proportional abundances (\(P\) and \(Q\)) on each of those branches.
Unweighted UniFrac unweighted_unifrac() | \(\displaystyle \frac{1}{n}\sum_{i=1}^{n} L_i|A_i - B_i|\) |
Weighted UniFrac weighted_unifrac() | \(\displaystyle \sum_{i=1}^{n} L_i|P_i - Q_i|\) |
Normalized Weighted UniFrac normalized_unifrac() | \(\displaystyle \frac{\sum_{i=1}^{n} L_i|P_i - Q_i|}{\sum_{i=1}^{n} L_i(P_i + Q_i)}\) |
Generalized UniFrac (GUniFrac) generalized_unifrac() | \(\displaystyle \frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left|\displaystyle \frac{P_i - Q_i}{P_i + Q_i}\right|}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}\) Where \(\alpha\) is a scalable weighting factor. |
Variance-Adjusted Weighted UniFrac variance_adjusted_unifrac() | \(\displaystyle \frac{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{|P_i - Q_i|}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }\) |
See vignette('unifrac')
for detailed example UniFrac calculations.
References
Levy, A., Shalom, B. R., & Chalamish, M. (2024). A guide to similarity measures. arXiv.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
Examples
# Example counts matrix
t(ex_counts)
#> Saliva Gums Nose Stool
#> Streptococcus 162 793 22 1
#> Bacteroides 2 4 2 611
#> Corynebacterium 0 0 498 1
#> Haemophilus 180 87 2 1
#> Propionibacterium 1 1 251 0
#> Staphylococcus 0 1 236 1
bray(ex_counts)
#> Saliva Gums Nose
#> Gums 0.4265973
#> Nose 0.9713843 0.9720256
#> Stool 0.9909509 0.9911046 0.9915177
jaccard(ex_counts)
#> Saliva Gums Nose
#> Gums 0.2000000
#> Nose 0.3333333 0.1666667
#> Stool 0.5000000 0.3333333 0.1666667
generalized_unifrac(ex_counts, tree = ex_tree)
#> Saliva Gums Nose
#> Gums 0.4471644
#> Nose 0.8215129 0.7607876
#> Stool 0.9727827 0.9784242 0.9730332
# Only calculate distances for Saliva vs all.
bray(ex_counts, pairs = 1:3)
#> Saliva Gums Nose
#> Gums 0.4265973
#> Nose 0.9713843 NA
#> Stool 0.9909509 NA NA