Beta Diversity Metrics

Usage

aitchison(counts, pseudocount = NULL, pairs = NULL, cpus = n_cpus())

bhattacharyya(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

bray(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

canberra(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

chebyshev(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

chord(counts, pairs = NULL, cpus = n_cpus())

clark(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

divergence(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

euclidean(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

gower(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

hellinger(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

horn(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

jensen(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

jsd(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

lorentzian(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

manhattan(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

matusita(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

minkowski(counts, rescale = TRUE, power = 1.5, pairs = NULL, cpus = n_cpus())

morisita(counts, pairs = NULL, cpus = n_cpus())

motyka(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

psym_chisq(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

soergel(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

squared_chisq(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

squared_chord(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

squared_euclidean(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

topsoe(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

wave_hedges(counts, rescale = TRUE, pairs = NULL, cpus = n_cpus())

hamming(counts, pairs = NULL, cpus = n_cpus())

jaccard(counts, pairs = NULL, cpus = n_cpus())

ochiai(counts, pairs = NULL, cpus = n_cpus())

sorensen(counts, pairs = NULL, cpus = n_cpus())

unweighted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

weighted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

normalized_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

generalized_unifrac(
  counts,
  tree = NULL,
  alpha = 0.5,
  pairs = NULL,
  cpus = n_cpus()
)

variance_adjusted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data where each column is a feature, and each row is a sample. Any object coercible with as.matrix() can be given here, as well as phyloseq, rbiom, SummarizedExperiment, and TreeSummarizedExperiment objects.

pseudocount

The value to add to all counts in counts to prevent taking log(0) for unobserved features. The default, NULL, selects the smallest non-zero value in counts.

pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate. See examples.

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

rescale

Normalize each sample's counts so they sum to 1. Default: TRUE

power

Scaling factor for the magnitude of differences between communities (\(p\)). Default: 1.5

tree

A phylo-class object representing the phylogenetic tree for the OTUs in counts. The OTU identifiers given by colnames(counts) must be present in tree. Can be omitted if a tree is embedded with the counts object or as attr(counts, 'tree').

alpha

How much weight to give to relative abundances; a value between 0 and 1, inclusive. Setting alpha=1 is equivalent to normalized_unifrac().
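
To make the pairs positions concrete, the following base R sketch lists which sample pair each position refers to for the four samples used in the Examples. It assumes positions follow the column-wise lower-triangle order of a standard dist object, which is consistent with pairs = 1:3 selecting Saliva vs all in the Examples. The names samples, pair_map, saliva_positions, and saliva_logical are illustrative only.

```r
# Sample names in the same order as the rows of the counts matrix.
samples <- c("Saliva", "Gums", "Nose", "Stool")

# Row/column indices of the lower triangle, in column-major order.
# This is the order in which a dist object stores its values.
idx <- which(lower.tri(diag(length(samples))), arr.ind = TRUE)

pair_map <- data.frame(
  position = seq_len(nrow(idx)),
  sample_1 = samples[idx[, "col"]],
  sample_2 = samples[idx[, "row"]]
)
print(pair_map)

# Positions that involve Saliva, as a numeric and as a logical vector.
saliva_positions <- which(
  pair_map$sample_1 == "Saliva" | pair_map$sample_2 == "Saliva"
)
saliva_logical <- pair_map$position %in% saliva_positions
```

Under that assumption, either vector could be passed as pairs, for example bray(ex_counts, pairs = saliva_positions).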

Value

A dist object.

Formulas

Given:

  • \(n\) : The number of features.

  • \(X_i\), \(Y_i\) : Absolute counts for the \(i\)-th feature in samples \(X\) and \(Y\).

  • \(X_T\), \(Y_T\) : Total counts in each sample. \(X_T = \sum_{i=1}^{n} X_i\)

  • \(P_i\), \(Q_i\) : Proportional abundances of \(X_i\) and \(Y_i\). \(P_i = X_i / X_T\)

  • \(X_L\), \(Y_L\) : Mean log of abundances. \(X_L = \frac{1}{n}\sum_{i=1}^{n} \ln{X_i}\)

  • \(R_i\) : The range of the \(i\)-th feature across all samples (max - min).
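
As a rough illustration of this notation, the base R sketch below computes the quantities for the Saliva and Gums samples from the Examples and then assembles the Aitchison formula from them. It is a sketch under stated assumptions, not the package's implementation: the pseudocount is taken from these two samples only (the documented default looks at the whole counts matrix), and all variable names are illustrative.

```r
# Two samples copied from the Examples section (features in the same order).
X <- c(162, 2, 0, 180, 1, 0)   # Saliva
Y <- c(793, 4, 0, 87, 1, 1)    # Gums

# Total counts per sample.
X_T <- sum(X)
Y_T <- sum(Y)

# Proportional abundances; this is the normalization rescale = TRUE describes.
P <- X / X_T
Q <- Y / Y_T

# Mean log abundance. A pseudocount is added first so log(0) never occurs;
# here the smallest non-zero count is used (assumption: two samples only).
pseudocount <- min(c(X, Y)[c(X, Y) > 0])
X_L <- mean(log(X + pseudocount))
Y_L <- mean(log(Y + pseudocount))

# Aitchison distance assembled from these pieces.
aitchison_xy <- sqrt(sum(((log(X + pseudocount) - X_L) -
                          (log(Y + pseudocount) - Y_L))^2))
```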

Aitchison distance
aitchison()
\(\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}\)
Bhattacharyya distance
bhattacharyya()
\(-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}\)
Bray-Curtis dissimilarity
bray()
\(\displaystyle \frac{\sum_{i=1}^{n} |P_i - Q_i|}{\sum_{i=1}^{n} (P_i + Q_i)}\)
Canberra distance
canberra()
\(\displaystyle \sum_{i=1}^{n} \frac{|P_i - Q_i|}{P_i + Q_i}\)
Chebyshev distance
chebyshev()
\(\max(|P_i - Q_i|)\)
Chord distance
chord()
\(\displaystyle \sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}\)
Clark's divergence distance
clark()
\(\displaystyle \sqrt{\sum_{i=1}^{n}\left(\frac{P_i - Q_i}{P_i + Q_i}\right)^{2}}\)
Divergence
divergence()
\(\displaystyle 2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}\)
Euclidean distance
euclidean()
\(\sqrt{\sum_{i=1}^{n} (P_i - Q_i)^2}\)
Gower distance
gower()
\(\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{|P_i - Q_i|}{R_i}\)
Hellinger distance
hellinger()
\(\sqrt{\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}\)
Horn-Morisita dissimilarity
horn()
\(\displaystyle 1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}\)
Jensen-Shannon distance
jensen()
\(\displaystyle \sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}\)
Jensen-Shannon divergence (JSD)
jsd()
\(\displaystyle \frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]\)
Lorentzian distance
lorentzian()
\(\sum_{i=1}^{n}\ln{(1 + |P_i - Q_i|)}\)
Manhattan distance
manhattan()
\(\sum_{i=1}^{n} |P_i - Q_i|\)
Matusita distance
matusita()
\(\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}\)
Minkowski distance
minkowski()
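\(\sqrt[p]{\sum_{i=1}^{n} |P_i - Q_i|^p}\)
Where \(p\) is the power argument, which sets the geometry of the space: \(p = 1\) gives the Manhattan distance and \(p = 2\) gives the Euclidean distance.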
Morisita dissimilarity
* Integers Only
morisita()
\(\displaystyle 1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\displaystyle \left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}\)
Motyka dissimilarity
motyka()
\(\displaystyle \frac{\sum_{i=1}^{n} \max(P_i, Q_i)}{\sum_{i=1}^{n} (P_i + Q_i)}\)
Probabilistic Symmetric \(\chi^2\) distance
psym_chisq()
\(\displaystyle 2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\)
Soergel distance
soergel()
\(\displaystyle \frac{\sum_{i=1}^{n} |P_i - Q_i|}{\sum_{i=1}^{n} \max(P_i, Q_i)}\)
Squared \(\chi^2\) distance
squared_chisq()
\(\displaystyle \sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\)
Squared Chord distance
squared_chord()
\(\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2\)
Squared Euclidean distance
squared_euclidean()
\(\sum_{i=1}^{n} (P_i - Q_i)^2\)
Topsoe distance
topsoe()
\(\displaystyle \sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\)
Wave Hedges distance
wave_hedges()
\(\displaystyle \sum_{i=1}^{n} \frac{|P_i - Q_i|}{\max(P_i, Q_i)}\)
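
As a check on how these formulas are applied, the base R sketch below evaluates the Bray-Curtis formula by hand for the Saliva and Gums samples from the Examples, and confirms that a Minkowski distance built on absolute differences reduces to the Manhattan and Euclidean formulas at \(p = 1\) and \(p = 2\). The helper minkowski_xy is hypothetical and is not part of the package.

```r
# Counts from the Examples section (features in the same order).
saliva <- c(162, 2, 0, 180, 1, 0)
gums   <- c(793, 4, 0, 87, 1, 1)

# rescale = TRUE: work with proportions that sum to 1.
P <- saliva / sum(saliva)
Q <- gums / sum(gums)

# Bray-Curtis dissimilarity.
bray_xy <- sum(abs(P - Q)) / sum(P + Q)
round(bray_xy, 7)
#> [1] 0.4265973

# Minkowski distance using absolute differences (hypothetical helper).
minkowski_xy <- function(P, Q, p) sum(abs(P - Q)^p)^(1 / p)

# p = 1 and p = 2 recover the Manhattan and Euclidean formulas.
all.equal(minkowski_xy(P, Q, 1), sum(abs(P - Q)))
#> [1] TRUE
all.equal(minkowski_xy(P, Q, 2), sqrt(sum((P - Q)^2)))
#> [1] TRUE
```

The hand-computed value agrees with the Saliva vs Gums entry that bray(ex_counts) prints in the Examples.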

Presence / Absence

Given:

  • \(A\), \(B\) : Number of features in each sample.

  • \(J\) : Number of features in common.

Dice-Sorensen dissimilarity
sorensen()
\(\displaystyle 1 - \frac{2J}{(A + B)}\)
Hamming distance
hamming()
\(\displaystyle (A + B) - 2J\)
Jaccard distance
jaccard()
\(\displaystyle 1 - \frac{J}{(A + B - J)}\)
Otsuka-Ochiai dissimilarity
ochiai()
\(\displaystyle 1 - \frac{J}{\sqrt{AB}}\)
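
The presence/absence quantities can be counted directly from a pair of samples. The base R sketch below does this for Saliva and Gums from the Examples and evaluates the Jaccard, Hamming, and Otsuka-Ochiai formulas; variable names are illustrative.

```r
# Counts from the Examples section (features in the same order).
saliva <- c(162, 2, 0, 180, 1, 0)
gums   <- c(793, 4, 0, 87, 1, 1)

A <- sum(saliva > 0)               # features present in Saliva
B <- sum(gums > 0)                 # features present in Gums
J <- sum(saliva > 0 & gums > 0)    # features present in both

jaccard_xy <- 1 - J / (A + B - J)
hamming_xy <- (A + B) - 2 * J
ochiai_xy  <- 1 - J / sqrt(A * B)

jaccard_xy
#> [1] 0.2
```

The Jaccard value agrees with the Saliva vs Gums entry that jaccard(ex_counts) prints in the Examples.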

Phylogenetic

Given \(n\) branches with lengths \(L\) and a pair of samples' binary (\(A\) and \(B\)) or proportional abundances (\(P\) and \(Q\)) on each of those branches.

Unweighted UniFrac
unweighted_unifrac()
\(\displaystyle \frac{1}{n}\sum_{i=1}^{n} L_i|A_i - B_i|\)
Weighted UniFrac
weighted_unifrac()
\(\displaystyle \sum_{i=1}^{n} L_i|P_i - Q_i|\)
Normalized Weighted UniFrac
normalized_unifrac()
\(\displaystyle \frac{\sum_{i=1}^{n} L_i|P_i - Q_i|}{\sum_{i=1}^{n} L_i(P_i + Q_i)}\)
Generalized UniFrac (GUniFrac)
generalized_unifrac()
\(\displaystyle \frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left|\displaystyle \frac{P_i - Q_i}{P_i + Q_i}\right|}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}\)
Where \(\alpha\) is a scalable weighting factor.
Variance-Adjusted Weighted UniFrac
variance_adjusted_unifrac()
\(\displaystyle \frac{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{|P_i - Q_i|}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }\)

See vignette('unifrac') for detailed example UniFrac calculations.
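
The effect of alpha in the Generalized UniFrac formula can be checked numerically. The base R sketch below uses made-up branch lengths and per-branch abundances (not derived from a real tree, and chosen so that \(P_i + Q_i > 0\) on every branch) to confirm that alpha = 1 reproduces the Normalized Weighted UniFrac formula, while alpha = 0 weights every branch by its length alone. The helper gunifrac_sketch is hypothetical and is not the package's implementation.

```r
# Hypothetical branch lengths and per-branch proportional abundances.
L <- c(0.2, 0.5, 0.1, 0.4, 0.3)
P <- c(1.0, 0.6, 0.4, 0.0, 0.6)
Q <- c(1.0, 0.1, 0.9, 0.7, 0.2)

# Direct transcription of the Generalized UniFrac formula.
gunifrac_sketch <- function(L, P, Q, alpha) {
  w <- L * (P + Q)^alpha
  sum(w * abs(P - Q) / (P + Q)) / sum(w)
}

# Normalized Weighted UniFrac formula, for comparison.
normalized_sketch <- sum(L * abs(P - Q)) / sum(L * (P + Q))

# alpha = 1: the (P + Q) factors cancel, giving the normalized formula.
all.equal(gunifrac_sketch(L, P, Q, alpha = 1), normalized_sketch)
#> [1] TRUE

# alpha = 0: each branch contributes |P - Q| / (P + Q), weighted by length.
all.equal(gunifrac_sketch(L, P, Q, alpha = 0),
          sum(L * abs(P - Q) / (P + Q)) / sum(L))
#> [1] TRUE
```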

References

Levy, A., Shalom, B. R., & Chalamish, M. (2024). A guide to similarity measures. arXiv.

Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.

Examples

# Example counts matrix
t(ex_counts)
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  793   22     1
#> Bacteroides            2    4    2   611
#> Corynebacterium        0    0  498     1
#> Haemophilus          180   87    2     1
#> Propionibacterium      1    1  251     0
#> Staphylococcus         0    1  236     1

bray(ex_counts)
#>          Saliva      Gums      Nose
#> Gums  0.4265973
#> Nose  0.9713843 0.9720256
#> Stool 0.9909509 0.9911046 0.9915177

jaccard(ex_counts)
#>          Saliva      Gums      Nose
#> Gums  0.2000000
#> Nose  0.3333333 0.1666667
#> Stool 0.5000000 0.3333333 0.1666667

generalized_unifrac(ex_counts, tree = ex_tree)
#>          Saliva      Gums      Nose
#> Gums  0.4471644
#> Nose  0.8215129 0.7607876
#> Stool 0.9727827 0.9784242 0.9730332

# Only calculate distances for Saliva vs all.
bray(ex_counts, pairs = 1:3)
#>          Saliva      Gums      Nose
#> Gums  0.4265973
#> Nose  0.9713843        NA
#> Stool 0.9909509        NA        NA