Beta Diversity Metrics

Usage

aitchison(counts, pseudocount = NULL, pairs = NULL, cpus = n_cpus())

bhattacharyya(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

bray(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

canberra(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

chebyshev(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

chord(counts, pairs = NULL, cpus = n_cpus())

clark(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

divergence(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

euclidean(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

gower(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

hellinger(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

horn(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

jensen(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

jsd(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

lorentzian(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

manhattan(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

matusita(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

minkowski(counts, norm = "percent", power = 1.5, pairs = NULL, cpus = n_cpus())

morisita(counts, pairs = NULL, cpus = n_cpus())

motyka(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

psym_chisq(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

soergel(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

squared_chisq(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

squared_chord(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

squared_euclidean(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

topsoe(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

wave_hedges(counts, norm = "percent", pairs = NULL, cpus = n_cpus())

hamming(counts, pairs = NULL, cpus = n_cpus())

jaccard(counts, pairs = NULL, cpus = n_cpus())

ochiai(counts, pairs = NULL, cpus = n_cpus())

sorensen(counts, pairs = NULL, cpus = n_cpus())

unweighted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

weighted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

normalized_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

generalized_unifrac(
  counts,
  tree = NULL,
  alpha = 0.5,
  pairs = NULL,
  cpus = n_cpus()
)

variance_adjusted_unifrac(counts, tree = NULL, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data where each column is a feature, and each row is a sample. Any object coercible with as.matrix() can be given here, as well as phyloseq, rbiom, SummarizedExperiment, and TreeSummarizedExperiment objects.

pseudocount

The value to add to all counts in counts to prevent taking log(0) for unobserved features. The default, NULL, selects the smallest non-zero value in counts.
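
As a rough illustration of this default, the sketch below picks the smallest non-zero count as the pseudocount and then applies the Aitchison formula from the Formulas section by hand in base R. The two vectors are the Saliva and Gums samples from the example matrix. `clr_sketch()` and `aitchison_sketch` are hypothetical names, not part of the package, and the package's own implementation may differ in detail.

```r
# Saliva and Gums counts from the example matrix (features in the same order).
saliva <- c(162, 2, 0, 180, 1, 0)
gums   <- c(793, 4, 0, 87, 1, 1)
counts <- rbind(Saliva = saliva, Gums = gums)

# Default described above: the smallest non-zero value anywhere in `counts`.
pseudocount <- min(counts[counts > 0])

# Hypothetical helper: log abundances centered on their mean (ln X_i - X_L).
clr_sketch <- function(v) log(v) - mean(log(v))

# Aitchison distance = Euclidean distance between the centered log vectors.
aitchison_sketch <- sqrt(sum(
  (clr_sketch(saliva + pseudocount) - clr_sketch(gums + pseudocount))^2
))
aitchison_sketch
```

Adding the pseudocount before taking logs is what keeps the unobserved features (the zeros) from producing `-Inf`.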

pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate. See examples.
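
To see which position corresponds to which pair of samples, the base R sketch below enumerates the pairs in the order a `dist` object stores them. It assumes that `pairs` positions follow the usual `stats::dist` ordering (lower triangle, column by column), which is consistent with the `pairs = 1:3` example on this page. The variable names are illustrative only.

```r
samples <- c("Saliva", "Gums", "Nose", "Stool")

# Column k of `idx` holds the two sample indices stored at position k of a
# `dist` object (lower triangle, read column by column).
idx <- combn(length(samples), 2)
pair_labels <- paste(samples[idx[1, ]], samples[idx[2, ]], sep = " vs ")
pair_labels

# Numeric selection: positions of every pair that involves Saliva.
saliva_positions <- which(idx[1, ] == 1 | idx[2, ] == 1)

# Equivalent logical selection, one element per position.
saliva_mask <- idx[1, ] == 1 | idx[2, ] == 1
```

Either `saliva_positions` or `saliva_mask` could then be passed as `pairs`, under the ordering assumption stated above.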

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

norm

Normalize the incoming counts. Options are:

norm = "percent" : Relative abundance (sample abundances sum to 1).
norm = "binary" : Unweighted presence/absence (each count is either 0 or 1).
norm = "clr" : Centered log ratio.
norm = "none" : No transformation.

Default: 'percent', which is the expected input for these formulas.
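
The transformations can be sketched in base R on a single sample, here the Saliva counts from the example matrix. This is only an approximation of what each option means. In particular, the centered log ratio is shown on the observed features only, because how the package treats zeros under norm = "clr" is not stated on this page.

```r
# Saliva counts from the example matrix.
x <- c(Streptococcus = 162, Bacteroides = 2, Corynebacterium = 0,
       Haemophilus = 180, Propionibacterium = 1, Staphylococcus = 0)

percent <- x / sum(x)          # relative abundance; sums to 1
binary  <- as.numeric(x > 0)   # presence/absence; each value is 0 or 1

# Centered log ratio on the observed features only. Zeros have to be dealt
# with before taking logs; restricting to x > 0 is an assumption of this sketch.
observed <- x[x > 0]
clr <- log(observed) - mean(log(observed))
```

By construction the `clr` values sum to zero, and `percent` sums to 1.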

power

Scaling factor for the magnitude of differences between communities (\(p\)). Default: 1.5
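
The effect of `power` can be seen in a small base R sketch of the Minkowski formula. It takes absolute differences so that fractional powers are defined. `minkowski_sketch()` and the two proportion vectors are hypothetical and exist only for illustration.

```r
# Hypothetical helper mirroring the Minkowski formula on two proportion vectors.
minkowski_sketch <- function(p_vec, q_vec, power = 1.5) {
  sum(abs(p_vec - q_vec)^power)^(1 / power)
}

p_vec <- c(0.5, 0.5, 0.0)
q_vec <- c(0.2, 0.3, 0.5)

d1  <- minkowski_sketch(p_vec, q_vec, power = 1)   # same as Manhattan
d2  <- minkowski_sketch(p_vec, q_vec, power = 2)   # same as Euclidean
d15 <- minkowski_sketch(p_vec, q_vec)              # default power = 1.5
```

Larger powers put more emphasis on the features with the biggest differences, so the default of 1.5 falls between the Manhattan and Euclidean results.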

tree

A phylo-class object representing the phylogenetic tree for the OTUs in counts. The OTU identifiers given by colnames(counts) must be present in tree. Can be omitted if a tree is embedded with the counts object or as attr(counts, 'tree').

alpha

How much weight to give to relative abundances; a value between 0 and 1, inclusive. Setting alpha=1 is equivalent to normalized_unifrac().

Value

A dist object.

Formulas

Given:

\(n\) : The number of features.
\(X_i\), \(Y_i\) : Absolute counts for the \(i\)-th feature in samples \(X\) and \(Y\).
\(X_T\), \(Y_T\) : Total counts in each sample. \(X_T = \sum_{i=1}^{n} X_i\)
\(P_i\), \(Q_i\) : Proportional abundances of \(X_i\) and \(Y_i\). \(P_i = X_i / X_T\)
\(X_L\), \(Y_L\) : Mean log of abundances. \(X_L = \frac{1}{n}\sum_{i=1}^{n} \ln{X_i}\)
\(R_i\) : The range of the \(i\)-th feature across all samples (max - min).


Aitchison distance `aitchison()`	\(\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}\)
Bhattacharyya distance `bhattacharyya()`	\(-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}\)
Bray-Curtis dissimilarity `bray()`	\(\displaystyle \frac{\sum_{i=1}^{n} \|P_i - Q_i\|}{\sum_{i=1}^{n} (P_i + Q_i)}\)
Canberra distance `canberra()`	\(\displaystyle \sum_{i=1}^{n} \frac{\|P_i - Q_i\|}{P_i + Q_i}\)
Chebyshev distance `chebyshev()`	\(\max(\|P_i - Q_i\|)\)
Chord distance `chord()`	\(\displaystyle \sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}\)
Clark's divergence distance `clark()`	\(\displaystyle \sqrt{\sum_{i=1}^{n}\left(\frac{P_i - Q_i}{P_i + Q_i}\right)^{2}}\)
Divergence `divergence()`	\(\displaystyle 2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}\)
Euclidean distance `euclidean()`	\(\sqrt{\sum_{i=1}^{n} (P_i - Q_i)^2}\)
Gower distance `gower()`	\(\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{\|P_i - Q_i\|}{R_i}\)
Hellinger distance `hellinger()`	\(\sqrt{\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}\)
Horn-Morisita dissimilarity `horn()`	\(\displaystyle 1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}\)
Jensen-Shannon distance `jensen()`	\(\displaystyle \sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}\)
Jensen-Shannon divergence (JSD) `jsd()`	\(\displaystyle \frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]\)
Lorentzian distance `lorentzian()`	\(\sum_{i=1}^{n}\ln{(1 + \|P_i - Q_i\|)}\)
Manhattan distance `manhattan()`	\(\sum_{i=1}^{n} \|P_i - Q_i\|\)
Matusita distance `matusita()`	\(\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}\)
Minkowski distance `minkowski()`	\(\sqrt[p]{\sum_{i=1}^{n} \|P_i - Q_i\|^p}\) Where \(p\) is the `power` argument, which sets the geometry of the space (\(p = 1\) gives Manhattan, \(p = 2\) gives Euclidean).
Morisita dissimilarity (integer counts only) `morisita()`	\(\displaystyle 1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\displaystyle \left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}\)
Motyka dissimilarity `motyka()`	\(\displaystyle \frac{\sum_{i=1}^{n} \max(P_i, Q_i)}{\sum_{i=1}^{n} (P_i + Q_i)}\)
Probabilistic Symmetric \(\chi^2\) distance `psym_chisq()`	\(\displaystyle 2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\)
Soergel distance `soergel()`	\(\displaystyle \frac{\sum_{i=1}^{n} \|P_i - Q_i\|}{\sum_{i=1}^{n} \max(P_i, Q_i)}\)
Squared \(\chi^2\) distance `squared_chisq()`	\(\displaystyle \sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\)
Squared Chord distance `squared_chord()`	\(\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2\)
Squared Euclidean distance `squared_euclidean()`	\(\sum_{i=1}^{n} (P_i - Q_i)^2\)
Topsoe distance `topsoe()`	\(\displaystyle \sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\)
Wave Hedges distance `wave_hedges()`	\(\displaystyle \sum_{i=1}^{n} \frac{\|P_i - Q_i\|}{\max(P_i, Q_i)}\)
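
As a worked check of the notation, the base R sketch below applies the Bray-Curtis formula by hand to the Saliva and Gums samples from the example matrix, then shows how the Topsoe, JSD, and Jensen-Shannon formulas relate to one another. The `*_sketch` names and `half_term()` are hypothetical helpers. Treating terms with a zero proportion as contributing 0 is an assumption of this sketch.

```r
saliva <- c(162, 2, 0, 180, 1, 0)
gums   <- c(793, 4, 0, 87, 1, 1)

# Proportional abundances P_i and Q_i.
p <- saliva / sum(saliva)
q <- gums / sum(gums)

# Bray-Curtis: sum of absolute differences over sum of totals.
bray_sketch <- sum(abs(p - q)) / sum(p + q)
round(bray_sketch, 7)

# One half of the Topsoe sum; zero proportions are skipped (assumed 0 * ln 0 = 0).
half_term <- function(a, b) {
  keep <- a > 0
  sum(a[keep] * log(2 * a[keep] / (a[keep] + b[keep])))
}

topsoe_sketch <- half_term(p, q) + half_term(q, p)
jsd_sketch    <- topsoe_sketch / 2     # Jensen-Shannon divergence
jensen_sketch <- sqrt(jsd_sketch)      # Jensen-Shannon distance
```

The rounded Bray-Curtis value agrees with the Saliva/Gums entry printed by `bray(ex_counts)` in the Examples section.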

Presence / Absence

Given:

\(A\), \(B\) : Number of features in each sample.
\(J\) : Number of features in common.


Dice-Sorensen dissimilarity `sorensen()`	\(\displaystyle 1 - \frac{2J}{(A + B)}\)
Hamming distance `hamming()`	\(\displaystyle (A + B) - 2J\)
Jaccard distance `jaccard()`	\(\displaystyle 1 - \frac{J}{(A + B - J)}\)
Otsuka-Ochiai dissimilarity `ochiai()`	\(\displaystyle 1 - \frac{J}{\sqrt{AB}}\)
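
The quantities \(A\), \(B\), and \(J\) can be counted directly from two samples. The base R sketch below does so for the Saliva and Gums samples from the example matrix and evaluates three of the formulas by hand. The `*_sketch` names are hypothetical and not part of the package.

```r
saliva <- c(162, 2, 0, 180, 1, 0)
gums   <- c(793, 4, 0, 87, 1, 1)

a <- sum(saliva > 0)               # features present in Saliva
b <- sum(gums > 0)                 # features present in Gums
j <- sum(saliva > 0 & gums > 0)    # features present in both

jaccard_sketch <- 1 - j / (a + b - j)
hamming_sketch <- (a + b) - 2 * j
ochiai_sketch  <- 1 - j / sqrt(a * b)
```

Here `jaccard_sketch` agrees with the Saliva/Gums entry printed by `jaccard(ex_counts)` in the Examples section.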

Phylogenetic

Given \(n\) branches with lengths \(L\) and a pair of samples' binary (\(A\) and \(B\)) or proportional abundances (\(P\) and \(Q\)) on each of those branches.


Unweighted UniFrac `unweighted_unifrac()`	\(\displaystyle \frac{1}{n}\sum_{i=1}^{n} L_i\|A_i - B_i\|\)
Weighted UniFrac `weighted_unifrac()`	\(\displaystyle \sum_{i=1}^{n} L_i\|P_i - Q_i\|\)
Normalized Weighted UniFrac `normalized_unifrac()`	\(\displaystyle \frac{\sum_{i=1}^{n} L_i\|P_i - Q_i\|}{\sum_{i=1}^{n} L_i(P_i + Q_i)}\)
Generalized UniFrac (GUniFrac) `generalized_unifrac()`	\(\displaystyle \frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left\|\displaystyle \frac{P_i - Q_i}{P_i + Q_i}\right\|}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}\) Where \(\alpha\) is a scalable weighting factor.
Variance-Adjusted Weighted UniFrac `variance_adjusted_unifrac()`	\(\displaystyle \frac{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{\|P_i - Q_i\|}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }{\displaystyle \sum_{i=1}^{n} L_i\displaystyle \frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }\)

See vignette('unifrac') for detailed example UniFrac calculations.
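
For a quick feel of the weighted formulas, the base R sketch below evaluates them on a small made-up tree. The tree, branch lengths, and abundances are all hypothetical, and the branch-level proportions are written out by hand rather than derived from a `phylo` object. It also illustrates that `alpha = 1` reduces the generalized formula to the normalized weighted one.

```r
# Hypothetical tree ((t1, t2), t3): four branches with these lengths.
branch_len <- c(t1 = 0.2, t2 = 0.3, n1 = 0.5, t3 = 0.4)

# Proportion of each sample descending from each branch. The internal
# branch n1 carries the sum of tips t1 and t2.
p <- c(t1 = 0.5, t2 = 0.3, n1 = 0.8, t3 = 0.2)
q <- c(t1 = 0.1, t2 = 0.1, n1 = 0.2, t3 = 0.8)

weighted   <- sum(branch_len * abs(p - q))
normalized <- weighted / sum(branch_len * (p + q))

# Generalized UniFrac; assumes p + q > 0 on every branch, as in this toy data.
generalized_sketch <- function(alpha) {
  w <- branch_len * (p + q)^alpha
  sum(w * abs(p - q) / (p + q)) / sum(w)
}

generalized_sketch(0.5)
all.equal(generalized_sketch(1), normalized)   # alpha = 1 matches normalized
```

Branches where both samples have zero abundance would need to be dropped before the division in `generalized_sketch()`. The toy data avoids that case.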

References

Levy, A., Shalom, B. R., & Chalamish, M. (2024). A guide to similarity measures. arXiv.

Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.

Examples

    # Example counts matrix
    t(ex_counts)
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  793   22     1
#> Bacteroides            2    4    2   611
#> Corynebacterium        0    0  498     1
#> Haemophilus          180   87    2     1
#> Propionibacterium      1    1  251     0
#> Staphylococcus         0    1  236     1
    
    bray(ex_counts)
#>          Saliva      Gums      Nose
#> Gums  0.4265973                    
#> Nose  0.9713843 0.9720256          
#> Stool 0.9909509 0.9911046 0.9915177
    
    jaccard(ex_counts)
#>          Saliva      Gums      Nose
#> Gums  0.2000000                    
#> Nose  0.3333333 0.1666667          
#> Stool 0.5000000 0.3333333 0.1666667
    
    generalized_unifrac(ex_counts, tree = ex_tree)
#>          Saliva      Gums      Nose
#> Gums  0.4471644                    
#> Nose  0.8215129 0.7607876          
#> Stool 0.9727827 0.9784242 0.9730332
    
    # Only calculate distances for Saliva vs all.
    bray(ex_counts, pairs = 1:3)
#>          Saliva      Gums      Nose
#> Gums  0.4265973                    
#> Nose  0.9713843        NA          
#> Stool 0.9909509        NA        NA