Probabilistic Symmetric Chi-Squared distance

A chi-squared based distance metric for comparing probability distributions.

Usage

psym_chisq(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts: A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.
margin: The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1
pairs: The sample pairs for which distances should be calculated. The default value (NULL) calculates all-vs-all. Otherwise, provide a numeric or logical vector specifying which positions in the distance matrix to calculate. See examples.
cpus: The number of parallel processing threads to use. The default, n_cpus(), uses all logical CPU cores.

Details

The Probabilistic Symmetric $\chi^2$ distance is defined as: $$2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}$$

Where:

$P_i$, $Q_i$ : Proportional abundances of the $i$-th feature.
$n$ : The number of features.
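
A short derivation, not stated in the source, indicates the range of values to expect. Assuming $P$ and $Q$ each sum to 1 and that terms with $P_i + Q_i = 0$ are taken as zero, the inequality $(P_i - Q_i)^2 \le (P_i + Q_i)^2$ gives:

$$0 \le 2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i} \le 2\sum_{i=1}^{n}(P_i + Q_i) = 4$$

The lower bound is reached when the two samples have identical proportions, and the upper bound when they share no features.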

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
2 * sum((p - q)^2 / (p + q))
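
In base R, the expression above evaluates to NaN when a feature is absent from both samples, because that term becomes 0/0. The following is a minimal sketch that skips such features, assuming they should contribute zero to the sum. The helper name `psym_chisq_pair` and the toy vectors are hypothetical and are not part of the package; how `psym_chisq()` itself handles this case is not stated here.

```r
# Hypothetical helper: Probabilistic Symmetric chi-squared distance
# between two count vectors, using base R only.
psym_chisq_pair <- function(x, y) {
  p <- x / sum(x)        # convert counts to proportions
  q <- y / sum(y)
  keep <- (p + q) > 0    # assumption: features absent from both samples contribute 0
  2 * sum((p[keep] - q[keep])^2 / (p[keep] + q[keep]))
}

x <- c(10, 0, 30, 0)     # toy counts, not taken from ex_counts
y <- c( 0, 0, 20, 20)
d <- psym_chisq_pair(x, y)
```

For these toy vectors the proportions are (0.25, 0, 0.75, 0) and (0, 0, 0.5, 0.5), so the three retained terms are 0.25, 0.05, and 0.5, and `d` is 1.6.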

Input Types

The counts parameter accepts a plain numeric matrix, and also supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.

Examples

psym_chisq(ex_counts)
#>          Saliva      Gums      Nose
#> Gums  0.8481868                    
#> Nose  3.7831387 3.7855551          
#> Stool 3.9279613 3.9329354 3.9390787
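
To see how an all-vs-all result of this shape can be assembled, here is a rough base R sketch on a small invented matrix. It assumes samples are in rows (the margin = 1 layout) and that features absent from both samples contribute zero. The names `toy`, `pair_dist`, and `toy_dist` are hypothetical, and this is not the package's implementation.

```r
# Toy count matrix (samples in rows), invented for illustration.
toy <- rbind(
  A = c(10, 0, 30, 0),
  B = c( 0, 0, 20, 20),
  C = c( 5, 5,  5,  5)
)

props <- toy / rowSums(toy)    # row-wise proportions

# Distance between two proportion vectors.
pair_dist <- function(p, q) {
  keep <- (p + q) > 0          # assumption: skip features absent from both samples
  2 * sum((p[keep] - q[keep])^2 / (p[keep] + q[keep]))
}

# Fill a symmetric matrix one sample pair at a time.
n <- nrow(props)
m <- matrix(0, n, n, dimnames = list(rownames(toy), rownames(toy)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    m[i, j] <- m[j, i] <- pair_dist(props[i, ], props[j, ])
  }
}

toy_dist <- as.dist(m)         # lower-triangle 'dist' object
```

Printing `toy_dist` gives a lower-triangle layout like the one above. The explicit double loop is chosen for readability rather than speed.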