Abundance-based Coverage Estimator (ACE)

A non-parametric estimator of species richness that separates features into abundant and rare groups.

Usage

ace(counts, cutoff = 10L, margin = 1L, cpus = n_cpus())

Arguments

counts: A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.
cutoff: The largest count at which a feature is still considered "rare". Features with more than cutoff counts are treated as abundant. Default: 10.
margin: The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1.
cpus: How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.
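
As an illustration of what margin means, the sketch below uses base R's apply() on a small made-up matrix. The matrix, its dimnames, and the observed() stand-in statistic are assumptions for illustration only, and this is not a description of how ace() is implemented.

```r
# Made-up count matrix with samples as rows (the margin = 1 layout).
counts <- matrix(
  c(5, 0, 3,
    1, 1, 12),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("SampleA", "SampleB"), c("f1", "f2", "f3"))
)

# Hypothetical stand-in statistic: features observed at least once.
observed <- function(x) sum(x > 0)

# Samples are rows, so iterate over margin 1.
by_rows <- apply(counts, 1L, observed)

# After transposing, samples are columns, so iterate over margin 2.
by_cols <- apply(t(counts), 2L, observed)

print(by_rows)
print(all.equal(by_rows, by_cols))
```

Both calls yield one value per sample; only the orientation of the input differs.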

Details

The ACE metric separates features into "abundant" and "rare" groups based on a cutoff (usually 10 counts). It assumes that the presence of abundant species is certain, while the true number of rare species must be estimated.

Equations:

$$C_{ace} = 1 - \frac{F_1}{X_{rare}}$$

$$\gamma_{ace}^2 = \max\left[\frac{F_{rare} \sum_{i=1}^{r}i(i-1)F_i}{C_{ace}X_{rare}(X_{rare} - 1)} - 1, 0\right]$$

$$D_{ace} = F_{abund} + \frac{F_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}}\gamma_{ace}^2$$

Where:

$r$ : Rare cutoff (default 10). Features with $\le r$ counts are considered rare.
$F_i$ : Number of features with exactly $i$ counts.
$F_1$ : Number of features with exactly one count (singletons).
$F_{rare}$ : Number of rare features, i.e. those with between 1 and $r$ counts: $\sum_{i=1}^{r} F_i$.
$F_{abund}$ : Number of abundant features, i.e. those with more than $r$ counts.
$X_{rare}$ : Total counts belonging to rare features: $\sum_{i=1}^{r} iF_i$.
$C_{ace}$ : The sample abundance coverage estimator.
$\gamma_{ace}^2$ : The estimated squared coefficient of variation for rare features.
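
The equations above can be sketched in base R for a single sample. The function name ace_sketch and the example vector are hypothetical and not part of the documented API. The sketch assumes integer counts and makes no claim about how ace() itself handles edge cases.

```r
# Hypothetical helper (illustration only): ACE for one sample's counts.
ace_sketch <- function(x, cutoff = 10L) {
  x <- x[x > 0]                    # drop features that were never observed

  rare    <- x[x <= cutoff]        # counts of the rare features
  f_abund <- sum(x > cutoff)       # F_abund
  f_rare  <- length(rare)          # F_rare
  x_rare  <- sum(rare)             # X_rare
  f_1     <- sum(rare == 1)        # F_1 (singletons)

  # Sample coverage estimate: C_ace = 1 - F_1 / X_rare
  c_ace <- 1 - f_1 / x_rare

  # sum_{i=1}^{r} i (i - 1) F_i, accumulated feature by feature:
  # a rare feature with count i contributes i * (i - 1).
  ss <- sum(rare * (rare - 1))

  # Squared coefficient of variation, floored at zero.
  gamma2 <- max(f_rare * ss / (c_ace * x_rare * (x_rare - 1)) - 1, 0)

  # D_ace = F_abund + F_rare / C_ace + (F_1 / C_ace) * gamma^2
  f_abund + f_rare / c_ace + (f_1 / c_ace) * gamma2
}

example_counts <- c(1, 1, 2, 3, 4, 12, 20)   # made-up sample
print(ace_sketch(example_counts))            # 679/81, about 8.38
```

For this vector, $F_{abund} = 2$, $F_{rare} = 5$, $X_{rare} = 11$ and $F_1 = 2$, so $C_{ace} = 9/11$, $\gamma_{ace}^2 = 1/9$ and $D_{ace} = 679/81 \approx 8.38$. In this sketch, a sample whose rare features are all singletons has $C_{ace} = 0$, and the result is NaN.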

Parameter: cutoff

The integer threshold distinguishing rare from abundant species. Standard practice is to use 10.

Input Types

The counts parameter accepts a plain numeric matrix, and also accepts objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Chao, A., & Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417), 210-217. doi:10.1080/01621459.1992.10475194

Examples

    ace(ex_counts)
    #> Saliva   Gums   Nose  Stool
    #>    5.0    8.9    6.0    NaN