Chao1 alpha diversity metric.
Usage
chao1(counts, cpus = n_cpus())
Arguments
- counts
An OTU abundance matrix where each column is a sample and each row is an OTU. Any object coercible with as.matrix() can be given here, as well as phyloseq, rbiom, SummarizedExperiment, and TreeSummarizedExperiment objects.
- cpus
How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.
Details
The Chao1 index is a non-parametric estimator that seeks to predict the true species richness of a community, including species that were likely missed due to undersampling. It works by using the counts of the rarest observed taxa — specifically "singletons" (taxa seen only once) and "doubletons" (taxa seen twice) — to estimate how many species were not detected at all. The logic is that if you find many species represented by only one or two individuals, it is highly probable that many other rare species were missed entirely.
Important Caveat: The Chao1 estimator is mathematically dependent on singleton counts. However, modern bioinformatic pipelines that generate ASVs (like DADA2) are designed to remove singletons, as they are often indistinguishable from sequencing errors. Applying Chao1 to data that has had singletons removed produces a scientifically meaningless result: with no singletons the correction term is zero, so the estimate simply equals the observed richness (or is NaN when there are no doubletons either). This metric is therefore considered methodologically unsound for most modern ASV-based workflows and should be used with extreme caution.
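The effect of singleton removal can be seen in a minimal base-R sketch. The helper chao1_sketch() is hypothetical: it only mirrors the formula given in the Calculation section and is not the package's implementation. The counts in x are invented for illustration.

```r
# Hypothetical helper mirroring the documented formula D = n + a^2 / (2b).
# This is an illustration only, not the package's chao1() implementation.
chao1_sketch <- function(x) {
  n <- sum(x > 0)   # observed richness (non-zero OTUs)
  a <- sum(x == 1)  # singletons
  b <- sum(x == 2)  # doubletons
  n + a^2 / (2 * b)
}

# Made-up sample: four abundant taxa, one doubleton, three singletons.
x <- c(50, 30, 20, 10, 2, 1, 1, 1)
chao1_sketch(x)           # 8 + 9/2 = 12.5

# Simulate a pipeline that discards singletons.
x_filtered <- replace(x, x == 1, 0)
chao1_sketch(x_filtered)  # 5 + 0/2 = 5
sum(x_filtered > 0)       # 5, the observed richness
```

After filtering, the estimate carries no information beyond the observed richness, which is the situation the caveat warns about.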
Calculation
Prerequisite: all counts are whole numbers.
In the formulas below, x is a single column (sample) from counts. \(n\) is the total number of non-zero OTUs, \(a\) is the number of singletons, and \(b\) is the number of doubletons.
$$D = \displaystyle n + \frac{a^{2}}{2b}$$
Note that when \(x\) does not have any singletons or doubletons (\(a = 0, b = 0\)), the result will be NaN. When \(x\) has singletons but no doubletons (\(a > 0, b = 0\)), the result will be Inf.
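As a worked illustration, take the Saliva column of ex_counts from the Examples section, whose counts are 162, 2, 0, 180, 1, 0. Four OTUs are non-zero, one is a singleton, and one is a doubleton:

$$n = 4,\quad a = 1,\quad b = 1 \quad\Rightarrow\quad D = 4 + \frac{1^{2}}{2 \cdot 1} = 4.5$$

The Gums column (793, 4, 0, 87, 1, 1) has \(n = 5\), \(a = 2\), and \(b = 0\), so the fraction divides by zero and the result is Inf. Both values agree with the output shown in the Examples section.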
References
Chao A 1984. Non-parametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11:265-270.
Examples
# Example counts matrix
ex_counts
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  793   22     1
#> Bacteroides            2    4    2   611
#> Corynebacterium        0    0  498     1
#> Haemophilus          180   87    2     1
#> Propionibacterium      1    1  251     0
#> Staphylococcus         0    1  236     1
# Chao1 diversity values
chao1(ex_counts)
#> Saliva   Gums   Nose  Stool
#>    4.5    Inf    6.0    Inf
# Low diversity
chao1(c(100, 1, 1, 1, 1)) # Inf
#> [1] Inf
# High diversity
chao1(c(20, 20, 20, 20, 20)) # NaN
#> [1] NaN
# Low richness
chao1(1:3) # 3.5
#> [1] 3.5
# High richness
chao1(1:100) # 100.5
#> [1] 100.5