Chao1

Chao1 alpha diversity metric.

Usage

chao1(counts, cpus = n_cpus())

Arguments

counts: An OTU abundance matrix where each column is a sample, and each row is an OTU. Any object coercible with as.matrix() can be given here, as well as phyloseq, rbiom, SummarizedExperiment, and TreeSummarizedExperiment objects.
cpus: How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Value

A numeric vector.

Details

The Chao1 index is a non-parametric estimator that seeks to predict the true species richness of a community, including species that were likely missed due to undersampling. It works by using the counts of the rarest observed taxa — specifically "singletons" (taxa seen only once) and "doubletons" (taxa seen twice) — to estimate how many species were not detected at all. The logic is that if you find many species represented by only one or two individuals, it is highly probable that many other rare species were missed entirely.

Important Caveat: The Chao1 estimator depends directly on singleton counts. However, modern bioinformatic pipelines that generate ASVs (like DADA2) are designed to remove singletons, as they are often indistinguishable from sequencing errors. Once singletons have been removed, the correction term is zero, so Chao1 either equals the observed richness or, if the sample also has no doubletons, is NaN. Either way the result is scientifically meaningless. Therefore, this metric is considered methodologically unsound for most modern ASV-based workflows and should be used with extreme caution.
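The effect of singleton removal can be illustrated with a small base R sketch. The counts are made up for illustration, and `chao1_formula` is a hypothetical helper that restates the formula from the Calculation section; it is not part of the package and says nothing about how chao1() is implemented.

```r
# Made-up counts for one sample; not real data.
x <- c(50, 12, 2, 2, 1, 1, 1, 0)

# Hypothetical helper: observed richness plus the singleton/doubleton correction.
chao1_formula <- function(x) sum(x > 0) + sum(x == 1)^2 / (2 * sum(x == 2))

chao1_formula(x)             # 7 observed + 3^2 / (2 * 2) = 9.25

# Mimic a pipeline that discards singletons by zeroing them out.
x_filtered <- replace(x, x == 1, 0)

chao1_formula(x_filtered)    # 4 observed + 0 / 4 = 4
sum(x_filtered > 0)          # observed richness is also 4
```

After filtering, the estimate carries no information beyond the observed richness.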

Calculation

Prerequisite: all counts are whole numbers.

In the formulas below, $x$ is a single column (sample) from counts, $n$ is the number of OTUs with non-zero counts in $x$, $a$ is the number of singletons, and $b$ is the number of doubletons.

$$D = \displaystyle n + \frac{a^{2}}{2b}$$

  x <- c(1, 0, 3, 2, 6)
  sum(x>0) + (sum(x==1) ^ 2) / (2 * sum(x==2))
  #>  4.5

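Note that when $x$ has neither singletons nor doubletons ($a = 0, b = 0$), the result will be NaN. When $x$ has singletons but no doubletons ($a > 0, b = 0$), the result will be Inf. When $x$ has doubletons but no singletons ($a = 0, b > 0$), the correction term is zero and the result is simply $n$.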
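The formula and its edge cases can be sketched in base R as below. The name `chao1_sketch` and the matrix `m` are hypothetical and used only for illustration. The function mirrors the formula above for one sample at a time and is not meant to describe the package's internal implementation.

```r
# Hypothetical helper (not part of the package): Chao1 for a single sample.
chao1_sketch <- function(x) {
  stopifnot(all(x == round(x)))  # prerequisite: whole-number counts
  n <- sum(x > 0)                # OTUs with non-zero counts
  a <- sum(x == 1)               # singletons
  b <- sum(x == 2)               # doubletons
  n + a^2 / (2 * b)              # 0/0 yields NaN, a^2/0 yields Inf
}

chao1_sketch(c(1, 0, 3, 2, 6))   # a = 1, b = 1: 4.5
chao1_sketch(c(5, 0, 3, 2, 6))   # a = 0, b = 1: 4 (observed richness)
chao1_sketch(c(1, 0, 3, 4, 6))   # a = 1, b = 0: Inf
chao1_sketch(c(5, 0, 3, 4, 6))   # a = 0, b = 0: NaN

# Apply column-wise to a small made-up counts matrix (samples in columns).
m <- matrix(c(1, 0, 3, 2, 6,
              5, 0, 3, 4, 6),
            ncol = 2, dimnames = list(NULL, c("S1", "S2")))
apply(m, 2, chao1_sketch)
```

Downstream code that summarizes these values may want to check them with is.finite() first, since both Inf and NaN propagate through means and sums.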

References

Chao A 1984. Non-parametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11:265-270.

Examples

    # Example counts matrix
    ex_counts
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  793   22     1
#> Bacteroides            2    4    2   611
#> Corynebacterium        0    0  498     1
#> Haemophilus          180   87    2     1
#> Propionibacterium      1    1  251     0
#> Staphylococcus         0    1  236     1
    
    # Chao1 diversity values
    chao1(ex_counts)
#> Saliva   Gums   Nose  Stool 
#>    4.5    Inf    6.0    Inf 
    
    # Low diversity
    chao1(c(100, 1, 1, 1, 1)) # Inf
#> [1] Inf
    
    # High diversity
    chao1(c(20, 20, 20, 20, 20)) # NaN
#> [1] NaN
    
    # Low richness
    chao1(1:3) # 3.5
#> [1] 3.5
    
    # High richness
    chao1(1:100) # 100.5
#> [1] 100.5