A generalized metric that includes Euclidean and Manhattan distance as special cases.
Usage
minkowski(
  counts,
  margin = 1L,
  power = 1.5,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
Arguments
- counts
A numeric matrix of count data (samples \(\times\) features). Typically contains absolute abundances (integer counts), though proportions are also accepted.
- margin
The margin containing samples: 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1
- power
Scaling factor for the magnitude of differences between communities (\(p\)). Default: 1.5
- norm
Normalize the incoming counts. Options are:
'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
Default: 'none'.
- pseudocount
Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.
- pairs
Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate. See examples.
- cpus
How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.
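The pairs argument refers to positions in the distance matrix. As a rough illustration of what such positions could look like, the base R sketch below lists sample pairs in the order that stats::dist() stores them. Whether minkowski() numbers its pairs in exactly this order is an assumption, and the sample names are hypothetical.

```r
# Four hypothetical samples; the names are illustrative only.
samples <- c("A", "B", "C", "D")

# combn() enumerates pairs in the same order that stats::dist() stores them:
# (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)
pair_index <- t(combn(length(samples), 2))
pair_names <- paste(samples[pair_index[, 1]], samples[pair_index[, 2]], sep = "-")

# Positions of every comparison that involves sample "A" (index 1).
a_pairs <- which(pair_index[, 1] == 1 | pair_index[, 2] == 1)

print(pair_names)
print(a_pairs)
```

Under that assumption, a numeric vector such as a_pairs (or the equivalent logical vector) would restrict the calculation to comparisons against sample "A".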
Details
The Minkowski distance is defined as: $$\sqrt[p]{\sum_{i=1}^{n} |X_i - Y_i|^p}$$
Where:
\(X_i\), \(Y_i\) : Absolute abundances of the \(i\)-th feature in the two samples being compared.
\(n\) : The number of features.
\(p\) : The geometry of the space (power parameter).
Parameter: power
The power parameter (default 1.5) determines the value of \(p\) in the equation.
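For readers who want to check the definition by hand, here is a minimal base R sketch of the formula for a single pair of samples. The helper name minkowski_pair and the two abundance vectors are hypothetical; this is not the package's implementation.

```r
# Hypothetical helper: Minkowski distance between two abundance vectors.
# Takes absolute differences, raises them to p, sums, then takes the p-th root.
minkowski_pair <- function(x, y, p = 1.5) {
  stopifnot(length(x) == length(y), p >= 1)
  sum(abs(x - y)^p)^(1 / p)
}

# Two made-up samples measured over the same three features.
x <- c(4, 0, 3)
y <- c(1, 2, 3)

d_default <- minkowski_pair(x, y)         # p = 1.5, matching the default power
d_power3  <- minkowski_pair(x, y, p = 3)  # a larger p emphasizes the biggest difference

print(c(p_1.5 = d_default, p_3 = d_power3))
```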
Special Cases
Manhattan distance: When \(p = 1\), the formula reduces to the sum of absolute differences.
Euclidean distance: When \(p = 2\), the formula reduces to the standard straight-line distance.
Chebyshev distance: When \(p \to \infty\), the formula reduces to the maximum absolute difference.
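The three special cases can be verified numerically. The sketch below uses a hand-written formula (the function name mink and the example vectors are hypothetical) and compares it against direct Manhattan, Euclidean, and Chebyshev calculations; a large finite power stands in for the limit \(p \to \infty\).

```r
# Hypothetical one-line version of the formula in Details.
mink <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

x <- c(4, 0, 3)
y <- c(1, 2, 3)

manhattan <- sum(abs(x - y))       # sum of absolute differences
euclidean <- sqrt(sum((x - y)^2))  # straight-line distance
chebyshev <- max(abs(x - y))       # largest absolute difference

is_manhattan <- isTRUE(all.equal(mink(x, y, 1), manhattan))
is_euclidean <- isTRUE(all.equal(mink(x, y, 2), euclidean))
# p = 100 approximates the limit; the tolerance is a loose, arbitrary choice.
near_chebyshev <- abs(mink(x, y, 100) - chebyshev) < 1e-6

print(c(is_manhattan, is_euclidean, near_chebyshev))
```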
Base R Equivalent:
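One way to obtain comparable values in base R is stats::dist() with method = "minkowski", sketched below. It assumes untransformed counts (norm = 'none') and samples in rows (margin = 1); the small matrix is made up for illustration, and the output has not been compared against minkowski() here.

```r
# Made-up count matrix: 3 samples (rows) by 3 features (columns).
counts <- matrix(
  c(4, 0, 3,
    1, 2, 3,
    0, 5, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), c("f1", "f2", "f3"))
)

# stats::dist() compares rows; p plays the role of the power argument.
d <- stats::dist(counts, method = "minkowski", p = 1.5)
print(d)

# If samples are stored in columns instead, transpose before calling dist().
counts_by_col <- t(counts)
d_from_cols <- stats::dist(t(counts_by_col), method = "minkowski", p = 1.5)
```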
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
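The default pseudocount rule described above can be sketched in base R as follows. The helper clr_with_pseudocount is hypothetical and operates on a single sample vector. It assumes that the pseudocount is added to every count, that the minimum is taken within that one sample, and that nothing is added when there are no zeros and no pseudocount is given; the package may differ on each of these points.

```r
# Hypothetical helper: centered log-ratio of one sample with zero handling.
clr_with_pseudocount <- function(x, pseudocount = NULL) {
  if (is.null(pseudocount)) {
    if (any(x == 0)) {
      # Default rule from the text: half the minimum non-zero value.
      pseudocount <- min(x[x > 0]) / 2
      warning("zeros detected; using pseudocount = ", pseudocount)
    } else {
      pseudocount <- 0  # assumption: nothing is added when no zeros exist
    }
  }
  log_x <- log(x + pseudocount)
  log_x - mean(log_x)  # subtracting the mean log centers the values
}

x <- c(10, 0, 4, 6)  # made-up counts containing one zero

# An explicit pseudocount runs without a warning.
clr_explicit <- clr_with_pseudocount(x, pseudocount = 1)

# The default path warns; suppressWarnings() only keeps this example quiet.
clr_default <- suppressWarnings(clr_with_pseudocount(x))

print(rbind(clr_explicit, clr_default))
```

The two rows differ, which shows concretely how the choice of pseudocount changes the transformed values that feed into the distance.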
References
Deza, M. M., & Deza, E. (2009). Encyclopedia of distances. Springer.
Minkowski, H. (1896). Geometrie der Zahlen. Teubner.
See also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
