One of the most powerful features of HDF5 is its ability to compress data transparently. When you read a compressed dataset, the HDF5 library automatically decompresses it for you. This can lead to significant savings in storage space and faster I/O performance.
hdf5lib supports gzip and szip compression natively (both are built into the core HDF5 library) and additionally bundles an extensive suite of high-performance modern filter plugins. All
compression plugins bundled with hdf5lib have been
rigorously tested to ensure strict interoperability with the standard
Python h5py ecosystem.
Terminology
To navigate HDF5 compression effectively, it is helpful to understand the specific terminology used throughout this guide and the broader HDF5 ecosystem:
Filter: A function in the HDF5 data pipeline that transforms data before it is written to disk or after it is read from disk.
Pre-filter: A specific type of filter (like Bitshuffle) designed to rearrange bytes to make the data more mathematically compressible before handing it off to the main compressor.
Compressor / Codec: The algorithm (e.g., Zstd, LZ4, ZFP) that actually shrinks the physical data footprint.
Plugin: An external, dynamically loaded library (e.g., a `.so` or `.dll` file) that provides the HDF5 library with the code needed to execute a filter that isn't built into the core library.
Mode: A specific operational setting for a compressor (e.g., ZFP's "Fixed Accuracy" mode or LZ4's "High Compression" mode).
Fundamental Requirements for HDF5 Compression
Before configuring any filter, you must understand the fundamental rules of HDF5 compression:
Chunking is Mandatory: Compression in HDF5 requires the data to be stored in "chunks." You must define a chunk size for your dataset using `H5Pset_chunk()`. Applying a filter to a contiguous dataset will fail.
Filter Registration (Plugins Only): If you are using any of the bundled plugins (Zstd, LZ4, Blosc, etc.), they must be registered globally in your R session. You should do this exactly once during your package's `.onLoad` hook using `hdf5lib_register_all_filters()` and tear them down in `.onUnload` with `hdf5lib_destroy_all_filters()`. Registering filters per-I/O operation will severely degrade performance. See the Getting Started guide for the boilerplate code.
The Reader Must Have the Plugin: If you compress a dataset using an external plugin (like Zstd or LZ4), any application that opens that file must also have the corresponding HDF5 plugin installed. If you are generating files for external users and cannot guarantee they have these plugins installed, use the built-in `gzip` filter for maximum portability.
Filter Implementation Details
When applying a filter via
H5Pset_filter(plist_id, filter_id, flags, cd_nelmts, cd_values),
you must pass an array of unsigned int values known as
cd_values (Client Data Values). These values configure the
specific behavior of the compressor.
Below are the explicit configurations and examples for every filter
bundled with hdf5lib.
1. GZIP / Deflate
Recommendation: Use for Maximum Portability. GZIP is built directly into the core HDF5 library, meaning files compressed with it can be read by any standard HDF5 installation globally without needing external plugins. However, it is significantly slower than modern alternatives.
- Filter ID: `H5Z_FILTER_DEFLATE` (1)
- Elements (`cd_nelmts`): 1
- `cd_values[0]`: Compression level from `0` (none) to `9` (maximum).
You do not need to manually pack the `cd_values` array for GZIP. HDF5 provides a dedicated helper function, `H5Pset_deflate()`.
2. Zstandard (Zstd)
Recommendation: Use as the Default General-Purpose Compressor. Developed by Facebook, Zstd provides compression ratios comparable to or better than GZIP, but at substantially faster compression and decompression speeds.
- Filter ID: `32015`
- Elements (`cd_nelmts`): 1
- `cd_values[0]`: Compression level, from `1` (fastest) to `22` (best). Use `3` or `5` as a default; these provide an optimal balance of speed and file size, whereas levels above 10 offer steeply diminishing returns for significantly longer compression times.
3. LZ4
Recommendation: Use for Maximum Speed. LZ4 prioritizes extreme compression and decompression speeds (often bottlenecked only by RAM bandwidth) over finding the absolute smallest file size. Ideal for hot data, fast pipelines, or real-time I/O.
- Filter ID: `32004`
- Elements (`cd_nelmts`): 2
- `cd_values[0]`: Reserved/padding (pass `0`).
- `cd_values[1]`: Compression level. Passing `0` uses the standard, lightning-fast LZ4. Passing a value `> 0` (e.g., `9`) enables LZ4-HC (High Compression) mode.
4. Blosc2
Recommendation: Use for Parallel Compression and Complex Pipelines. Blosc2 utilizes internal thread pools to compress blocks of data in parallel. It also features a Programmable Filter Pipeline, allowing you to easily chain multiple pre-filters (like Delta followed by Bitshuffle) before the data hits the final compressor.
Our plugin implementation uses a highly efficient bitmask
architecture to specify these complex pipelines while strictly
maintaining a standard 8-element configuration, ensuring 100% layout
compatibility with standard h5py and community plugins.
Restrictions on ZFP: When using a ZFP compressor codec (`33`, `34`, `35`), you must pass floating-point data (`float32` or `float64`). ZFP performs lossy compression directly on numerical values; if bytes are shuffled before ZFP, it corrupts the dataset. To ensure safety, this plugin automatically disables all Blosc2 pre-filters if a ZFP codec is selected. You must also ensure no upstream native HDF5 filters (e.g., `H5Pset_shuffle`) are applied before Blosc2 in your property list.
Plugin Configuration
Blosc2 pipelines are defined by summing the bit values of your desired pre-filters into a single mask.
- Filter ID: `32026`
- Elements (`cd_nelmts`): 8
- `cd_values[0-3]`: Reserved (pass `0`).
- `cd_values[4]`: Compression level (`0` to `9`).
- `cd_values[5]`: Pre-filter bitmask. Sum the following values to chain filters (they will automatically execute in the optimal order: Truncprec -> Delta -> Bitshuffle/Shuffle):
  - `0` = No filter
  - `1` = Byte Shuffle
  - `2` = Bitshuffle (prioritized if both 1 and 2 are passed)
  - `4` = Delta
  - `8` = Truncate Precision
- `cd_values[6]`: Compressor ID (`0` = blosclz, `1` = lz4, `2` = lz4hc, `4` = zlib, `5` = zstd, `11` = ndlz, `33` = zfp_acc, `34` = zfp_prec, `35` = zfp_rate).
- `cd_values[7]`: Metadata value. This serves a dual purpose based on your configuration:
  - Truncate Precision: If the truncprec pre-filter is active (bitmask in `cd_values[5]` includes `8`), this integer defines the number of bits of precision to keep.
  - ZFP Metadata: Because `cd_values` are strictly passed as unsigned integers, Blosc2 applies specific internal formulas to map this 8-bit integer (`uint8_t`) to ZFP's required floating-point parameters:
    - Precision (`34`): Passed directly as an integer. For 16 bits of precision, pass `16`.
    - Accuracy (`33`): Interpreted as a base-10 exponent (`10^meta`). To pass a negative exponent like `-3` (for an accuracy tolerance of `0.001`), you must pass its 8-bit unsigned equivalent: `256 - 3 = 253`.
    - Rate (`35`): Interpreted as a percentage of the original data size. For example, to achieve a rate of 8 bits per value on 32-bit floats, the compressed output must be exactly 25% of the original size (8 / 32 = 0.25), so you pass `25`.
  - Otherwise, pass `0`.
Example 2: Multi-Filter Pipeline
Chaining the Delta filter and Bitshuffle before hitting Zstd (level 5).
Example 3: Truncate Precision Pipeline
Truncating floats to 16 bits of precision, applying Bitshuffle, and compressing with LZ4.
Example 4: Blosc2 + ZFP Precision
Using Blosc2’s meta-compressor to run ZFP in Fixed Precision mode, keeping exactly 16 bits of precision.
Example 5: Blosc2 + ZFP Accuracy
Using ZFP Accuracy mode to enforce an absolute error tolerance of
0.001 (10^-3).
5. ZFP
Recommendation: Use for High-Speed Lossy Compression. ZFP mathematically simplifies multidimensional arrays to achieve massive compression ratios. Use when perfect bit-for-bit reconstruction is not required.
Restrictions: ZFP strictly requires numeric, multidimensional arrays (natively supporting 32/64-bit integers and 32/64-bit floats). Any filter applied to the data before it reaches ZFP will completely corrupt the dataset. You must not apply upstream HDF5 pre-filters (e.g., Shuffle, Bitshuffle, Scale-Offset) before ZFP in your property list.
(Note: While ZFP can be invoked as a codec within the Blosc2 meta-compressor, doing so restricts it to floating-point data only and limits the available compression modes to Accuracy, Precision, and Rate. For integer arrays, Expert mode, or Reversible (lossless) mode, you must use this standalone ZFP plugin).
The ZFP Modes Explained Simply
Fixed Accuracy: Defines a strict absolute error tolerance. Passing `0.001` ensures no decompressed value deviates from the original by more than 0.001, regardless of whether the number is huge or tiny.
Fixed Precision: Defines a sliding scale of detail. Passing `16` keeps roughly 5 digits of detail scaled proportionally to the size of the number, allowing larger absolute errors on massive numbers and tiny absolute errors on microscopic fractions.
Fixed Rate: Defines a strict storage limit. Passing `8` forces the compressor to use exactly 8 bits of storage per value, guaranteeing a highly predictable file size.
Expert: Allows manual configuration of the underlying ZFP parameters (`minbits`, `maxbits`, `maxprec`, `minexp`) for advanced users requiring highly specific encoding profiles.
Reversible: Provides perfectly lossless compression, guaranteeing exact bit-for-bit reconstruction of the original numerical array.
Plugin Configuration
- Filter ID: `32013`
- Elements (`cd_nelmts`): 6
- `cd_values[0]`: The compression mode (`1` = Rate, `2` = Precision, `3` = Accuracy, `4` = Expert, `5` = Reversible/Lossless).
- `cd_values[1-5]`: Mode-specific parameters. For instance, in Precision mode (`cd_values[0] = 2`), `cd_values[2]` specifies the bits of precision to keep, padding the rest with zeros (e.g., `{2, 0, 16, 0, 0, 0}`).
While you can manually pack the 6-element cd_values
array, the ZFP plugin exposes convenient external helper functions that
you can declare in your C code to set the parameters effortlessly.
// Declare the external helpers provided by the bundled ZFP plugin
extern herr_t H5Pset_zfp_rate(hid_t plist, double rate);
extern herr_t H5Pset_zfp_precision(hid_t plist, unsigned int prec);
extern herr_t H5Pset_zfp_accuracy(hid_t plist, double acc);
extern herr_t H5Pset_zfp_expert(hid_t plist, unsigned int minbits, unsigned int maxbits, unsigned int maxprec, int minexp);
extern herr_t H5Pset_zfp_reversible(hid_t plist); // Lossless mode
// ... later in your code ...
// Compress data using ZFP's Accuracy mode (maintaining 0.001 tolerance)
H5Pset_zfp_accuracy(plist_id, 0.001);

6. Bitshuffle
Recommendation: Use to Boost Compression of Structured Data. Bitshuffle is not a standalone compressor; it is a pre-filter that transposes the bits in structured arrays (like integers, floats, or compound datatypes) to expose redundancy, before handing that data off to an internal compressor (LZ4 or Zstd). It vastly improves compression ratios for highly structured datasets.
- Filter ID: `32008`
- Elements (`cd_nelmts`): 3
- `cd_values[0]`: Block size. Pass `0` to let the library choose the optimal default (usually 1024).
- `cd_values[1]`: Compressor algorithm. `0` = raw bitshuffling (no compression), `2` = LZ4, `3` = Zstd.
- `cd_values[2]`: Compression level. Only applies if Zstd is selected (e.g., `5`). Pass `0` for uncompressed or LZ4.
7. Blosc (v1)
Recommendation: Use for Backwards Compatibility. Blosc v1 was a pioneering meta-compressor, but it has been entirely superseded by Blosc2 (which is faster, smarter, and supports more features). This filter is bundled strictly to ensure backward compatibility for reading older HDF5 files. Use Blosc2 for all new development.
- Filter ID: `32001`
- Elements (`cd_nelmts`): 7
- `cd_values[0-3]`: Reserved (pass `0`).
- `cd_values[4]`: Compression level (`0` to `9`). Note: Blosc enforces a universal 0-9 scale. If you select a codec with a wider range like Zstd, Blosc internally maps this 0-9 value to Zstd's 1-22 scale.
- `cd_values[5]`: Pre-filter (`0` = no filter, `1` = byte shuffle, `2` = bit shuffle).
- `cd_values[6]`: Compressor ID (`0` = blosclz, `1` = lz4, `2` = lz4hc, `3` = snappy, `4` = zlib, `5` = zstd).
8. SZIP
Recommendation: Use for Backwards
Compatibility. Historically used in legacy NASA and Earthdata
datasets, the original SZIP algorithm carried strict licensing
encumbrances that stifled widespread adoption. Modern tools like ZFP or
Blosc2 are vastly superior for scientific data. This is included
primarily for reading legacy archives (note: hdf5lib safely
circumvents historical licensing issues by bundling libaec,
a permissively licensed, drop-in replacement).
Restrictions: SZIP strictly requires numeric data (integers or floating-point numbers). It does not support compound datatypes, strings, or variable-length arrays. The chunk size (in elements) must also be an exact multiple of the block size.
- Filter ID: `H5Z_FILTER_SZIP` (4)
- Elements (`cd_nelmts`): 2
- `cd_values[0]`: Options mask (e.g., `H5_SZIP_NN_OPTION_MASK` or `H5_SZIP_EC_OPTION_MASK`).
- `cd_values[1]`: Pixels per block. Must be even, typically `8`, `16`, or `32`.
Like GZIP, SZIP has a dedicated helper function, `H5Pset_szip()`.
9. BZIP2
Recommendation: Use for Backwards Compatibility. While BZIP2 historically offered fantastic compression ratios, it is notoriously slow to compress and decompress. Zstd now achieves comparable or better ratios in a fraction of the time. Bundled strictly to allow reading of legacy archives.
- Filter ID: `307`
- Elements (`cd_nelmts`): 1
- `cd_values[0]`: Block size / compression level from `1` (fastest) to `9` (best). Default is `9`.
10. LZF & Snappy
Recommendation: Use for Backwards
Compatibility. Both LZF and Snappy were early pioneers of
extreme-speed compression. However, LZ4 generally outperforms them in
modern environments. LZF is included primarily because it was the
default high-speed compressor in early versions of the Python
h5py library.
Both LZF and Snappy are extremely fast, low-overhead algorithms. They require zero configuration parameters.
- LZF Filter ID: `32000`
- Snappy Filter ID: `32003`
- Elements (`cd_nelmts`): 0
Dynamically Loading Additional External Filters
While hdf5lib bundles an extensive suite of filters, it
cannot include every possible algorithm. For instance, high-performance
filters like SZ3 or VBZ are not
bundled in order to maintain the package’s strict C-only,
zero-dependency footprint.
However, HDF5 supports dynamically loading these algorithms at
runtime through its plugin mechanism. hdf5lib is fully
compiled with the necessary flags to enable this feature.
How It Works for the User
- Install the Filter: The user obtains and installs the desired filter plugin (`.so`, `.dylib`, or `.dll` files).
- Set the Plugin Path: The user must tell the HDF5 library where to find the installed plugins by setting the `HDF5_PLUGIN_PATH` environment variable.
Once configured, any R package linking to hdf5lib can
seamlessly read datasets using that external filter.
# Tell HDF5 where to find external filter plugins
Sys.setenv(HDF5_PLUGIN_PATH = "/opt/hdf5/plugins/")
# Now, any function that uses hdf5lib can decompress an SZ3 file automatically