One of the most powerful features of HDF5 is its ability to compress data transparently. When you read a compressed dataset, the HDF5 library automatically decompresses it for you. This can lead to significant savings in storage space and faster I/O performance by reducing the amount of data that needs to be transferred from disk.
hdf5lib natively supports gzip and szip compression, bundles a
full suite of high-performance modern filter plugins, and enables
the HDF5 library’s dynamic filter mechanism, allowing users to load
an even wider range of external compression algorithms at runtime.
Native Compression (gzip and szip)
hdf5lib bundles the zlib and
libaec libraries, which provide the deflate
(commonly known as gzip) and szip compression
algorithms directly to the core HDF5 library. This means any package
linking to hdf5lib can create and read these datasets
out-of-the-box, with no extra configuration or plugin registration.
To create a compressed dataset, you must do two things:
- Enable Chunking: Compression in HDF5 requires the data to be stored in “chunks.” You must define a chunk size for your dataset.
- Set the Filter: You must add the deflate (or other) filter to the dataset creation property list.
C++ Example
The following Rcpp example demonstrates how to create a
chunked and gzip-compressed dataset.
#include <Rcpp.h>
#include <hdf5.h>
#include <vector>
//' Create a compressed dataset
//'
//' @param filename Path to the HDF5 file.
//' @param dsetname Name of the dataset to create.
//' @export
// [[Rcpp::export]]
void create_compressed_dataset(std::string filename, std::string dsetname) {
// Some sample data
std::vector<int> data(1000, 42);
hsize_t dims[1] = { data.size() };
// 1. Create the file and dataspace
hid_t file_id = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
hid_t space_id = H5Screate_simple(1, dims, NULL);
// 2. Create a dataset creation property list
hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
// 3. Enable chunking (required for compression)
hsize_t chunk_dims[1] = { 100 };
H5Pset_chunk(dcpl_id, 1, chunk_dims);
// 4. Set the deflate (gzip) filter with compression level 6
H5Pset_deflate(dcpl_id, 6);
// 5. Create the dataset using the property list
hid_t dset_id = H5Dcreate2(file_id, dsetname.c_str(), H5T_NATIVE_INT,
space_id, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
// 6. Write the data
H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());
// 7. Clean up all resources
H5Pclose(dcpl_id);
H5Dclose(dset_id);
H5Sclose(space_id);
H5Fclose(file_id);
}
When you read this dataset back (using any HDF5-aware tool), the decompression will be handled automatically.
Bundled Filter Plugins
In addition to native GZIP and SZIP, hdf5lib bundles
several high-performance filter plugins.
Initializing the Plugins (Best Practice)
To use the non-native plugins (everything except GZIP and SZIP), you
must explicitly register them using the functions provided in
hdf5lib.h.
Crucial Warning on Performance: Registering filters
with hdf5lib_register_all_filters() modifies
the global state of the HDF5 library and initializes the Blosc2 engine,
which spins up a background pool of worker threads for parallel
compression. Tearing this down via
hdf5lib_destroy_all_filters() joins and destroys those
threads.
You should never register and destroy filters
per-I/O operation. Doing so will severely impact performance through
thread thrashing and locking overhead. Instead, it is highly recommended
to tie registration to your R package’s lifecycle using
.onLoad and .onUnload hooks.
1. Create C Wrappers (e.g.,
src/init.c)
#include <Rinternals.h>
#include "hdf5lib.h"
SEXP r_register_hdf5_filters(void) {
  hdf5lib_register_all_filters();
  return R_NilValue;
}
SEXP r_destroy_hdf5_filters(void) {
  hdf5lib_destroy_all_filters();
  return R_NilValue;
}
2. Set up R Hooks (e.g., R/zzz.R)
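A minimal sketch of the corresponding hooks, assuming the C wrappers above are registered as native routines in your package (the PACKAGE name below is a placeholder):

```r
.onLoad <- function(libname, pkgname) {
  # Register the bundled filters once, when the package is loaded
  .Call("r_register_hdf5_filters", PACKAGE = pkgname)
}

.onUnload <- function(libpath) {
  # Tear down the filters (and the Blosc2 worker threads) once, at unload
  .Call("r_destroy_hdf5_filters", PACKAGE = "yourpkg")  # replace "yourpkg" with your package name
}
```

This keeps registration tied to the package lifecycle, so filters are set up exactly once per session rather than per I/O operation.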
Available Filters Summary
The following table summarizes the compression filters available
natively or via bundled plugins in hdf5lib.
| Filter Name | HDF5 Filter ID | Target Use Case | Data Restrictions |
|---|---|---|---|
| GZIP (Deflate) | 1 (Native) | General purpose, universal compatibility. | None |
| SZIP | 4 (Native) | Legacy scientific data compression. | Numeric data only. Strict block size rules. |
| BZIP2 | 307 | General purpose, high compression ratio. | None |
| LZF | 32000 | Very fast, lightweight compression. | None |
| Blosc / Blosc2 | 32001 / 32026 | Meta-compressor. Extremely fast, multi-threaded capable block compression. | None |
| Snappy | 32003 | Fast, low-CPU overhead compression (Google). | None |
| LZ4 | 32004 | Lightning-fast decompression. | None |
| Bitshuffle | 32008 | Reorders bits for highly compressible layouts. | None |
| ZFP | 32013 | Lossy/lossless compression for multidimensional arrays. | Numeric arrays only (float/int). |
| Zstandard (Zstd) | 32015 | Modern standard balancing extreme speed and high compression. | None |
Filter Details and cd_values
When applying a filter via
H5Pset_filter(plist_id, filter_id, flags, cd_nelmts, cd_values),
you must pass an array of unsigned int values known as
cd_values (Client Data Values). These configure how the
filter behaves (e.g., compression level).
Below are the expected parameters for each bundled filter.
GZIP / Deflate (H5Z_FILTER_DEFLATE)
Instead of H5Pset_filter, you can apply this
conveniently using H5Pset_deflate(plist_id, level).
- Elements (cd_nelmts): 1
- cd_values[0]: Compression level from 0 (no compression) to 9 (maximum compression). The default is usually 6.
SZIP (H5Z_FILTER_SZIP)
You can apply it using
H5Pset_szip(plist_id, options_mask, pixels_per_block).
- Restriction: Numeric data only. Does not support strings or variable length types. The chunk size (in elements) must be an exact multiple of the block size.
- Elements (cd_nelmts): 2
- cd_values[0]: Options mask (e.g., H5_SZIP_NN_OPTION_MASK or H5_SZIP_EC_OPTION_MASK).
- cd_values[1]: Pixels per block. Must be even and is typically 8, 16, or 32.
BZIP2 (H5Z_FILTER_BZIP2)
- Elements (cd_nelmts): 1
- cd_values[0]: Block size / compression level from 1 (fastest) to 9 (best). The default is 9.
LZF (H5Z_FILTER_LZF)
- Elements (cd_nelmts): 0 (no configuration parameters required).
Blosc (H5Z_FILTER_BLOSC) & Blosc2
(H5Z_FILTER_BLOSC2)
- Elements (cd_nelmts): 7
- cd_values[0-3]: Reserved (pass 0).
- cd_values[4]: Compression level (0 to 9).
- cd_values[5]: Shuffle mode (0 = no shuffle, 1 = byte shuffle, 2 = bit shuffle).
- cd_values[6]: Compressor ID (0 = blosclz, 1 = lz4, 2 = lz4hc, 3 = snappy, 4 = zlib, 5 = zstd, 6 = zfp (Blosc2 only), 11 = ndlz (Blosc2 only)).
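As a sketch of how this layout maps onto code, the helper below (a hypothetical convenience function, not part of hdf5lib) fills the seven-element cd_values array for a Blosc filter; the H5Pset_filter call it would feed is shown in a comment, since it requires an open property list.

```c
#include <assert.h>

/* Hypothetical helper: build the 7-element cd_values array for the
   Blosc filter (ID 32001) following the layout documented above. */
static void blosc_cd_values(unsigned int cd[7], unsigned int level,
                            unsigned int shuffle, unsigned int compressor) {
    cd[0] = cd[1] = cd[2] = cd[3] = 0; /* reserved: pass 0 */
    cd[4] = level;                     /* compression level, 0 to 9 */
    cd[5] = shuffle;                   /* 0 = none, 1 = byte, 2 = bit shuffle */
    cd[6] = compressor;                /* compressor ID, e.g. 5 = zstd */
}

/* Applying it to a dataset creation property list would then look like:
 *   unsigned int cd[7];
 *   blosc_cd_values(cd, 5, 1, 5);  // level 5, byte shuffle, zstd
 *   H5Pset_filter(dcpl_id, 32001, H5Z_FLAG_OPTIONAL, 7, cd);
 */
```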
Snappy (H5Z_FILTER_SNAPPY)
- Elements (cd_nelmts): 0 (no configuration parameters required).
LZ4 (H5Z_FILTER_LZ4)
- Elements (cd_nelmts): 2
- cd_values[0]: Reserved/padding (pass 0).
- cd_values[1]: Compression level. Passing 0 uses standard fast LZ4; a value > 0 (e.g., 9) enables LZ4-HC (high compression).
Bitshuffle (H5Z_FILTER_BSHUF)
- Elements (cd_nelmts): 2
- cd_values[0]: Block size. Pass 0 to let the library choose the optimal default (1024).
- cd_values[1]: Compression algorithm. Pass 0 for raw bitshuffling (no compression) or 2 to apply LZ4 compression after shuffling.
ZFP (H5Z_FILTER_ZFP)
- Restriction: Numeric arrays only. Does not support strings, compound data types, or 1D arrays of bytes.
- Elements (cd_nelmts): 6
- cd_values[0]: The compression mode. 1 = Rate, 2 = Precision, 3 = Accuracy, 4 = Expert, 5 = Reversible (lossless).
- cd_values[1-5]: Mode-specific parameters, padding the rest with zeros. For instance, in Precision mode (cd_values[0] = 2), cd_values[2] specifies the bits of precision to keep (e.g., {2, 0, 16, 0, 0, 0}).
Zstandard / Zstd (H5Z_FILTER_ZSTD)
- Elements (cd_nelmts): 1
- cd_values[0]: Compression level, generally from 1 to 22. A good default is 3.
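To apply any of these plugin filters, swap the H5Pset_deflate call in the earlier example (step 4) for a generic H5Pset_filter call. A minimal fragment for Zstandard, assuming the filter has already been registered via hdf5lib_register_all_filters():

```c
// 4. Set the Zstandard filter (ID 32015) at compression level 3
unsigned int cd_values[1] = { 3 };
H5Pset_filter(dcpl_id, 32015, H5Z_FLAG_OPTIONAL, 1, cd_values);
```

Passing H5Z_FLAG_OPTIONAL lets the write proceed uncompressed if the filter fails, whereas H5Z_FLAG_MANDATORY would make the write fail instead.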
Dynamically Loading Additional External Filters
While hdf5lib bundles an extensive suite of filters, it
cannot include every possible algorithm. For instance, high-performance
filters written in C++ like SZ3 or VBZ
are not bundled in order to maintain the package’s strict C-only,
zero-dependency compilation footprint.
However, HDF5 supports dynamically loading these algorithms at
runtime through its plugin mechanism. These plugins are shared libraries
(.so on Linux, .dylib on macOS,
.dll on Windows) that the HDF5 library can load when it
encounters a compressed dataset.
hdf5lib is compiled with the necessary flags to enable
this dynamic loading feature.
How It Works for the User
- Install the Filter: The user must obtain and install the desired filter plugin (such as the SZ3 or VBZ plugin). These can often be compiled from source or are distributed by other scientific software packages (e.g., h5py for Python).
- Set the Plugin Path: The user must tell the HDF5 library where to find the installed plugins by setting the HDF5_PLUGIN_PATH environment variable.
Once the environment is configured, any R package that links to
hdf5lib (such as h5lite or your own package)
will be able to read and write datasets using that external filter
without any changes to its own C/C++ code.
For example, if a user has the SZ3 filter plugin installed in
/opt/hdf5/plugins, they can enable it in R like this:
# Tell HDF5 where to find filter plugins
Sys.setenv(HDF5_PLUGIN_PATH = "/opt/hdf5/plugins/")
# Now, any function that uses hdf5lib can read an SZ3-compressed file.
# For example, with 'h5lite' which links against hdf5lib:
# data <- h5lite::h5_read("my_sz3_file.h5", "my_dataset")
This powerful design means that as a package developer linking to
hdf5lib, you get native GZIP/SZIP and the bundled plugins
for free, while your package automatically retains the ability to work
with any external compression filter your users have installed.
