One of the most powerful features of HDF5 is its ability to compress data transparently. When you read a compressed dataset, the HDF5 library automatically decompresses it for you. Compression can significantly reduce storage requirements and often improves I/O performance, since less data needs to be transferred from disk.
hdf5lib provides built-in support for gzip
compression and enables the HDF5 library’s dynamic filter mechanism,
allowing users to leverage a wide range of other compression
algorithms.
## Built-in Compression (gzip/deflate)
hdf5lib bundles the zlib library, which
provides the deflate (commonly known as gzip)
compression algorithm. This means that any package linking to
hdf5lib can create and read gzip-compressed
datasets out-of-the-box, with no extra configuration.
To create a compressed dataset, you must do two things:
- Enable Chunking: Compression in HDF5 requires the data to be stored in “chunks.” You must define a chunk size for your dataset.
- Set the Filter: You must add the deflate filter to the dataset creation property list.
### C++ Example
The following Rcpp example demonstrates how to create a
chunked and compressed dataset.

```cpp
#include <Rcpp.h>
#include <hdf5.h>
#include <vector>
//' Create a compressed dataset
//'
//' @param filename Path to the HDF5 file.
//' @param dsetname Name of the dataset to create.
//' @export
// [[Rcpp::export]]
void create_compressed_dataset(std::string filename, std::string dsetname) {
// Some sample data
std::vector<int> data(1000, 42);
  hsize_t dims[1] = { data.size() };
// 1. Create the file and dataspace
hid_t file_id = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
hid_t space_id = H5Screate_simple(1, dims, NULL);
// 2. Create a dataset creation property list
hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
// 3. Enable chunking (required for compression)
  hsize_t chunk_dims[1] = { 100 };
H5Pset_chunk(dcpl_id, 1, chunk_dims);
// 4. Set the deflate (gzip) filter with compression level 6
H5Pset_deflate(dcpl_id, 6);
// 5. Create the dataset using the property list
hid_t dset_id = H5Dcreate2(file_id, dsetname.c_str(), H5T_NATIVE_INT,
space_id, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
// 6. Write the data
H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());
// 7. Clean up all resources
H5Pclose(dcpl_id);
H5Dclose(dset_id);
H5Sclose(space_id);
H5Fclose(file_id);
}
```

When you read this dataset back (using any HDF5-aware tool), the decompression will be handled automatically.
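Reading the data back requires no compression-specific code at all. The following companion sketch shows this from Rcpp (the function name `read_compressed_dataset` is illustrative, but the HDF5 calls are standard): the library detects the deflate filter from the dataset's metadata and decompresses each chunk on the fly during `H5Dread`.

```cpp
#include <Rcpp.h>
#include <hdf5.h>
#include <cstddef>
#include <vector>

//' Read back a compressed dataset
//'
//' @param filename Path to the HDF5 file.
//' @param dsetname Name of the dataset to read.
//' @export
// [[Rcpp::export]]
Rcpp::IntegerVector read_compressed_dataset(std::string filename, std::string dsetname) {
  // Open the file and dataset; no filter-specific arguments are needed
  hid_t file_id = H5Fopen(filename.c_str(), H5F_ACC_RDONLY, H5P_DEFAULT);
  hid_t dset_id = H5Dopen2(file_id, dsetname.c_str(), H5P_DEFAULT);

  // Ask the dataspace how many elements to allocate
  hid_t space_id = H5Dget_space(dset_id);
  hssize_t n = H5Sget_simple_extent_npoints(space_id);

  // H5Dread decompresses each chunk transparently as it is read
  std::vector<int> data(static_cast<std::size_t>(n));
  H5Dread(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

  H5Sclose(space_id);
  H5Dclose(dset_id);
  H5Fclose(file_id);
  return Rcpp::wrap(data);
}
```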
## External Filters (Blosc, LZ4, etc.)
HDF5 supports a much wider range of compression algorithms (e.g.,
Blosc, LZ4, Bzip2) through a dynamic plugin mechanism. These plugins are
shared libraries (.so on Linux, .dylib on
macOS, .dll on Windows) that the HDF5 library can load at
runtime to handle new compression formats.
hdf5lib is compiled with the necessary flags to enable
this dynamic loading feature. However, hdf5lib does
not bundle any external filters. It is the end-user’s
responsibility to install them.
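Because the plugin search happens at runtime, a package can probe whether a given filter is usable before attempting to read or write with it. Below is a minimal sketch (the helper name `filter_available` is illustrative); the numeric IDs shown are the ones publicly registered with the HDF Group (e.g., 307 for bzip2, 32001 for Blosc, 32004 for LZ4), which are worth double-checking against the official filter registry.

```cpp
#include <Rcpp.h>
#include <hdf5.h>

//' Check whether an HDF5 compression filter is available
//'
//' @param filter_id Registered HDF5 filter ID (e.g., 307 for bzip2,
//'   32001 for Blosc, 32004 for LZ4 -- verify against the HDF Group's
//'   filter registry).
//' @export
// [[Rcpp::export]]
bool filter_available(int filter_id) {
  // H5Zfilter_avail reports whether HDF5 can currently use the filter;
  // for dynamically loaded plugins this depends on HDF5_PLUGIN_PATH
  // being set correctly (exact plugin-probing behavior varies across
  // HDF5 versions).
  htri_t avail = H5Zfilter_avail((H5Z_filter_t) filter_id);
  return avail > 0;
}
```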
### How It Works for the User
- Install the Filter: The user must obtain and install the desired filter plugin. These are often distributed by other scientific software packages (e.g., h5py for Python) or can be compiled from source.
- Set the Plugin Path: The user must tell the HDF5 library where to find the installed plugins by setting the HDF5_PLUGIN_PATH environment variable.
Once the environment is configured, any R package that links to
hdf5lib (such as h5lite or your own package)
will be able to read and write datasets using that filter without any
changes to its own code.
For example, if a user has the Blosc filter plugin installed in
/opt/hdf5/plugins, they can enable it in R like this:

```r
# Tell HDF5 where to find filter plugins
Sys.setenv(HDF5_PLUGIN_PATH = "/opt/hdf5/plugins/")
# Now, any function that uses hdf5lib can read a Blosc-compressed file.
# For example, with 'h5lite' which links against hdf5lib:
# data <- h5lite::h5_read("my_blosc_file.h5", "my_dataset")
```

This powerful design means that as a package developer linking to
hdf5lib, you get gzip support for free, and
your package automatically gains the ability to work with any
compression filter your users have installed.
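For completeness, here is what the write side looks like with an external filter. Instead of the dedicated H5Pset_deflate call, code uses the generic H5Pset_filter with the filter's registered ID. The sketch below is hypothetical: it assumes the LZ4 plugin (registered ID 32004, to be verified against the filter registry) is installed and discoverable via HDF5_PLUGIN_PATH.

```cpp
#include <hdf5.h>

// Request the LZ4 filter (registered ID 32004 -- an assumption to
// verify against the HDF Group's filter registry) on a dataset
// creation property list. H5Z_FLAG_OPTIONAL lets dataset creation
// proceed even if the plugin is missing; use H5Z_FLAG_MANDATORY to
// fail instead.
herr_t set_lz4_filter(hid_t dcpl_id) {
  const H5Z_filter_t FILTER_LZ4 = 32004;
  // No client-data values are passed here; many filters accept
  // tuning parameters through the cd_values array.
  return H5Pset_filter(dcpl_id, FILTER_LZ4, H5Z_FLAG_OPTIONAL, 0, NULL);
}
```

As with gzip, chunking must still be enabled on the same property list (via H5Pset_chunk) before the dataset is created.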
