A key feature of hdf5lib is that it is compiled with
HDF5’s thread-safety mode enabled. This prevents data corruption when
multiple threads make calls to the HDF5 library simultaneously.
Important Limitations of Thread-Safety
-
Only Low-Level APIs are Thread-Safe: The HDF5 library’s
thread-safety guarantee only applies to the low-level (LL)
functions (e.g.,
H5F…,H5D…,H5P…). The High-Level (HL) APIs, such asH5LT,H5IM, andH5TB, are not thread-safe and must not be called from multiple threads. -
Threads vs. Processes: This feature provides safety for multiple
threads within a single R process. It does not
protect against multiple independent processes writing to the
same HDF5 file at the same time. To manage multi-process access, you
must use an external mechanism like file locking (e.g., via the R
package
flock).
Here is an example of a function that uses RcppParallel
to write to different datasets within the same HDF5 file in
parallel.
DESCRIPTION and Makevars
First, add RcppParallel to the LinkingTo
field in your DESCRIPTION file. Your
src/Makevars file remains the same as in the “Getting
Started” article.
C++ Code with RcppParallel
The following C++ code defines a “worker” struct that writes a simple
dataset using the thread-safe low-level API.
RcppParallel::parallelFor executes this worker on multiple
threads, with each thread writing to a different dataset.
#include <Rcpp.h>
#include <RcppParallel.h>
#include <hdf5.h>
struct H5Writer : public RcppParallel::Worker {
const std::string filename;
H5Writer(const std::string filename) : filename(filename) {}
void operator()(std::size_t begin, std::size_t end) {
for (std::size_t i = begin; i < end; i++) {
// Each thread works on a different dataset
std::string dset_name = "thread_" + std::to_string(i);
int data = static_cast<int>(i);
hsize_t dims = {1};
// HDF5's thread-safety guarantees these LL calls are safe
hid_t file_id = H5Fopen(filename.c_str(), H5F_ACC_RDWR, H5P_DEFAULT);
if (file_id < 0) continue;
hid_t space_id = H5Screate_simple(1, dims, NULL);
hid_t dset_id = H5Dcreate2(file_id, dset_name.c_str(), H5T_NATIVE_INT,
space_id, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
if (dset_id >= 0) {
H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, &data);
H5Dclose(dset_id);
}
H5Sclose(space_id);
H5Fclose(file_id);
}
}
};
//' Write to an HDF5 file in parallel
//' @param filename Path to the HDF5 file.
//' @param n_threads Number of threads to use.
//' @export
// [[Rcpp::export]]
void parallel_write(std::string filename, int n_threads) {
// Create the file first
hid_t file_id = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
if (file_id < 0) Rcpp::stop("Failed to create file.");
H5Fclose(file_id);
H5Writer writer(filename);
RcppParallel::parallelFor(0, n_threads, writer);
}Because hdf5lib enables thread-safety, the internal
mutexes within the HDF5 library will serialize the low-level API calls,
preventing a race condition.
