Introduction
In high-performance computing, it’s common to use parallel processing to speed up data analysis. R offers several parallel programming models, which can be broadly categorized into two types:
-
Multi-threading: Using packages like
RcppParallel, where multiple threads operate within a single R process, sharing the same memory space. -
Multi-processing: Using packages like
future,parallel, orcallr, where multiple independent R processes are launched, each with its own memory space.
When multiple threads or processes need to access the same HDF5 file,
it’s critical to understand the concurrency guarantees provided by the
HDF5 library to avoid data corruption. This vignette explains how
h5lite behaves in these scenarios and provides best
practices for safe parallel I/O.
HDF5 Thread-Safety
h5lite links to the hdf5lib R package,
which provides a static build of the HDF5 library (version 1.14.6 as of
this writing). This version of HDF5 was compiled with the
--enable-threadsafe option.
What does this mean? According to the official HDF5 documentation, the thread-safety feature ensures that all HDF5 API calls are protected by a global mutex. [1] This means:
- Only one HDF5 function can execute at a time within a single process, even across multiple threads.
- It prevents race conditions and memory corruption when multiple
threads in the same process (e.g., with
RcppParallel) callh5litefunctions concurrently. - The library is “safe” in that it won’t crash or corrupt your file, but write operations are serialized, not performed in parallel.
Important: This built-in thread-safety only applies to multiple threads within a single process. It does not coordinate access between multiple independent processes.
Multi-threaded Access (e.g., RcppParallel)
When using a multi-threaded framework, the HDF5 library’s internal mutex handles synchronization for you.
Best Practices
Concurrent Reads are Safe: Multiple threads can safely call
h5_read()on the same file simultaneously. The HDF5 library will manage access, and each thread will receive the correct data.Concurrent Writes are Serialized: Multiple threads can call
h5_write()without corrupting the file. However, the global HDF5 lock means the writes will execute one at a time. This is safe but offers no performance benefit for writing. If multiple threads attempt to write, they will be blocked until the lock is released.Avoid Writing to the Same Dataset: While the library prevents file corruption, having multiple threads attempt to write to the exact same dataset path can lead to unpredictable results (i.e., which thread’s data ends up in the file last is non-deterministic). It is much safer to have each thread write to a unique dataset.
Example: Writing from Multiple Threads
Here is a complete Rcpp example using the
RcppParallel package. It demonstrates how multiple threads
can safely call back to the R h5_write() function. The HDF5
library’s internal mutex serializes these calls, preventing file
corruption.
Note: This pattern is safe but does not provide a performance boost for writing, as the writes happen one at a time. The primary benefit is for “read-heavy” parallel operations, or for simplifying code where a parallel section also needs to perform occasional, non-performance-critical writes.
// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
struct H5Writer : public RcppParallel::Worker {
const std::string file;
Rcpp::Function h5_write_r;
// Constructor to receive the file path and the R function
H5Writer(const std::string& file, Rcpp::Function h5_write_r)
: file(file), h5_write_r(h5_write_r) {}
// The parallel workhorse function
void operator()(std::size_t begin, std::size_t end) {
for (std::size_t i = begin; i < end; ++i) {
// Each thread writes to a unique dataset path
std::string dset_name = "thread_data/" + std::to_string(i);
Rcpp::NumericVector data = Rcpp::NumericVector::create(i);
// Call back into R. The HDF5 global lock makes this thread-safe.
h5_write_r(file, dset_name, data);
}
}
};
// [[Rcpp::export]]
void parallel_write_r_callback(std::string file, int n_threads) {
// Look up the h5lite::h5_write function from its namespace
Rcpp::Environment h5lite_ns = Rcpp::Environment::namespace_env("h5lite");
Rcpp::Function h5_write_r = h5lite_ns["h5_write"];
H5Writer writer(file, h5_write_r);
RcppParallel::parallelFor(0, n_threads, writer);
}Multi-process Access (e.g., future,
callr)
When multiple independent processes access the same HDF5 file, the situation is more complex. The HDF5 library’s internal thread-safety lock is irrelevant because each process has its own memory and its own instance of the library.
By default, the HDF5 library does not use file locking to coordinate access between processes. [2] This means:
- Concurrent Reads are Generally Safe: If the file is not being modified, multiple processes can read from it simultaneously without issue.
- Concurrent Writes are NOT Safe: If two or more processes attempt to write to the same HDF5 file at the same time, you are very likely to corrupt the file. This is a fundamental limitation.
- Mixed Read/Write is NOT Safe (by default): If one process is writing while another is reading, the reading process may see inconsistent or corrupted data.
Best Practice: External Locking
To safely write to a single HDF5 file from multiple processes, you must implement an external locking mechanism. The goal is to ensure that only one process can have the file open for writing at any given time.
The interprocess R package is an excellent tool for
this. It provides a file-based mutex that works across different R
processes.
The workflow is: 1. Acquire a lock using
interprocess::lock(). 2. Perform the h5lite
write operation. 3. Release the lock using
interprocess::unlock().
Example: Writing from Multiple Processes with future
and interprocess
This example uses the future framework to spawn multiple
R sessions. Each session attempts to write to the same HDF5 file, but
access is protected by an interprocess lock.
library(future)
library(future.apply)
library(interprocess)
library(h5lite)
# Use a temporary file for the HDF5 data
h5_file <- tempfile(fileext = ".h5")
# Create a lock file associated with our HDF5 file
# This lock file will coordinate access across processes.
lock_file <- paste0(h5_file, ".lock")
# Set up a parallel backend with 2 processes
plan(multisession, workers = 2)
message("HDF5 file: ", h5_file)
message("Lock file: ", lock_file)
# Use future_lapply to run this code in parallel sessions
future_lapply(1:4, function(i) {
# --- CRITICAL SECTION START ---
# Acquire the lock. This call will block until the lock is available.
# The timeout prevents it from waiting forever if something goes wrong.
lck <- interprocess::lock(lock_file, timeout = 30000)
# Now that we have the lock, it is safe to write.
message(sprintf("Process %d acquired lock, writing...", Sys.getpid()))
# Create a unique name for this process's data
dset_name <- paste0("process_data/", Sys.getpid(), "_", i)
data_to_write <- runif(5)
# Perform the write operation
h5_write(h5_file, dset_name, data_to_write)
# Simulate some work
Sys.sleep(0.5)
# Release the lock so another process can acquire it
interprocess::unlock(lck)
message(sprintf("Process %d released lock.", Sys.getpid()))
# --- CRITICAL SECTION END ---
return(TRUE)
})
# Shut down the parallel workers
plan(sequential)
# Inspect the final file - it contains data from all processes
h5_ls(h5_file, recursive = TRUE)
# Clean up
unlink(h5_file)
unlink(lock_file)Without the interprocess::lock() and
unlock() calls, the above code would be a race condition
with a high probability of creating a corrupted HDF5 file.
Advanced Topic: Single-Writer/Multiple-Reader (SWMR)
HDF5 offers a specific feature for the “one writer, many readers” scenario called SWMR. This allows one process to write to a file while multiple other processes read from it, without the readers needing to constantly re-open the file.
Currently, h5lite does not provide an
explicit API to enable or use SWMR mode. This is an advanced feature
that requires careful setup of file access properties. For use cases
requiring SWMR, a lower-level R package like rhdf5 or
hdf5r would be more appropriate.
Summary of Recommendations
| Scenario | Environment | Safety |
h5lite Action |
Best Practice |
|---|---|---|---|---|
| Concurrent Reads | Multi-threaded | Safe | h5_read() |
No special action needed. |
| Concurrent Reads | Multi-process | Safe | h5_read() |
Safe as long as no process is writing. |
| Concurrent Writes | Multi-threaded | Safe | h5_write() |
Safe, but writes are serialized. Write to unique datasets. |
| Concurrent Writes | Multi-process | UNSAFE | h5_write() |
Must use an external lock (e.g.,
interprocess). |
| Mixed Read/Write | Multi-process | UNSAFE |
h5_read()/h5_write()
|
Must use an external lock for the writer. |
References
- The HDF5 Group. (2024). Thread-safety in the HDF5 Library. [Online]. Available: https://docs.hdfgroup.org/hdf5/v1_14/develop/threadsafe.html
- The HDF5 Group. (2024). Concurrent Access to HDF5 Files. [Online]. Available: https://docs.hdfgroup.org/hdf5/v1_14/develop/conc_access.html
# Clean up the temporary file
unlink(file)