Introduction
Matrices and multi-dimensional arrays are workhorses of data analysis
in R. h5lite is designed to make saving these objects to, and loading
them from, HDF5 files as seamless as possible, automatically handling
dimensions and data layout.
This vignette covers the basics of working with matrices and arrays,
and then dives into two important technical details: the
dimnames limitation and the automatic handling of row-major
vs. column-major data ordering.
For details on other data structures, see
vignette("atomic-vectors") and
vignette("data-frames").
Writing and Reading Matrices
Writing a matrix or array is as simple as calling
h5_write(). h5lite will automatically detect
the dimensions of your R object and create an HDF5 dataset with a
corresponding dataspace.
# A simple 2x3 matrix
my_matrix <- matrix(1:6, nrow = 2, ncol = 3)
print(my_matrix)
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
h5_write(file, "my_matrix", my_matrix)You can verify the dimensions of the on-disk dataset using
h5_dim().
h5_dim(file, "my_matrix")
#> [1] 2 3

When you read the data back with h5_read(),
h5lite restores the dimensions, giving you an identical R
matrix.
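For example (a quick sketch; read_matrix is just an illustrative name):

read_matrix <- h5_read(file, "my_matrix")

# The restored object has the same dimensions and values
dim(read_matrix)
all.equal(read_matrix, my_matrix)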
The same process works for higher-dimensional arrays.
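As a sketch, a three-dimensional array is written with exactly the same
call; the name my_array used here matches the dataset that appears in
the file listing at the end of this vignette.

# A 2x3x2 array is written just like a matrix
my_array <- array(1:12, dim = c(2, 3, 2))
h5_write(file, "my_array", my_array)

# The on-disk dataspace mirrors all three dimensions
h5_dim(file, "my_array")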
The dimnames Limitation
A crucial limitation to be aware of involves dimnames
(the row and column names of a matrix). HDF5 does not have a native way
to store these, and R implements them as a list
attribute.
The h5lite function h5_write_attr() cannot
write list-like attributes. Attempting to write a matrix that has
dimnames with attrs = TRUE therefore results in an
error.
named_matrix <- matrix(1:4, nrow = 2, ncol = 2,
dimnames = list(c("row1", "row2"), c("col1", "col2")))
str(attributes(named_matrix))
#> List of 2
#> $ dim : int [1:2] 2 2
#> $ dimnames:List of 2
#> ..$ : chr [1:2] "row1" "row2"
#> ..$ : chr [1:2] "col1" "col2"
# This will fail because the 'dimnames' attribute is a list.
h5_write(file, "named_matrix", named_matrix, attrs = TRUE)
#> Error in validate_attrs(data, attrs): Attribute 'dimnames' cannot be written to HDF5 because its type ('list') is not supported. Only atomic vectors and factors can be written as attributes.

Workaround
The solution is to either remove the dimnames before
writing or, more simply, write with attrs = FALSE (the
default). This will successfully write the matrix data but will discard
the dimnames.
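If you do need to keep the names, one option is to store them as
ordinary character datasets and reattach them after reading. The
following is only a sketch (the file and dataset names are illustrative,
not an h5lite convention); the simpler attrs = FALSE route is shown
next.

# Sketch: keep dimnames by writing them as separate character datasets
names_file <- tempfile(fileext = ".h5")
h5_write(names_file, "named_matrix", named_matrix)
h5_write(names_file, "named_matrix_rownames", rownames(named_matrix))
h5_write(names_file, "named_matrix_colnames", colnames(named_matrix))

# Reattach the names after reading
m <- h5_read(names_file, "named_matrix")
dimnames(m) <- list(h5_read(names_file, "named_matrix_rownames"),
                    h5_read(names_file, "named_matrix_colnames"))
unlink(names_file)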
# This works, but the dimnames are not saved.
h5_write(file, "named_matrix", named_matrix, attrs = FALSE)
read_named_matrix <- h5_read(file, "named_matrix")
# The data is correct, but the names are gone.
print(read_named_matrix)
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
dimnames(read_named_matrix)
#> NULL

Advanced Details: Row-Major vs. Column-Major Order
One of the most common sources of error when using HDF5 from R is the mismatch between the two systems' data layouts.
- R stores matrices and arrays in column-major order. In memory, the elements of the first column are contiguous, followed by the second column, and so on.
- HDF5 (along with C, C++, and Python’s NumPy) uses row-major order. The elements of the first row are contiguous in memory/on disk.
h5lite completely automates the
transposition required to move between these two layouts.
How it Works
Consider our 2x3 my_matrix:
print(my_matrix)
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6

- On h5_write(): The C-level code reads the R object's column-major data (1, 2, 3, 4, 5, 6) and transposes it into a row-major buffer (1, 3, 5, 2, 4, 6) before writing it to the HDF5 file.
- On h5_read(): The C-level code reads the row-major data from the file and transposes it back into R's native column-major layout.
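You can see both orderings directly in R: flattening the matrix gives
the column-major sequence, while flattening its transpose gives the
row-major sequence that ends up on disk. A small base-R sketch:

# Column-major: how R stores my_matrix in memory
as.vector(my_matrix)
#> [1] 1 2 3 4 5 6

# Row-major: the element order written to the HDF5 file
as.vector(t(my_matrix))
#> [1] 1 3 5 2 4 6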
This “it just works” behavior is a core design principle of
h5lite. It ensures that the matrix you read back is
identical to the one you wrote, without requiring you to perform manual
array transpositions (aperm()) or think about data
ordering. This is a significant convenience compared to lower-level HDF5
interfaces.
Compression
For large matrices, using compression can significantly reduce file
size. Simply set compress = TRUE (which uses a default
compression level of 5) or specify an integer from 1-9.
h5lite automatically creates chunked storage when
compression is enabled, which is a prerequisite for HDF5 compression
filters.
large_matrix <- matrix(rnorm(1e6), nrow = 1000)
# Write with default compression
h5_write(file, "large_matrix_compressed", large_matrix, compress = TRUE)
# Write without compression
h5_write(file, "large_matrix_uncompressed", large_matrix, compress = FALSE)
# List the datasets now stored in the file
h5_ls(file, full.names = TRUE)
#> [1] "my_matrix" "my_array"
#> [3] "named_matrix" "large_matrix_compressed"
#> [5] "large_matrix_uncompressed"Note: For random data like in this example, compression is not very effective. It works best on data with repeating patterns or low entropy.
# Clean up the temporary file
unlink(file)