Atomic vectors are the fundamental data structure in R. They include
numeric (integer and double), logical,
character, complex, and
raw vectors. This vignette explains how
h5lite maps these R types to HDF5 datasets and provides
guidance on controlling storage types and compression.
Basic Usage
Writing a vector to HDF5 is straightforward using
h5_write(). The package automatically creates the necessary
dataset and handles dimensions.
Scalars vs. 1D Arrays
In R, a “scalar” is simply a vector of length 1. However, HDF5
distinguishes between a Scalar Dataspace (a single
value with no dimensions) and a Simple Dataspace (an
array) with dimensions [1].
By default, h5lite treats length-1 vectors as 1D arrays
to maintain consistency with R’s vector behavior. To write a true HDF5
scalar, you must wrap the value in I().
# 1. Default: 1D Array (Length 1)
h5_write(42, file, "structure/array_1d")
# 2. Explicit Scalar: Wrapped in I()
h5_write(I(42), file, "structure/scalar")
h5_str(file, "structure")
#> structure/
#> ├── array_1d <uint8 × 1>
#> └── scalar <uint8 scalar>
Note: When reading data back into R, both storage formats appear as standard R vectors of length 1.
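For readers unfamiliar with I(): in base R it only tags an object with the class "AsIs" and leaves its value and length unchanged. The sketch below uses base R only (no h5lite calls) to show the marker that, per the text above, requests a scalar dataspace. How h5lite inspects the marker internally is not shown here.

```r
x <- 42
s <- I(42)

length(s)             # still a length-1 vector
inherits(s, "AsIs")   # TRUE: the wrapper adds the "AsIs" class
inherits(x, "AsIs")   # FALSE: a plain numeric carries no marker

# Removing the class recovers the original value unchanged
identical(unclass(s), x)
```

Because the wrapper changes only the class attribute, the same value can be written either way without any conversion.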
Numeric and Logical Data
Automatic Type Selection
h5lite attempts to map R types to the most efficient
HDF5 equivalents automatically (as = "auto").
- Numeric: h5lite analyzes the range of your data and picks the smallest fitting HDF5 type (e.g., uint8, int16, int32, float64).
- Logicals: h5lite maps these to uint8 (0 or 1) in HDF5 to save space.
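The idea of range-based selection can be sketched in base R. The helper smallest_hdf5_type() below is hypothetical and written only for illustration; it is not h5lite's implementation, and the exact set of candidate types and their ordering are assumptions.

```r
# Hypothetical helper: pick the narrowest type name whose range covers x.
# This is an illustration of the idea, not h5lite's actual algorithm.
smallest_hdf5_type <- function(x) {
  stopifnot(is.numeric(x) || is.logical(x))
  if (is.logical(x)) x <- as.integer(x)

  # Missing or fractional values cannot be held by an integer type
  if (anyNA(x) || any(x != trunc(x))) return("float64")
  if (length(x) == 0L) return("uint8")

  lo <- min(x)
  hi <- max(x)
  if (lo >= 0) {
    if (hi <= 255) return("uint8")
    if (hi <= 65535) return("uint16")
    if (hi <= 4294967295) return("uint32")
  } else {
    if (lo >= -128 && hi <= 127) return("int8")
    if (lo >= -32768 && hi <= 32767) return("int16")
    if (lo >= -2147483648 && hi <= 2147483647) return("int32")
  }
  "float64"  # fallback for anything outside the ranges above
}

smallest_hdf5_type(c(1L, 2L, 3L))   # "uint8"
smallest_hdf5_type(c(-5, 300))      # "int16"
smallest_hdf5_type(c(TRUE, FALSE))  # "uint8"
smallest_hdf5_type(1.5)             # "float64"
```

To see what h5lite actually chose for a dataset, inspect the file with h5_typeof() rather than relying on this sketch.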
Handling Missing Values (NA)
A key challenge in HDF5 is that standard integer and boolean types do
not have a native representation for NA (missing
values).
To ensure data safety, h5lite performs the following
check:
- If an integer or logical vector contains NA, it is automatically promoted to float64.
- The NA values are stored as an NaN variant in the file.
- When read back, h5_read() restores them as numeric vectors with NA.
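The effect of this promotion can be reproduced in base R without touching a file. The sketch below only mirrors the integer-to-double conversion described above; it is an analogy, not h5lite's internal code path.

```r
x <- c(1L, NA, 3L)

# Promotion to double keeps the missing value as NA
promoted <- as.double(x)
typeof(x)          # "integer"
typeof(promoted)   # "double"
is.na(promoted)    # FALSE  TRUE FALSE

# If the integer type matters downstream, convert back explicitly
restored <- as.integer(promoted)
identical(restored, x)
```

The practical consequence is that an integer or logical vector containing NA comes back as numeric, so any code that depends on the original type needs an explicit conversion after reading.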
# Integer vector with NO missing values -> Automatic optimal type (uint8)
h5_write(c(1L, 2L, 3L), file, "safe/ints")
h5_typeof(file, "safe/ints")
#> [1] "uint8"
# Integer vector WITH missing values -> Promoted to float64
h5_write(c(1L, NA, 3L), file, "safe/ints_na")
h5_typeof(file, "safe/ints_na")
#> [1] "float64"
Character Vectors (Strings)
HDF5 supports two primary methods for storing strings: Variable-Length and Fixed-Length.
Automatic Type Selection
By default (as = "auto"), h5lite chooses
the most efficient string representation:
- If the vector contains NA, it uses Variable-Length UTF-8 (which natively supports missing values).
- If there are no missing values and the strings are relatively short and consistent in length, it uses Fixed-Length UTF-8 to allow for compression and faster access.
Variable-Length
You can explicitly request variable-length storage using
as = "utf8" or as = "ascii".
- Pros: Most flexible; exact memory usage per string; supports NA (stored as NULL pointers).
- Cons: Cannot be compressed using standard HDF5 filters; slower to read/write for extreme dataset sizes.
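A rough way to see the "exact memory usage per string" trade-off is to compare byte counts in base R. The sketch below assumes fixed-length storage reserves the width of the longest string for every element, and it ignores per-element pointer and heap overhead on the variable-length side, so the numbers are indicative only.

```r
x <- c("a", "bb", strrep("c", 10))

# Bytes needed by each string's payload
bytes <- nchar(x, type = "bytes")

# Fixed-length: every element occupies the width of the longest string
fixed_total <- length(x) * max(bytes)

# Variable-length: each element occupies only its own payload
variable_total <- sum(bytes)

c(fixed = fixed_total, variable = variable_total)
```

When string lengths vary widely, as here, the fixed-length layout spends most of its space on padding; when lengths are uniform, the two totals converge and the fixed-length layout gains the ability to be compressed.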
Compression
Compression in HDF5 requires the dataset to be “chunked”.
h5lite handles chunking parameters automatically when you
enable compression.
You can enable compression using the compress
argument:
- compress = TRUE (default): Uses zlib (deflate) level 5.
- compress = 9: Uses zlib level 9 (max compression, slower).
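A minimal sketch of both settings follows, using only the h5_write() and h5_str() calls shown elsewhere in this vignette. The temporary file path, the dataset names, and the sample data are assumptions made for illustration, and the block is guarded so it does nothing when h5lite is not installed.

```r
if (requireNamespace("h5lite", quietly = TRUE)) {
  library(h5lite)

  # Assumed scratch file and sample data, for illustration only
  file <- tempfile(fileext = ".h5")
  x <- rep(1:10, times = 1000)

  # Default: compress = TRUE
  h5_write(x, file, "compressed/default")

  # Maximum zlib level: smaller output, slower writes
  h5_write(x, file, "compressed/max", compress = 9)

  h5_str(file, "compressed")
}
```

Highly repetitive data like this sample compresses well at any level; for less regular data, the higher level may cost noticeably more write time for a modest size gain.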
64-bit Integers
R does not natively support 64-bit integers, but the
bit64 package provides an integer64 class.
h5lite supports reading and writing these types directly to
HDF5 int64.
if (requireNamespace("bit64", quietly = TRUE)) {
val <- bit64::as.integer64(c("9223372036854775807", "-9223372036854775807"))
h5_write(val, file, "huge_ints")
in_val <- h5_read(file, "huge_ints")
print(class(in_val))
}
#> [1] "numeric"
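The class printed above is numeric, meaning an R double. As general background (a property of double-precision arithmetic, not a statement about h5lite's reading options), doubles represent every integer exactly only up to 2^53, so the extreme values in this example cannot be distinguished from their neighbours once held as numeric. The base-R sketch below demonstrates the limit; whether h5lite can return integer64 on read is not covered in this section.

```r
# Largest range in which every integer is exactly representable as a double
limit <- 2^53

# Beyond the limit, adjacent integers collapse to the same double
limit + 1 == limit   # TRUE

# The int64 maximum (2^63 - 1) rounds to 2^63 when computed in doubles
2^63 - 1 == 2^63     # TRUE
```

If exact 64-bit values matter, it is worth verifying a round trip on representative data before relying on the numeric result.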