Data Types & Compression

h5lite is designed to seamlessly map R’s diverse data structures to HDF5’s portable format. This vignette explains the supported R data types, how h5lite writes them to HDF5, and how you can precisely control data types and compression when needed.

library(h5lite)
file <- tempfile(fileext = ".h5")

Supported Data Types

h5lite supports reading and writing a wide range of R data types. The table below lists the default mapping when writing to HDF5.

R Data Type	HDF5 Equivalent	Description
Numeric	variable	Selects optimal type: `uint8`, `float32`, etc.
Logical	`H5T_STD_U8LE`	Stored as 0 (FALSE) or 1 (TRUE) (`uint8`).
Character	`H5T_STRING`	Variable or fixed-length UTF-8 strings.
Complex	`H5T_COMPLEX`	Native HDF5 2.0+ complex numbers.
Raw	`H5T_OPAQUE`	Raw bytes / binary data.
Factor	`H5T_ENUM`	Integer indices with label mapping.
integer64	`H5T_STD_I64LE`	64-bit signed integers via `bit64` package.
POSIXt	`H5T_STRING`	ISO 8601 string (`YYYY-MM-DDTHH:MM:SSZ`).
List	`H5O_TYPE_GROUP`	Recursive container structure.
Data Frame	`H5T_COMPOUND`	Table of mixed types.
NULL	`H5S_NULL`	Creates a placeholder.

Dimensions: Scalars, Vectors, and Arrays

Atomic data types (Integer, integer64, Double, Logical, Character, Complex, Raw, and POSIXt) can be written to HDF5 as scalars, 1D vectors, or N-dimensional arrays.

Scalars: To write a single value as a true HDF5 scalar (0 dimensions), you must wrap the value in I().
Vectors: Standard R vectors are written as 1D arrays (Simple Dataspace with rank 1).
Arrays/Matrices: R objects with dim attributes are written as N-dimensional datasets, preserving their shape.

# 1. Scalar (0 dims)
h5_write(I(42), file, "structure/scalar")

# 2. Vector (1 dim)
h5_write(c(1, 2, 3), file, "structure/vector")

# 3. Matrix (2 dims)
h5_write(matrix(1:9, 3, 3), file, "structure/matrix")
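
The scalar, vector, and matrix cases differ only in attributes that are visible from base R. The sketch below inspects them without touching HDF5. That h5lite keys off the AsIs class added by I() and the dim attribute is an assumption drawn from the description above, not a statement about its internals.

```r
# Base R only: inspect the markers that distinguish the three shapes.
scalar <- I(42)               # I() adds the "AsIs" class
vec    <- c(1, 2, 3)          # plain vector: no dim attribute
mat    <- matrix(1:9, 3, 3)   # dim attribute c(3, 3)

inherits(scalar, "AsIs")      # TRUE
is.null(dim(vec))             # TRUE
dim(mat)                      # 3 3
```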

For more complex dimensional structures, refer to vignette('matrices').

Numeric Data

R uses 32-bit integers and 64-bit doubles. When writing with as = "auto", h5lite analyzes the range of your data to select the most compact HDF5 type.

Default: Selects optimal type based on range of values.
With NA: float64 (H5T_IEEE_F64LE)
Fractional Values: Double-precision vectors with fractional values default to float64.
Coercion: You can override this using int[8|16|32|64], uint[8|16|32|64], float[16|32|64], or bfloat16.

# Integers between 0 and 255 (uint8)
h5_write(c(1L, 2L, 3L), file, "integers/small")

# Integers with NA -> float64
h5_write(c(1L, NA, 3L), file, "integers/with_na")

# Force larger type (int16)
h5_write(1:100, file, "integers/short", as = "int16")
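
As a rough illustration of range-based selection, the helper below picks the narrowest type that can hold every value. pick_type() is hypothetical and is not part of h5lite. Its fallback to float64 for values beyond the 32-bit range is a simplification, and the package’s actual rules may differ.

```r
# Hypothetical helper (NOT an h5lite function): choose a type from the value range.
pick_type <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0)
  # NA or fractional values need a floating-point type
  if (anyNA(x) || any(x != floor(x))) return("float64")
  lo <- min(x)
  hi <- max(x)
  if (lo >= 0) {
    if (hi <= 255)        return("uint8")
    if (hi <= 65535)      return("uint16")
    if (hi <= 4294967295) return("uint32")
  } else {
    if (lo >= -128        && hi <= 127)        return("int8")
    if (lo >= -32768      && hi <= 32767)      return("int16")
    if (lo >= -2147483648 && hi <= 2147483647) return("int32")
  }
  "float64"  # simplification: anything wider falls back to float64 in this sketch
}

pick_type(c(1L, 2L, 3L))   # "uint8"
pick_type(c(1L, NA, 3L))   # "float64"
pick_type(c(-5, 200))      # "int16"
```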

64-bit Integers (`integer64`)

Default: int64 (H5T_STD_I64LE)
Coercion: none

R does not natively support 64-bit integers, but h5lite supports reading and writing them via the bit64 package.

if (requireNamespace("bit64", quietly = TRUE)) {
  val <- bit64::as.integer64(c("9223372036854775807", "-9223372036854775807"))
  h5_write(val, file, "integers/int64")
}
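
The need for bit64 follows from the limits of R’s built-in numeric types, which can be checked directly in base R:

```r
# Largest value a native (32-bit) R integer can hold
largest_int <- .Machine$integer.max
largest_int                      # 2147483647

# Doubles have a 53-bit significand, so integers above 2^53 start to collide
collides <- (2^53 + 1 == 2^53)
collides                         # TRUE
```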

Double (Numeric) Data

R’s default numeric type is double-precision.

Default: float64 (H5T_IEEE_F64LE)
Coercion: int[8|16|32|64], uint[8|16|32|64], float[16|32|64], or bfloat16

data <- rnorm(10)

# Default (float64)
h5_write(data, file, "doubles/default")

# Single precision (float32): half the storage of float64, with about 7 significant digits
h5_write(data, file, "doubles/float32", as = "float32")
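
The trade-off behind float32 can be seen without HDF5. This base R sketch round-trips a few doubles through a 4-byte representation using writeBin() and readBin(), then measures the storage and the relative error. It illustrates the number format only and makes no claim about how h5lite performs the conversion.

```r
x <- c(0.1, pi, 123456.789)

bytes64 <- writeBin(x, raw())             # 8 bytes per value
bytes32 <- writeBin(x, raw(), size = 4)   # 4 bytes per value

# Read the 4-byte values back as R doubles
x32 <- readBin(bytes32, what = "numeric", n = length(x), size = 4)

length(bytes64)                           # 24
length(bytes32)                           # 12
rel_err <- max(abs(x32 - x) / abs(x))     # small but nonzero rounding error
```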

Logical Data

Default: uint8 (H5T_STD_U8LE)
With NA: float64 (H5T_IEEE_F64LE)
Coercion: int[8|16|32|64], uint[8|16|32|64], float[16|32|64], or bfloat16

bools <- sample(c(TRUE, FALSE), 1000, replace = TRUE)

h5_write(bools, file, "logicals/packed")

Character Data

HDF5 supports two methods for storing strings. By default (as = "auto"), h5lite chooses the best approach:

Variable-Length: Used if the vector contains NA or if string lengths are highly inconsistent.
Fixed-Length: Used for short, consistent strings without NA to allow for compression.

Variable-Length:

Explicitly requested with as = "utf8" or as = "ascii".

Compressible: NO
Handles NA: YES

# UTF-8 variable length
h5_write(c("apple", "banana", NA), file, "strings/var_utf8")

# ASCII variable length
h5_write(c("A", "B", "C"), file, "strings/var_ascii", as = "ascii")

Fixed-Length:

Use as = "ascii[10]" / as = "utf8[10]" (explicit size=10) or as = "ascii[]" / as = "utf8[]" (auto-detect max length).

Compressible: YES
Handles NA: NO

# UTF-8 fixed length, width auto-detected from the longest string
h5_write(c("apple", "banana"), file, "strings/fixed_utf8", as = "utf8[]")

# ASCII fixed length (1 byte)
h5_write(c("A", "B", "C"), file, "strings/fixed_ascii", as = "ascii[1]")

Technical Note: h5lite uses H5T_C_S1 for all strings, and H5T_STR_NULLTERM for all fixed-length strings.
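
HDF5 measures fixed-length string sizes in bytes, so multi-byte UTF-8 characters need more room than their character count suggests. The sketch below estimates the width a fixed-length type would need. fixed_width() is a hypothetical helper, and whether h5lite computes its [] and [10] sizes exactly this way is an assumption.

```r
# Hypothetical helper (NOT an h5lite function): bytes needed per string.
fixed_width <- function(x) {
  if (anyNA(x)) return(NA_integer_)         # fixed-length storage cannot represent NA
  max(nchar(enc2utf8(x), type = "bytes"))   # width of the longest string, in bytes
}

fixed_width(c("apple", "banana"))   # 6
fixed_width("caf\u00e9")            # 5: four characters, but the accented e takes 2 bytes
fixed_width(c("a", NA))             # NA
```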

Dates and Times (`POSIXt`)

R date-time objects (POSIXct / POSIXlt) are stored as strings in ISO 8601 format (YYYY-MM-DDTHH:MM:SSZ). This keeps them portable to other languages and HDF5 tools that do not share R’s epoch-based numeric representation.

now <- Sys.time()
h5_write(now, file, "datetime/iso8601")
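
To show what such a string looks like, this base R sketch formats a fixed timestamp in the same pattern and parses it back. The trailing Z denotes UTC, and the pattern carries no fractional seconds. The conversion h5lite performs internally is not shown here and may differ in detail.

```r
t0  <- as.POSIXct("2024-03-01 12:30:45", tz = "UTC")

# Format as ISO 8601 in UTC
iso <- format(t0, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
iso                                  # "2024-03-01T12:30:45Z"

# Parse the string back into a POSIXct value
back <- as.POSIXct(iso, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
back == t0                           # TRUE
```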

Complex Data

R complex numbers are written using the new complex floating-point type introduced in HDF5 2.0.0 (H5T_COMPLEX_IEEE_F64LE).

Compatibility Warning: This data type for complex numbers is a feature specific to HDF5 version 2.0+. Datasets written with this type generally cannot be read by HDF5 readers built against older versions of the library (e.g., HDF5 1.10 or 1.12). Ensure that any downstream tools or libraries used to read these files are updated to support HDF5 2.0 standards.

comp <- c(1+2i, 3+4i)
h5_write(comp, file, "complex_data")
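
A double-precision complex value occupies 16 bytes: a 64-bit real part followed by a 64-bit imaginary part. The base R sketch below shows that layout for R’s in-memory representation. That the on-disk HDF5 type uses the same ordering is stated here as an assumption.

```r
comp  <- c(1+2i, 3+4i)

# Dump the raw bytes of the complex vector
bytes <- writeBin(comp, raw())
length(bytes)                                     # 32 (2 values x 16 bytes)

# Reinterpret the same bytes as plain doubles: real and imaginary parts interleaved
parts <- readBin(bytes, what = "double", n = 4)
parts                                             # 1 2 3 4
```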

Raw Data

Raw vectors (bytes) are stored as HDF5 OPAQUE types. This is ideal for storing binary blobs, images, or serialized objects where you need to preserve the exact byte sequence without interpretation.

raw_vec <- as.raw(c(0x01, 0xFF, 0x1A))
h5_write(raw_vec, file, "binary_blob")
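
One way to obtain a raw vector for an arbitrary R object is base R’s serialize(). The sketch below produces such a blob and restores it. Writing the blob with h5_write() would follow the same pattern as the example above.

```r
obj  <- list(a = 1:3, b = "text")

# Serialize to a raw vector (connection = NULL returns the bytes)
blob <- serialize(obj, connection = NULL)
is.raw(blob)                 # TRUE

# The exact bytes are enough to rebuild the object
restored <- unserialize(blob)
identical(restored, obj)     # TRUE
```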

Factors

R factors are stored as HDF5 ENUM types. The mapping from integer codes to factor levels (labels) is stored once, in the dataset’s datatype definition, so the labels are preserved without duplicating string data for every element.

fac <- factor(c("low", "high", "medium", "low"))
h5_write(fac, file, "categorical")
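
The two pieces an enum pairs together, the label table and the per-element codes, are visible in base R. Note that factor() sorts levels alphabetically unless levels is given, so the codes below do not follow the low/medium/high order. The numeric values h5lite assigns inside the file are not shown here.

```r
fac <- factor(c("low", "high", "medium", "low"))

levels(fac)       # "high" "low" "medium"  (label table, stored once)
as.integer(fac)   # 2 1 3 2                (one small code per element)
```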

Lists

R lists are mapped to HDF5 Groups. Since lists are recursive containers, h5lite walks the list and creates a dataset (or subgroup) for every element found. You can use as = c("element_name" = "skip") to exclude specific items.

my_list <- list(data = 1:100, meta = list(valid = TRUE))
h5_write(my_list, file, "types/list")
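
The recursive walk can be pictured with a small base R function that predicts which paths become groups and which become datasets. layout_of() is a hypothetical helper, not an h5lite function, and it assumes each element is stored under its list name beneath the parent path.

```r
# Hypothetical helper (NOT an h5lite function): predict the layout of a named list.
layout_of <- function(x, path) {
  if (is.list(x) && !is.data.frame(x)) {
    # A list becomes a group; recurse into each named element
    children <- lapply(names(x), function(nm) {
      layout_of(x[[nm]], paste(path, nm, sep = "/"))
    })
    c(setNames("group", path), unlist(children))
  } else {
    # Anything else becomes a dataset
    setNames("dataset", path)
  }
}

my_list <- list(data = 1:100, meta = list(valid = TRUE))
layout  <- layout_of(my_list, "types/list")
layout
```

For the list above, this yields a group at types/list, a dataset at types/list/data, a group at types/list/meta, and a dataset at types/list/meta/valid.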

Data Frames

Data frames are stored as HDF5 Compound types (tables), so the values of each row are stored together in the file. You can use the as argument to specify the type of individual columns.

For a comprehensive guide, see vignette('data-frames').

df <- data.frame(
  id = 1:5,
  score = c(10.5, 20.2, 15.0, 9.8, 30.1)
)

# 1. 'id' coerced to uint16
# 2. 'score' coerced to float32
h5_write(df, file, "types/dataframe", as = c(
  "id"    = "uint16",
  "score" = "float32"
))

NULL

The NULL object in R is mapped to a dataset with a NULL Dataspace (H5S_NULL). This creates a dataset that exists in the file structure but contains no data elements and allocates no storage for data.

h5_write(NULL, file, "placeholders/empty_slot")

Compression

HDF5 supports transparent data compression using the zlib (gzip) and szip algorithms. You can control the compression behavior using the compress argument.

"gzip-5" (default): Standard zlib compression at level 5. Levels "gzip-1" through "gzip-9" are also supported. Safe and universally compatible.
"szip-nn": Szip with Nearest Neighbor coding. Best for continuous, correlated, or floating-point data (e.g., time series or smooth gradients).
"szip-ec": Szip with Entropy Coding. Best for uncorrelated, discrete, or categorical integer data.
"none": Disables compression entirely.

# Maximum zlib compression
h5_write(rnorm(1000), file, "data/max", compress = "gzip-9")

# Szip Entropy Coding for discrete integer data
h5_write(sample(1:5, 1000, replace = TRUE), file, "data/szip", compress = "szip-ec")

The Shuffle Filter

When gzip compression is enabled, h5lite automatically applies the HDF5 Byte Shuffle Filter before the data is compressed. The Shuffle Filter does not compress data itself; rather, it rearranges the byte stream to make it more compressible by zlib.

It works by separating the bytes of each value by their significance. For example, in a 4-byte integer array:

All the 1st bytes (least significant) are grouped together.
All the 2nd bytes are grouped together.
And so on.

Why this helps:

Integers: Small integers often have many zero-padding bytes. The shuffle filter groups these zeros into long runs, which zlib compresses extremely efficiently. This allows int32 data to compress nearly as well as int8 data if the values are small.
Doubles: Floating-point numbers often share the same exponent bytes if they are in a similar range. The shuffle filter groups these identical exponent bytes, creating repetitive patterns that zlib can compress.
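
The effect can be reproduced in base R, independently of HDF5. The sketch below writes small integers as 4-byte little-endian values, regroups the bytes by position, and compares gzip sizes using memCompress(). It mimics the idea of the shuffle filter and is not the HDF5 implementation itself.

```r
set.seed(1)
x <- sample(0:200, 10000, replace = TRUE)        # small values in 4-byte integers

# Original byte stream: b0 b1 b2 b3 | b0 b1 b2 b3 | ...
bytes <- writeBin(x, raw(), size = 4, endian = "little")

# Shuffle: one column per value, then read row by row,
# so all 1st bytes come first, then all 2nd bytes, and so on
m        <- matrix(bytes, nrow = 4)
shuffled <- as.vector(t(m))

plain_size    <- length(memCompress(bytes,    type = "gzip"))
shuffled_size <- length(memCompress(shuffled, type = "gzip"))
shuffled_size < plain_size                       # TRUE: zero bytes form long runs

# The rearrangement is lossless: undoing it restores the original bytes
restored <- as.vector(t(matrix(shuffled, ncol = 4)))
identical(restored, bytes)                       # TRUE
```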