Introduction
The data.frame is R’s primary data structure for tabular
data, containing columns of potentially different types.
h5lite provides first-class support for
data.frame objects by mapping them to a native HDF5
structure called a compound dataset.
This vignette explains how data.frame objects are
written and read, and provides technical details on the underlying HDF5
implementation.
For details on other data structures, see
vignette("atomic-vectors") and
vignette("matrices").
1. Writing and Reading Data Frames
Writing a data.frame is a one-line command with
h5_write().
library(h5lite)
# Temporary HDF5 file used throughout this vignette
file <- tempfile(fileext = ".h5")
my_df <- data.frame(
  trial = 1:4,
  sample_id = c("A1", "A2", "B1", "B2"),
  value = c(10.2, 11.1, 9.8, 10.5),
  pass_qc = c(TRUE, TRUE, FALSE, TRUE),
  condition = factor(c("control", "treat", "control", "treat"))
)
h5_write(file, "my_df", my_df)
You can inspect the object with h5_str() and
h5_class(). h5lite identifies it as a
data.frame backed by a compound HDF5 type.
Reading the data back with h5_read() restores it as an R
data.frame.
read_df <- h5_read(file, "my_df")
str(read_df)
#> 'data.frame': 4 obs. of 5 variables:
#> $ trial : num 1 2 3 4
#> $ sample_id: chr "A1" "A2" "B1" "B2"
#> $ value : num 10.2 11.1 9.8 10.5
#> $ pass_qc : num 1 1 0 1
#> $ condition: Factor w/ 2 levels "control","treat": 1 2 1 2
Data Type Fidelity
h5lite aims for a high-fidelity round-trip, but there
are two important conversions to note:
- integer -> numeric: All integer columns (trial in our example) are read back as numeric (double-precision) vectors. This is a safety measure to prevent integer overflow.
- logical -> numeric: logical columns (pass_qc) are stored as 8-bit integers (0/1) and are also read back as numeric.
factor columns, however, are perfectly preserved.
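The overflow concern behind the integer conversion can be seen in base R alone. The following is a minimal sketch that does not call h5lite; the variable names are illustrative. It relies only on the fact that R's integer type is 32-bit, so a stored value wider than that has no integer representation in R, while a double can hold it.

```r
# R integers are 32-bit; this is the largest value they can represent.
big <- .Machine$integer.max

# Integer arithmetic past that limit yields NA (with a warning).
overflowed <- suppressWarnings(big + 1L)
is.na(overflowed)

# The same value held as a double is represented exactly.
as_double <- as.numeric(big) + 1
as_double == 2147483648

# Logical values map to 0/1 when viewed as numeric.
qc_numeric <- as.numeric(c(TRUE, TRUE, FALSE, TRUE))
qc_numeric
```

If integer or logical columns are needed after reading, they can be converted back with as.integer() or as.logical() once the values are known to be in range.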
Let’s verify the round-trip by manually converting the original
data.frame to match the expected output.
# To verify the round-trip, convert integer/logical columns to numeric
my_df_cmp <- my_df
my_df_cmp$trial <- as.numeric(my_df_cmp$trial)
my_df_cmp$pass_qc <- as.numeric(my_df_cmp$pass_qc)
all.equal(read_df, my_df_cmp)
#> [1] TRUE
Advanced Details: The HDF5 Compound Type
When you write a data.frame, h5lite does
not save each column as a separate dataset. Instead, it creates a single
HDF5 dataset with a compound datatype.
A compound type is analogous to a struct in C. It is a
collection of named members, where each member has its own datatype. For
a data.frame, this structure looks like:
- HDF5 Dataset: A 1D array, where the length is the number of rows in the data.frame.
- HDF5 Datatype: A compound type where:
  - Each member of the struct corresponds to a column of the data.frame.
  - The member’s name is the column name.
  - The member’s datatype is the HDF5 equivalent of the R column’s type (e.g., H5T_FLOAT64 for numeric, H5T_STRING for character, H5T_ENUM for factor).
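The struct analogy can be made concrete in base R. This is only a conceptual sketch of the row-record view, not how h5lite lays out bytes on disk; the names df, records, and member_types are illustrative. Each row becomes one record with named members, and each member carries the type of its column.

```r
df <- data.frame(
  trial = 1:2,
  sample_id = c("A1", "A2"),
  value = c(10.2, 11.1),
  stringsAsFactors = FALSE
)

# One record per row: the length of the 1D "dataset" is the number of rows.
records <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
length(records)

# Member names of each record are the column names.
names(records[[1]])

# Each member has its own type, taken from the column.
member_types <- vapply(df, typeof, character(1))
member_types
```

A data.frame stores data column by column, whereas a compound dataset stores it record by record, which is why the two views are shown side by side here.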
This approach has several advantages for HDF5 experts and interoperability:
- Portability: A compound dataset is a standard, self-describing HDF5 structure. A Python user with h5py or a C++ user can read this dataset and immediately get a structured array or a vector of structs, with all column names and types preserved.
- Atomicity: The entire table is a single object in the HDF5 file, which can be easier to manage than a group containing many separate column-datasets.
- Efficiency: For many access patterns, reading a single contiguous block of compound data can be more efficient than reading from multiple disparate datasets.
Preserving data.frame Attributes
Like other R objects, data.frames can have metadata
attached. The most common is row.names. To ensure these are
saved, use attrs = TRUE.
df_with_attrs <- my_df
row.names(df_with_attrs) <- df_with_attrs$sample_id
attr(df_with_attrs, "description") <- "My experiment data"
h5_write(file, "df_with_attrs", df_with_attrs, attrs = TRUE)
# Inspect the HDF5 attributes created
h5_ls_attr(file, "df_with_attrs")
#> [1] "names" "class" "row.names" "description"
# Read back with attributes
read_df_with_attrs <- h5_read(file, "df_with_attrs", attrs = TRUE)
# Manually adjust for type conversions before comparing
df_with_attrs$trial <- as.numeric(df_with_attrs$trial)
df_with_attrs$pass_qc <- as.numeric(df_with_attrs$pass_qc)
all.equal(read_df_with_attrs, df_with_attrs)
#> [1] TRUE
Note: The row.names attribute is read back correctly because of a special rule in h5_read(). It detects the attribute named "row.names", and if it is a numeric vector, it is coerced back to integer to satisfy R’s requirements for a valid data.frame.
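The coercion described in the note can be illustrated in base R without h5lite. This is a sketch under the assumption that a reader returns integer data as numeric; rn_as_read is a hypothetical stand-in for such a value. Default row names are held as integers, so a numeric copy has to be converted back before it matches the original.

```r
df <- data.frame(x = c(10, 20, 30))

# Default (automatic) row names are integers.
rn <- attr(df, "row.names")
is.integer(rn)

# Hypothetical stand-in for row names that came back as numeric.
rn_as_read <- as.numeric(rn)
is.integer(rn_as_read)

# Coercing back to integer recovers the original attribute value.
restored <- as.integer(rn_as_read)
identical(restored, rn)
```

In the example above the row names are character (taken from sample_id), so this numeric path applies only to data frames with integer row names.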
# Clean up the temporary file
unlink(file)