Pangeo Showcase: "HDF5 at the Speed of Zarr"


Title: “HDF5 at the Speed of Zarr”
Invited Speaker: Luis Lopez (ORCID:0000-0003-4896-3263)
When: Wednesday March 13, 4PM EDT
Where: Launch Meeting - Zoom
Abstract: As flexible and powerful as HDF5 can be, it comes with big tradeoffs when it’s accessed from remote storage systems, mainly because the file format and the client I/O libraries were designed for local and supercomputing workflows. As scientific data and workflows migrate to the cloud, efficient access to data stored in HDF5 format is a key factor that will accelerate or slow down “science in the cloud” across all disciplines.

We have been working on testing the implementation of recently available features in the HDF5 stack that result in performant access to HDF5 from remote cloud storage. This performance appears to be on par with modern cloud-native formats like Zarr, but with the advantage of not having to reformat the data or generate metadata sidecar files (DMR++, Kerchunk).
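
As an illustration only (not code from the talk): one of the newer HDF5 features in this space is paged file-space allocation, which packs the file metadata into fixed-size pages near the front of the file. With a recent h5py (3.x, built against HDF5 ≥ 1.10.1) this can be requested at write time; the file name, page size, and dataset below are placeholders.

```python
# Illustrative sketch: write an HDF5 file with paged file-space allocation so
# metadata is aggregated into pages at the front of the file.
import h5py
import numpy as np

with h5py.File(
    "example_cloud_optimized.h5",
    "w",
    fs_strategy="page",            # aggregate metadata into file-space pages
    fs_page_size=4 * 1024 * 1024,  # 4 MiB pages; tune to your metadata volume
) as f:
    f.create_dataset(
        "elevation",
        data=np.random.rand(2048, 2048),
        chunks=(256, 256),
        compression="gzip",
    )
```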

  • 20 minutes - Community Showcase
  • 40 minutes - Showcase Discussion/Community Check-ins

Here is a list of useful links for this presentation:

  • Slides
  • H5Cloud project repository
  • IS2 cloud access
  • Discussion on HDF performance


Just watching your talk, sorry I missed it. Thanks for doing this!

“Access patterns matter” - couldn’t agree more.

All HDF5 files ought to be written in “consolidated” form with the metadata up front! :slight_smile:

Some thoughts:

  • does your “cloud optimised” also include rechunking the data for some preferred use case?
  • “first” is always the recommended cache pattern for HDF5 in any of the cloud FS backends. This is what kerchunk, for instance, uses all the time. There has been an argument that fsspec should try to choose the right cache per file type, but “readahead” is still the best for stream rather than random access. Also, the cache size could be bigger, but that is also usage dependent.
  • I wonder if you tried tuning anything for benchmarking with kerchunk (of course I have extra knowledge here)
  • in your position, I would make some explicit comparison of the pros/cons of rewriting the data in C.O.HDF5 versus rewriting in zarr

Hi Martin, thanks for watching!

does your “cloud optimised” also include rechunking the data for some preferred use case?

For the tests we ran, we did not alter the original chunking of the data, but chunking is definitely a factor when we optimize access. It’s a tricky question, right? Too small and we issue a lot of I/O requests; too big and we run into other issues like memory pressure and decompression speed.
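
Just to make that tradeoff concrete, one could rechunk while copying with h5py, along these lines (purely illustrative; the dataset name and chunk shape are made up, and this is not something we did for the benchmarks):

```python
# Purely illustrative: copy a dataset into a new file with a different chunk
# shape, aiming for chunks of a few MB each. Names are placeholders.
import h5py

with h5py.File("original.h5", "r") as src, h5py.File("rechunked.h5", "w") as dst:
    data = src["elevation"][...]       # assumes the dataset fits in memory
    dst.create_dataset(
        "elevation",
        data=data,
        chunks=(512, 512),             # larger chunks -> fewer requests, but
        compression="gzip",            # more memory/decompression per read
    )
```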

“first” is always the recommended cache pattern for HDF5 in any of the cloud FS backends.

I think this is really important. I wonder if packages like earthaccess, pystac-client, xarray, etc. could take these recommended settings and turn them into enforced defaults. Almost nobody knows why this matters for cloud access.
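
For reference, this is roughly the pattern I mean (just a sketch; the bucket, object, and block size are placeholders, and it assumes fsspec/s3fs plus h5py):

```python
# Sketch: open the remote file with fsspec's "first" cache so the metadata at
# the front of the file is fetched once, then hand the file object to h5py.
import fsspec
import h5py

fs = fsspec.filesystem("s3", anon=True)
with fs.open(
    "s3://some-bucket/some-granule.h5",
    mode="rb",
    cache_type="first",           # keep the first block (the metadata) cached
    block_size=8 * 1024 * 1024,   # big enough to cover the metadata region
) as remote_file:
    with h5py.File(remote_file, "r") as ds:
        print(list(ds))
```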

Kerchunk

Not formally. I noticed that Kerchunk was way faster with cloud-optimized HDF5, but for that I needed to tell it how to open the file (cache type “first”, buffer size equal to the size of the metadata block). I forked kerchunk and modified it a bit to allow this.
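
Roughly, it looks like this (only a sketch with placeholder paths, not the exact code in the fork):

```python
# Sketch: open the cloud-optimized file with the "first" cache and a block
# size that covers the metadata, then let kerchunk index it as usual.
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://some-bucket/cloud_optimized.h5"
fs = fsspec.filesystem("s3", anon=True)

with fs.open(url, "rb", cache_type="first", block_size=8 * 1024 * 1024) as f:
    refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

with open("references.json", "w") as out:
    json.dump(refs, out)
```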

Re-writing in Zarr vs CO-HDF5

I think (personal opinion) that if our data model fits in Zarr, we should use it! However, in the context of remote sensing data, there are plenty of mission requirements that make it hard to depart from HDF5. In those cases, re-writing as CO-HDF5 could be better bang for the buck. I think that comparison needs to be written up, perhaps in the form of a short technical paper…

I would say there is no “right”, only optimal cases for each access pattern. There are, however, some very wrong cases :slight_smile:

packages like earthaccess, pystac-client, xarray, etc. could take these recommended settings and turn them into enforced defaults.

I don’t see why they shouldn’t have decent defaults, allowing users to override them if necessary.

I noticed that Kerchunk was way faster with cloud-optimized HDF5

I meant for the final reading of the dataset, not the scanning phase, but this can be important too.

I forked kerchunk and modified it a bit to allow this.

Feel free to share/propose any PR. I wonder, is there a way to know the size of the metadata area, i.e., the bit it’s worth caching, before starting to scan a file?