Title: “HDF5 at the Speed of Zarr”
Invited Speaker: Luis Lopez (ORCID:0000-0003-4896-3263)
When: Wednesday March 13, 4PM EDT
Where: Launch Meeting - Zoom
Abstract: As flexible and powerful as HDF5 can be, it comes with big tradeoffs when it’s accessed from remote storage systems, mainly because the file format and the client I/O libraries were designed for local and supercomputing workflows. As scientific data and workflows migrate to the cloud, efficient access to data stored in HDF5 format is a key factor that will accelerate or slow down “science in the cloud” across all disciplines.
We have been testing recently available features in the HDF5 stack that result in performant access to HDF5 from remote cloud storage. This performance appears to be on par with modern cloud-native formats like Zarr, but with the advantage of not having to reformat the data or generate sidecar metadata files (DMR++, Kerchunk).
- 20 minutes - Community Showcase
- 40 minutes - Showcase Discussion/Community Check-ins
Just watching your talk, sorry I missed it. Thanks for doing this!
“Access patterns matter” - couldn’t agree more.
All HDF5 files ought to be written in “consolidated” form with the metadata up front!
Some thoughts:
- does your “cloud optimised” also include rechunking the data for some preferred use case?
- “first” is always the recommended cache pattern for HDF5 in any of the cloud FS backends. This is what kerchunk, for instance, uses all the time. There has been an argument that fsspec should try to choose the right cache per file type, but “readahead” is still best for streaming rather than random access. Also, the cache size could be bigger, but that too is usage dependent (see the sketch after this list).
- I wonder if you tried tuning anything for benchmarking with kerchunk (of course I have extra knowledge here)
- in your position, I would make an explicit comparison of the pros/cons of rewriting the data as C.O. HDF5 versus rewriting it in Zarr
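To make the caching point above concrete, here is a minimal sketch with s3fs and h5py; the bucket path and block size are made up for illustration.

```python
import h5py
import s3fs

# Hypothetical public bucket/object, for illustration only.
fs = s3fs.S3FileSystem(anon=True)
url = "s3://example-bucket/some-granule.h5"

# "first" keeps the first block_size bytes of the file cached, which is
# where the metadata lives when it has been written up front; "readahead"
# would instead cache a window following the last read, which suits
# streaming access rather than the random seeks h5py does.
with fs.open(url, mode="rb", cache_type="first", block_size=8 * 1024 * 1024) as f:
    with h5py.File(f, mode="r") as h5:
        print(list(h5.keys()))
```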
Hi Martin, thanks for watching!
does your “cloud optimised” also include rechunking the data for some preferred use case?
For the tests we ran, we did not alter the original chunking of the data. It’s definitely a factor when we optimize access, and it’s a tricky question: too small and we get a lot of I/O requests, too big and we run into other issues like memory use and decompression speed.
“first” is always the recommended cache pattern for HDF5 in any of the cloud FS backends.
I think this is really important. I wonder if packages like earthaccess, pystac-client, xarray, etc. can turn the recommended settings into enforced defaults; almost nobody knows why this is important for cloud access.
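For example, today a user has to know to do something like the following by hand (a sketch; the URL, credentials, and sizes are placeholders), which is exactly the kind of thing those libraries could turn into a sensible default:

```python
import s3fs
import xarray as xr

# Placeholder path and anonymous access, for illustration only.
fs = s3fs.S3FileSystem(anon=True)
url = "s3://example-bucket/some-granule.h5"

# The cache settings a library could apply on the user's behalf.
f = fs.open(url, mode="rb", cache_type="first", block_size=4 * 1024 * 1024)
ds = xr.open_dataset(f, engine="h5netcdf")  # assumes the h5netcdf backend is installed
print(ds)
```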
Kerchunk
Not formally. I noticed that Kerchunk was way faster with cloud-optimized HDF5, but for that I needed to tell it how to open the file (cache type “first”, buffer size equal to the size of the metadata). I forked kerchunk and modified it a bit to allow this.
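Roughly, the idea was something like the following (a sketch, not my actual fork; the path and block size are illustrative):

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/cloud-optimized-granule.h5"  # hypothetical path

# Open the remote file with the "first" cache and a buffer roughly the
# size of the consolidated metadata region (8 MiB here is just a guess).
with fsspec.open(url, mode="rb", anon=True,
                 default_cache_type="first",
                 default_block_size=8 * 1024 * 1024) as f:
    refs = SingleHdf5ToZarr(f, url).translate()  # kerchunk reference dict
```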
Re-writing in Zarr vs CO-HDF5
I think (personal opinion) that if our data model fits in Zarr, we should use it! However, in the context of remote sensing data, there are plenty of mission requirements that make it hard to depart from HDF5. In those cases, rewriting as CO-HDF5 could be a better bang for the buck. I think that comparison needs to be written up, perhaps in the form of a short technical paper…
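For anyone wondering what “rewriting as CO-HDF5” can look like in practice, here is a rough sketch that repacks a file with HDF5’s paged file-space strategy so the metadata is aggregated (file names and the page size are illustrative, and it assumes a reasonably recent h5py/HDF5):

```python
import h5py

# Copy an existing file into a new one written with the "page" file-space
# strategy, so metadata is aggregated into fixed-size pages that are easy
# to cache; the 4 MiB page size is only an example.
with h5py.File("input.h5", "r") as src, h5py.File(
    "cloud_optimized.h5", "w",
    libver="latest",
    fs_strategy="page",
    fs_page_size=4 * 1024 * 1024,
) as dst:
    for name in src:
        src.copy(name, dst)  # copies each top-level group/dataset with its attributes
```

Something like `h5repack -S PAGE -G 4194304 input.h5 cloud_optimized.h5` should be roughly equivalent on the command line.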
I would say there is no “right”, only optimal cases for each access pattern. There are, however, some very wrong cases.
packages like earthaccess, pystac-client, xarray, etc. can turn the recommended settings into enforced defaults.
I don’t see why they shouldn’t have decent defaults, allowing users to override them if necessary.
I noticed that Kerchunk was way faster with cloud-optimized HDF5
I meant for final reading of the dataset, not the scanning phase - but this can be important too
I forked kerchunk and modified it a bit to allow this.
Feel free to share/propose any PR. I wonder, is there a way to know the size of the metadata area, i.e., the bit it’s worth caching, before starting to scan a file?
“first” is always the recommended cache pattern for HDF5 in any of the cloud FS backends.
Is this under the assumption that the metadata is “consolidated” up front? If not, it seems odd to me that this would be the recommended cache type for files without consolidated metadata up front.
Is this under the assumption that the metadata is “consolidated” up front?
No; I have found that even without explicit consolidation, the bulk of the metadata tends to be in the first few MB, and the reader will seek there repeatedly. The rest of the metadata may be scattered anywhere amongst the data chunks, so there is no reasonable way to readahead or otherwise cache/pre-fetch for those.
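One rough way to check where the data actually starts in a given file, and hence how big the region worth caching is, is something like this sketch (it assumes a recent h5py and only gives an approximation):

```python
import h5py

def data_start_offset(path):
    """Lowest file offset at which any dataset's raw data begins; bytes
    before this are (approximately) the metadata region worth caching."""
    lowest = None

    def visit(name, obj):
        nonlocal lowest
        if isinstance(obj, h5py.Dataset):
            if obj.chunks:
                # chunked layout: offset of the first stored chunk (rough)
                off = (obj.id.get_chunk_info(0).byte_offset
                       if obj.id.get_num_chunks() > 0 else None)
            else:
                # contiguous layout: offset of the raw data block
                off = obj.id.get_offset()
            if off is not None and (lowest is None or off < lowest):
                lowest = off

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return lowest

print(data_start_offset("example.h5"))  # hypothetical local/cached file
```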
No; I have found that even without explicit consolidation, the bulk of the metadata tends to be in the first few MB, and the reader will seek there repeatedly. The rest of the metadata may be scattered anywhere amongst the data chunks, so there is no reasonable way to readahead or otherwise cache/pre-fetch for those.
In my experiments with GEDI files, where I’m performing subsetting, I’m finding several other cache types to be much better than the “first” cache type. Specifically, I’m finding “all,” “blockcache,” “background,” and “mmap” all to be roughly 30% more performant than “first.”
I’m running more comprehensive tests using various combinations of cache type and block size to gather some performance statistics, so I’m happy to share once I have more numbers.
OK, look forward to seeing your results.
I’m running more comprehensive tests using various combinations of cache type and block size to gather some performance statistics, so I’m happy to share once I have more numbers.
This is great news! I was thinking of doing the same with random samples from the NASA DAACs. I think the cache configuration could be even “smarter” if we know the chunking in advance (via kerchunk or dmrpp). cc @TomNicholas @ayushnag
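As a rough idea of what “smarter” could mean here, one could derive a block-size hint from the chunk sizes recorded in a kerchunk reference file, along these lines (a sketch; the file name and the heuristic are made up):

```python
import json
import statistics

# Load a kerchunk (version 1) reference file; raw-chunk entries look like
# [url, offset, length], while inlined metadata entries are plain strings.
with open("references.json") as f:
    refs = json.load(f)["refs"]

chunk_sizes = [v[2] for v in refs.values() if isinstance(v, list) and len(v) == 3]

# Pick a block size around the typical chunk size, rounded to whole MiB.
typical = statistics.median(chunk_sizes)
block_size = max(1, round(typical / 2**20)) * 2**20
print(f"median chunk ~{typical / 2**20:.2f} MiB -> block_size = {block_size} bytes")
```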
Without getting into much detail, the following chart shows how the various combinations of the `cache_type`, `block_size` (in MiB), and `fill` (bool) arguments to `s3fs.S3FileSystem.open` affect read performance of `.h5` files.
I gathered these metrics in the context of running an algorithm for subsetting a set of `.h5` GEDI L4A files in basic Python multiprocessing code (i.e., no use of any specific libraries for scaling, such as dask, ray, etc.). The gist of the code is that it takes a list of `.h5` files in S3 (in this case, ~1200), and for each file it reads (via `h5py`) a handful of datasets (the same for each file) out of the dozens available, doing so across 32 CPUs, each with 64 GiB of RAM (far more than actually needed).
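To give a rough idea of the per-file work, here is a simplified sketch (not the actual GEDI Subsetting code; the path, dataset names, and sizes are illustrative):

```python
import h5py
import s3fs

def subset_one(url, cache_type="blockcache", block_size_mib=8, fill=False):
    """Read a handful of datasets from one .h5 file directly out of S3."""
    fs = s3fs.S3FileSystem()  # credentials come from the environment
    with fs.open(url, mode="rb", cache_type=cache_type,
                 block_size=block_size_mib * 2**20, fill_cache=fill) as f:
        with h5py.File(f, mode="r") as h5:
            # Illustrative GEDI L4A dataset names; the real code reads the
            # same handful for each file out of the dozens available.
            return {name: h5[f"BEAM0000/{name}"][:]
                    for name in ("lat_lowestmode", "lon_lowestmode", "agbd")}

# The full run maps subset_one over ~1200 S3 URLs with multiprocessing.Pool.
```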
We also compare this to simply downloading the files and reading them from the local filesystem, rather than reading directly from S3. This case is where the y-axis tick is labeled `('download', 0.0, False)`. I ran each combination ~30 times (some combos had 1 or 2 failed jobs).
As you can see, all uses of cache type `first` perform significantly worse than anything else, including downloading. Of course, the `.h5` files are not cloud-optimized, so nothing performs more than marginally better than downloading, but that’s sort of the point.
I’m no expert on any of this stuff, but I’m happy to provide more details if anybody has questions.
Thanks for this @chuckwondo! I think including the default cache type (“readahead”) would be great to complete this chart. Is this on GitHub?
Sorry @betolink, I forgot to mention that I didn’t include the other cache types because my initial runs had already shown them to perform poorly. The ones I’ve included here, other than “first,” are the ones that showed clearly better performance than the other types, including “first.” I ended up including “first” only because of this discussion, so I could show that it should not necessarily be the recommended cache pattern for HDF5.
However, if you’d like to see proof of the poor performance (in my use case, of course) for “readahead” and some others, I’d be happy to run more jobs to gather stats. Just let me know.
The performance data is not on GitHub, as these tests were run (and the metrics were captured) on a non-public system. However, I should be able to pull together an MRE that is runnable outside of this system, requiring only Earthdata Login credentials. That is actually an existing enhancement issue we have, so that others can run the GEDI Subsetting code outside of our current system.
While scanning files, we have generally recommended “first” caching as a compromise between time and total bytes downloaded. But it depends on the files being scanned!