Without getting into much detail, the following chart shows how the various combinations of the cache_type, block_size (in MiB), and fill_cache (bool) arguments to s3fs.S3FileSystem.open affect the read performance of .h5 files.
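For concreteness, here is roughly how one (cache_type, block_size, fill_cache) combination gets passed to open; the bucket/key below is hypothetical, and note that s3fs takes block_size in bytes, so the MiB values from the chart are multiplied out:

```python
import s3fs

fs = s3fs.S3FileSystem()

# One example combination, e.g. ("blockcache", 8 MiB, True).
with fs.open(
    "s3://my-bucket/GEDI04_A_example.h5",  # hypothetical object
    mode="rb",
    cache_type="blockcache",
    block_size=8 * 2**20,  # 8 MiB, in bytes
    fill_cache=True,
) as f:
    header = f.read(1024)  # reads go through the chosen caching strategy
```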
I gathered these metrics in the context of running an algorithm for subsetting a set of .h5 GEDI L4A files in basic Python multiprocessing code (i.e., no use of any specific libraries for scaling, such as dask, ray, etc.). The gist of the code is that it takes a list of .h5 files in S3 (in this case, ~1200), and for each file it reads (via h5py) a handful of datasets (the same handful for each file) out of the dozens available, doing so across 32 CPUs, each with 64 GiB of RAM (far more than actually needed).
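In case it helps, here's a stripped-down sketch of that gist; the dataset paths and URL list are hypothetical stand-ins for what the real code uses:

```python
import multiprocessing as mp

import h5py
import s3fs

# Hypothetical dataset paths; the real job reads the same handful of
# datasets out of every file.
DATASETS = ["BEAM0101/lat_lowestmode", "BEAM0101/agbd"]

def subset_one(url: str) -> dict:
    """Read the target datasets from one .h5 file directly in S3."""
    # Construct the filesystem in the worker rather than sharing one
    # instance across forked processes.
    fs = s3fs.S3FileSystem()
    with fs.open(url, mode="rb", cache_type="blockcache",
                 block_size=8 * 2**20, fill_cache=True) as f:
        with h5py.File(f, mode="r") as h5:
            return {name: h5[name][:] for name in DATASETS}

if __name__ == "__main__":
    urls: list[str] = []  # the ~1200 s3:// URLs of the GEDI L4A files
    with mp.Pool(processes=32) as pool:
        results = pool.map(subset_one, urls)
```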
I also compared this to simply downloading the files and reading them from the local filesystem, rather than reading directly from S3. This case is the one where the y-axis tick is labeled ('download', 0.0, False). I ran each combination ~30 times (some combinations had 1 or 2 jobs fail).
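The download baseline looks roughly like this (same hypothetical DATASETS as above; fs.get is a plain full-object download):

```python
import os
import tempfile

import h5py
import s3fs

DATASETS = ["BEAM0101/lat_lowestmode", "BEAM0101/agbd"]  # hypothetical, as above

def subset_one_downloaded(url: str) -> dict:
    """Baseline: download the whole file to local disk, then read it there."""
    fs = s3fs.S3FileSystem()
    local = os.path.join(tempfile.gettempdir(), os.path.basename(url))
    fs.get(url, local)  # full-file download, no byte-range reads
    try:
        with h5py.File(local, mode="r") as h5:
            return {name: h5[name][:] for name in DATASETS}
    finally:
        os.remove(local)
```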
As you can see, all combinations using cache_type="first" perform significantly worse than everything else, including downloading. Of course, the .h5 files are not cloud-optimized, so nothing performs more than marginally better than downloading, but that’s sort of the point.
I’m no expert on any of this stuff, but I’m happy to provide more details if anybody has questions.
