@martindurant
I wonder if I’m missing something obvious? I wanted to compare accessing netCDF versus Zarr on S3, from both the same and a different region than the JupyterHub running the code. I’m looking at the AWS GOES-16 data. My notebook is below. Accessing the netCDF is 40x slower than Zarr. Does this sound right? (To create the Zarr store I simply called .to_zarr on the netCDF data I had read in, so no re-chunking or anything fancy.)
Zarr has been specifically developed to be friendly to this kind of operation: the metadata required to “open” the dataset is stored in a few small fixed files (or only one file, if using consolidated metadata), and the data chunks can be accessed completely in parallel, since each is a separate file.
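For example, the consolidated-metadata path looks roughly like this (a sketch; the bucket path is a placeholder):

```python
import s3fs
import xarray as xr
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map("my-bucket/goes16.zarr", s3=fs)  # placeholder path

# Consolidation gathers every .zgroup/.zarray/.zattrs into one .zmetadata
# object, so "opening" the dataset needs only a single small GET.
zarr.consolidate_metadata(store)

# After that, the chunk reads themselves are independent objects that can
# be fetched in parallel.
ds = xr.open_zarr(store, consolidated=True)
```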
The netCDF format allows for a lot of variance in how the data and metadata are organised, so I can’t say too much, except that it doesn’t completely surprise me that it’s slow, especially since you are opening multiple files and must first get the metadata from each of them. The s3fs library uses a readahead caching model by default, so if reading metadata involves skipping back and forth in the file, you can fare badly. To debug, you would want logging for each seek, read and “fetch” (i.e., S3 range call).
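For example, to see every range request s3fs makes, and to try a different caching strategy (a sketch; the logger name and cache_type values depend on your s3fs/fsspec versions, and the object key is a placeholder):

```python
import logging
import s3fs

# s3fs logs through the standard logging module; DEBUG level shows each
# seek/read and the resulting S3 range calls.
logging.basicConfig()
logging.getLogger("s3fs").setLevel(logging.DEBUG)

fs = s3fs.S3FileSystem(anon=True)

# The default is readahead caching; for metadata-heavy access patterns it
# can be worth comparing other fsspec cache types (e.g. "bytes" or "none").
with fs.open("noaa-goes16/path/to/file.nc", "rb", cache_type="bytes") as f:
    header = f.read(8)  # each underlying fetch now shows up in the log
```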
Check out https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314 for some tools specifically to address this issue.
Thanks @bendichter for pointing to our Medium blog post on making NetCDF4/HDF5 files cloud-performant!
You need a reasonable chunk size for this approach to work, however. If you are lucky enough to have your NetCDF files with chunk sizes around 100MB (or have the ability to rechunk and rewrite them), then indeed you can just use the technique we proposed on existing files.
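For example, the rechunk-and-rewrite route with xarray/Dask might look like this (a sketch; file, variable and dimension names are hypothetical, the only point is the ~100MB target):

```python
import xarray as xr

ds = xr.open_dataset("goes16_rad.nc")  # hypothetical source file

# Aim for on-disk chunks on the order of 100MB: for int16 data,
# 7000 * 7000 elements * 2 bytes is roughly 100MB per chunk.
ds = ds.chunk({"y": 7000, "x": 7000})

# to_zarr writes the Dask chunking as the Zarr chunking, so the store on
# S3 ends up with cloud-friendly chunk sizes.
ds.to_zarr("goes16_rad.zarr", mode="w")
```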
Unfortunately if the chunking is very small on your NetCDF4/HDF5 files, as it often is, there is no way currently to make them able to compete with appropriately chunked Zarr.
In this case it appears the GOES files on AWS have very tiny chunk sizes, only 226x226 on int16 data variables, or about 0.1MB. (I got this from cell [79] in this notebook by @rabernat - note cells are not in order)
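If you want to check the chunking of a file yourself, something like this works (a sketch; the object key and variable name are placeholders, and it assumes an h5py recent enough to accept file-like objects):

```python
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=True)

with fs.open("noaa-goes16/path/to/file.nc", "rb") as f:
    with h5py.File(f, "r") as h5:
        dset = h5["Rad"]  # variable name is a placeholder
        print(dset.shape, dset.chunks, dset.dtype)
        # e.g. chunks of (226, 226) int16 -> 226 * 226 * 2 bytes ~ 0.1MB
```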
Unless we are able to enhance our software/metadata to read many of these chunks at once, we will not be able to significantly improve the performance using the approach described in the blog post. I know @ajelenak is interested in trying to add this capability, but it would require a bit of funding to carry out the work.
@rsignell Thanks for the additional information! Very good to know. I think many HDF5 users are using heuristics to decide their chunking parameters, just choosing something reasonable so they can leverage compression or enable expansion of the dataset. In these cases, they might be able to set their initial write chunk size appropriately for this method if given proper guidance. I don’t know if that applies in this particular case.
@bendichter, yes, we need data providers to write appropriate chunk sizes (e.g. ~100MB). Users can specify a different chunking in tools like xarray, but those provider-specified chunks still need to be read, so you get the same poor performance.
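To make that concrete (a sketch with hypothetical file/variable names): asking xarray for bigger chunks only changes the Dask chunking, not the reads underneath.

```python
import xarray as xr

# You can ask for large Dask chunks at read time...
ds = xr.open_dataset("goes16_rad.nc", chunks={"y": 2260, "x": 2260})
print(ds["Rad"].data.chunksize)  # the Dask chunking you asked for

# ...but each 2260 x 2260 Dask chunk is still assembled from one hundred
# 226 x 226 HDF5 chunks on disk, so every task still pays for all of those
# small reads/range requests.
```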
@rsignell yes, I agree, we should try to inform data providers to chunk this way if cloud access will be important. Also, we have an extension of your work here which allows a user to specify pseudo-chunks for contiguous datasets. In that method you can specify whatever chunk size you want. Would be happy to discuss if you are interested.
@bendichter, wait, so you’ve already implemented the pseudo-chunk-reading-multiple-contiguous-chunk approach I mentioned above that we should fund?
@rsignell not exactly, I don’t think. HDF5 lets you store chunked data, and it also lets you store contiguous, non-chunked datasets. In HDF5, if you store data as contiguous, you can’t use compression, but the advantage is that you can read arbitrary slices of your data without reading anything outside the slice. Zarr cannot do this; it must read entire chunks (the same as HDF5 chunked datasets), which is fine as long as the chunks are reasonably sized.

If you have a contiguous dataset and you want to read it with Zarr, the simplest approach is to say it is a dataset with a single big chunk. The problem with that is that when Zarr reads a chunk of this dataset, it reads the whole dataset! To get around this, we made “pseudo chunks,” which allow you to read large contiguous datasets as if they were chunked. These currently only work for contiguous datasets, so they wouldn’t help in cases where the chunks are too small. I hope that makes sense. Happy to jump on a call to discuss further.
See https://docs.google.com/document/d/118A_htaQIvW0dZwhPBvSJc500ODlTSTRslyB89MLYG8/edit?usp=sharing for a more detailed description and the other ways we extended the HDF5 Zarr reader.
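For intuition, the essence of the pseudo-chunk idea for contiguous datasets can be sketched like this (illustration only, not the actual API from the document above; it assumes a local, uncompressed, contiguous HDF5 dataset):

```python
import numpy as np
import h5py

def read_outer_slice(path, dset_name, start, stop):
    """Read rows [start, stop) of a contiguous (non-chunked) HDF5 dataset
    by computing the byte range directly -- the idea behind pseudo chunks."""
    with h5py.File(path, "r") as f:
        dset = f[dset_name]
        assert dset.chunks is None, "only contiguous datasets work this way"
        offset = dset.id.get_offset()                # start of the raw data
        row_bytes = int(np.prod(dset.shape[1:])) * dset.dtype.itemsize
        shape = (stop - start,) + dset.shape[1:]
        dtype = dset.dtype

    # One contiguous byte range per outer-dimension slice; on S3 this would
    # become a single range request instead of a local seek/read.
    with open(path, "rb") as raw:
        raw.seek(offset + start * row_bytes)
        buf = raw.read((stop - start) * row_bytes)

    return np.frombuffer(buf, dtype=dtype).reshape(shape)
```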
Note that compression is optional in zarr too, so you could easily read only contiguous intra-chunk sets of bytes if you wanted to (but this is not in the current implementation).
Also, naturally, the idea of slicing in this context only applies to the outermost dimension, and only without strides; otherwise you must always read bytes you don’t want.
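A quick numpy illustration of why only unstrided slices along the outermost dimension map to a single byte range:

```python
import numpy as np

a = np.zeros((100, 200, 300), dtype="int16")  # C (row-major) order

# A slice along the outermost axis is one contiguous run of bytes,
# i.e. a single range request.
print(a[10:20].flags["C_CONTIGUOUS"])        # True

# A slice along an inner axis, or a strided slice, scatters the wanted
# bytes among bytes you don't want.
print(a[:, 10:20, :].flags["C_CONTIGUOUS"])  # False
print(a[::2].flags["C_CONTIGUOUS"])          # False
```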
By the way, s3fs now supports async/concurrent access to whole chunks. When the zarr chunks are small enough that establishing connections is a significant overhead, this has the potential to massively increase throughput, but the necessary changes are not in zarr yet. The same may be true for fetching (smallish) byte ranges from HDF5 files on S3.
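With a recent (async-based) s3fs, the concurrent fetch of many small objects looks roughly like this (the bucket layout and key names are made up):

```python
import s3fs

fs = s3fs.S3FileSystem(anon=True)

# Hypothetical chunk keys of a Zarr array stored on S3.
keys = [f"my-bucket/goes16.zarr/Rad/{i}.{j}"
        for i in range(10) for j in range(10)]

# cat() on a list of paths issues the GETs concurrently on one event loop,
# instead of paying the connection overhead once per blocking call.
chunks = fs.cat(keys)  # dict mapping key -> bytes
```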
> you could easily read only contiguous intra-chunk sets of bytes if you wanted to (but this is not in the current implementation).
Yes, we could do a standard numpy memmap of large uncompressed chunks; IMHO it would be useful to have this feature in Zarr.
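For a local, uncompressed chunk that could look something like this (a sketch; the chunk path, dtype and shape are made up):

```python
import numpy as np

# With no compressor, a Zarr chunk on disk is just the raw C-ordered array,
# so a memmap can pull out an arbitrary slice without reading the rest.
chunk = np.memmap("goes16.zarr/Rad/0.0", dtype="int16", mode="r",
                  shape=(5424, 5424))
tile = np.array(chunk[1000:1100, 2000:2100])  # only these pages are touched
```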
> s3fs now supports async/concurrent access to whole chunks
That sounds really interesting! Are there plans to incorporate these features into Zarr as well?
It just requires someone to put in the effort. Zarr has a simple loop over keys when more than one key is required to build a slice - that loop should be replaced by the async-based multi-getter for concurrency.
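Schematically, the change would be something like this (illustration only, not the actual Zarr internals):

```python
# Today, roughly: one blocking GET per chunk key needed for the slice.
def get_chunks_sequential(store, keys):
    return {key: store[key] for key in keys}

# With an async-capable store, the same keys could be fetched in one
# concurrent multi-get (e.g. s3fs's cat() on a list of paths), hiding the
# per-request latency when chunks are small.
def get_chunks_concurrent(fs, keys):
    return fs.cat(keys)  # dict mapping key -> bytes, fetched concurrently
```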