Thinking out loud after attending ESIP and the USGEO Data Management Working Group Cloud Recommendations meeting this week - the question of which is the 'best' cloud-optimized data format keeps coming up. It's repeatedly said that there is no 'best' and that it's really use-case specific - but it would be great to have a thread to document what those use cases actually are, especially as we now receive questions like 'why zarr when kerchunk exists?' and 'would there ever be a mission that uses zarr as its native format?'. I am thinking of this from a data provider perspective, and not just in terms of performance but also in terms of the costs associated with potentially hosting duplicate archives in different formats for different use cases.
it would be great to have a thread to document what those use cases actually are
Great question!
Our use-case is training machine learning models on satellite data and numerical weather predictions. We need to be able to read random crops of the training data very rapidly. For example, each ML training example might be 8 timesteps x 128 pixels x 128 pixels x 4 channels of satellite data. And each example will be taken from a random location in space and time, sampled uniformly from the entire dataset (which might be on the order of 100 TB in size). To keep our GPUs fed with data, we need to load on the order of 500 examples per second (i.e. 500 random crops, each of size 8 x 128 x 128 x 4). That’s not that much bandwidth (maybe 250 MB/sec, assuming each pixel is 8 bits, and assuming little “wasted” data is read). But it’s a lot of IO operations per second.
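For concreteness, a minimal sketch of that access pattern with xarray and a Zarr store (the store path and the dimension names time/y/x are assumptions for illustration; to get anywhere near 500 examples per second you would run many of these crop reads in parallel, e.g. across dataloader workers):

```python
import numpy as np
import xarray as xr

# Hypothetical store; credentials/anonymous-access options omitted for brevity.
ds = xr.open_zarr("s3://example-bucket/satellite.zarr", consolidated=True)

rng = np.random.default_rng()

def random_crop(ds, n_time=8, n_y=128, n_x=128):
    """Read one random 8 x 128 x 128 crop (all channels); lazy until .load()."""
    t0 = int(rng.integers(0, ds.sizes["time"] - n_time))
    y0 = int(rng.integers(0, ds.sizes["y"] - n_y))
    x0 = int(rng.integers(0, ds.sizes["x"] - n_x))
    return ds.isel(
        time=slice(t0, t0 + n_time),
        y=slice(y0, y0 + n_y),
        x=slice(x0, x0 + n_x),
    ).load()

example = random_crop(ds)
```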
Our model runs produce a collection of netCDF files that each contain a fixed number of simulation days of data. The chunk size is usually 1 in the time dimension (e.g. each simulation hour), with perhaps a week of data stored in each file. And there can be hundreds or thousands of files.
Our current approach (mostly planned at this point) is to:
Use kerchunk to create a virtual Zarr dataset from the collection of NetCDF files. This is nicer than open_mfdataset because it only needs to be done once, and doesn’t have memory issues.
Rechunk the virtual Zarr dataset and split it back into NetCDF files that each have one chunk in the time dimension (e.g. one week of data per chunk; writing each file is a serial operation, but the weekly files can all be written in parallel)
Kerchunk the rechunked NetCDF files
Distribute the rechunked NetCDF files along with the kerchunk-generated reference files, written in Parquet (a rough sketch of the kerchunk steps follows this list)
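To make the kerchunk-related steps above a bit more concrete, here is a rough sketch of the first and last steps (creating references and writing them as Parquet); the bucket, file paths and coordinate names are hypothetical, and this assumes kerchunk's refs_to_dataframe helper is available for the Parquet output:

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.df import refs_to_dataframe  # writes references in Parquet form

fs = fsspec.filesystem("s3", anon=True)
# Hypothetical bucket/prefix holding the (rechunked) NetCDF files.
paths = ["s3://" + p for p in fs.glob("example-bucket/model-output/rechunked/*.nc")]

# One reference set per NetCDF file...
singles = []
for path in paths:
    with fs.open(path) as f:
        singles.append(SingleHdf5ToZarr(f, path).translate())

# ...combined into a single virtual dataset along the time dimension.
combined = MultiZarrToZarr(
    singles, concat_dims=["time"], identical_dims=["lat", "lon"]
).translate()

# Write the combined references as Parquet to distribute alongside the files.
refs_to_dataframe(combined, "model-output-refs.parq")
```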
This may seem a bit involved on the provider side, but it’s nice from the user side:
The performance of reading the virtual dataset pointing to NetCDF files with the Zarr library is the same as if the Zarr format were used (a minimal read example follows this list)
Users without the Zarr library can use tools that ingest the rechunked NetCDF files
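On the user side, opening the distributed Parquet references looks roughly like this (paths are hypothetical, and Parquet reference support depends on having reasonably recent fsspec/kerchunk versions); users without Zarr can instead point their usual NetCDF tooling at the files directly:

```python
import xarray as xr

# "reference://" tells fsspec to read through the kerchunk reference filesystem.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "model-output-refs.parq",  # the distributed Parquet reference set
            "remote_protocol": "s3",         # where the NetCDF bytes actually live
            "remote_options": {"anon": True},
        },
    },
)
print(ds)
```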
Thanks for sharing your workflow. Do you have any more detailed documentation of this workflow? I’d be interested to know whether we might be able to copy it in our workflows at CEDA (UK).
Thanks for opening this thread @briannapagan. I think this sort of discussion is especially timely given the spin up of CMIP7.
For that particular use case it seems that a full zarr mirror of the data is not realistic (since it would basically double the amount of data). We have been discussing two options (within our CMIP6 pangeo working group and within the CMIP7 Data Access Task Team):
Creating a zarr mirror of only 'the most important' datasets, which would probably exclude many 'niche' science use cases from using the cloud-optimized data.
Recommending stricter checking of e.g. the chunking structure of the postprocessed netcdf files. (This is effectively what @rsignell is achieving in his example, if I understand correctly.) Ideally the 'kerchunking' is then done by the data provider too, but it is not strictly necessary: if those files are hosted in a 'cloud-native' form with an S3-like interface, someone else could do the work of creating the kerchunk reference files, which would presumably be much less work/resource intensive than transforming the data into mirrored zarr stores. (A small example of such a chunking check follows this list.)
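As a small illustration of what that chunking check could look like on the provider side (the file name is hypothetical; for netCDF4 files opened with xarray, the on-disk chunking shows up in each variable's encoding):

```python
import xarray as xr

ds = xr.open_dataset("tas_day_example_ssp585.nc")  # hypothetical postprocessed file

for name, var in ds.data_vars.items():
    chunks = var.encoding.get("chunksizes")  # None means the variable is stored contiguously
    print(f"{name}: dims={var.dims} shape={var.shape} chunks={chunks}")
```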
I think that the second option is much more desirable. From what I can see now, this would give the maximum benefit to the user (as Rich mentioned) and be conservative on resources, with the added cost of checking/postprocessing on the data provider side. From my end this seems like a good trade-off, but I am curious to hear whether this seems like a realistic solution from the data providers' side of things.
I should mention that deciding on an 'optimal chunking structure' is a key part of this approach, but it is non-trivial due to the many different use cases requiring different chunking schemes.
@jbusecke - very much agree on this. Getting the chunking scheme right for us is a challenge that seems to involve a bit of experimentation and some heuristics.
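One piece of arithmetic that tends to anchor those heuristics is simply the uncompressed size of a candidate chunk; a common rule of thumb (not a hard rule) is to aim for chunks somewhere in the tens to low hundreds of MB. The grid shape and dtype below are purely illustrative:

```python
import numpy as np

def chunk_size_mib(chunk_shape, dtype="float32"):
    """Rough uncompressed size of one chunk in MiB."""
    return np.prod(chunk_shape) * np.dtype(dtype).itemsize / 2**20

# Illustrative daily 720 x 1440 global grid, chunked along time:
print(chunk_size_mib((1, 720, 1440)))    # ~4 MiB   -> many small requests for a long time series
print(chunk_size_mib((365, 720, 1440)))  # ~1.4 GiB -> heavy reads to pull out a single map
print(chunk_size_mib((30, 720, 1440)))   # ~119 MiB -> a possible middle ground
```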
The size and volume of your data can influence the choice of format. Smaller datasets may not benefit as much from complex optimizations as larger ones, so consider the trade-off between format complexity and the size of your data. Some formats are also better suited to specific types of data structures: for example, if you are dealing with multi-dimensional arrays, Zarr and HDF5 might be more appropriate, while GeoJSON or Parquet could be better for tabular data.
By documenting the specific use cases, performance characteristics, and cost implications of different formats, you can provide valuable guidance to data providers and users. This documentation can help organizations make informed decisions about which formats to adopt based on their unique requirements and constraints. Additionally, it can promote transparency and collaboration within the community, ultimately leading to better data management practices.