CMIP6 Zarr datasets on AWS — useful for interactive exploration?

guigrpa · June 10, 2021, 10:31am

Hi there,

I’m working on the development of new, interactive ways to exploit cloud-based Zarr datasets, and obviously got very excited when a lot of CMIP6 datasets were released on Amazon (AWS) in that format (cmip6-pds bucket).

However, looking closely into the ASDI CMIP6 buckets, I’ve seen that the arrays are chunked only along the time dimension, and that the chunks are quite large (10-100 MB). This makes fast, interactive analysis very hard, since (as far as I know) obtaining a time series for a single location would require downloading every chunk of the variable, even in Python with xarray; from JavaScript, it would be even more of a show-stopper.
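To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The array shape, chunk shape and dtype below are hypothetical (they are not read from the cmip6-pds bucket), and the sizes are uncompressed, so real downloads would be smaller by whatever the compression ratio is.

```python
import math

def chunks_touched(selection, chunks):
    """Count the chunks intersected by a box-shaped selection.

    selection: one (start, stop) pair per dimension, stop exclusive.
    chunks:    chunk length per dimension.
    """
    n = 1
    for (start, stop), c in zip(selection, chunks):
        first = start // c          # index of the first chunk hit
        last = (stop - 1) // c      # index of the last chunk hit
        n *= last - first + 1
    return n

def download_cost(selection, chunks, itemsize=4):
    """Return (chunks fetched, bytes fetched, bytes actually wanted)."""
    n = chunks_touched(selection, chunks)
    chunk_bytes = itemsize * math.prod(chunks)  # uncompressed chunk size
    useful = itemsize * math.prod(stop - start for start, stop in selection)
    return n, n * chunk_bytes, useful

# Hypothetical monthly float32 field with dimensions (time, lat, lon).
SHAPE = (1980, 180, 360)
point_series = [(0, SHAPE[0]), (90, 91), (180, 181)]  # all times, one grid cell

# Chunked along time only: each chunk holds 300 full global maps (~78 MB).
time_chunked = download_cost(point_series, chunks=(300, 180, 360))

# Alternative layout: the full time axis in small 10x10 spatial tiles (~0.8 MB).
space_chunked = download_cost(point_series, chunks=(1980, 10, 10))

for name, (n, fetched, useful) in [("time-chunked", time_chunked),
                                   ("space-chunked", space_chunked)]:
    print(f"{name}: {n} chunks, {fetched / 1e6:.1f} MB fetched "
          f"for {useful / 1e3:.1f} kB of data")
```

With these assumed numbers, the time-chunked layout fetches every chunk of the variable to deliver a few kilobytes, while the space-chunked layout needs a single small chunk. The trade-off reverses for map-style queries (one time step, whole globe).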

Maybe we’re missing something? We thought about HTTP range requests (à la COG), but (1) I’m not sure they’re supported in Zarr or Zarr libraries; and (2) for geographic subsetting (say we want a small AOI) we would still be sending a lot of requests (one for each lat coordinate, since rows at different latitudes won’t be stored contiguously in the chunk).
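Point (2) can be illustrated with a small sketch that lists the byte ranges an AOI would map to. It assumes an uncompressed, C-ordered (row-major) float32 chunk with dimensions (time, lat, lon); the chunk shape and AOI indices are hypothetical. If the chunks are compressed, as Zarr chunks usually are, these offsets would not apply to the stored bytes at all.

```python
def byte_ranges(chunk_shape, selection, itemsize=4):
    """List (offset, length) byte ranges covering a box inside one chunk.

    Assumes an uncompressed, C-ordered chunk with dimensions (time, lat, lon).
    selection holds one (start, stop) pair per dimension, stop exclusive.
    """
    _, nlat, nlon = chunk_shape
    (t0, t1), (y0, y1), (x0, x1) = selection
    run = (x1 - x0) * itemsize  # only the lon axis is contiguous in memory
    ranges = []
    for t in range(t0, t1):
        for y in range(y0, y1):
            offset = ((t * nlat + y) * nlon + x0) * itemsize
            ranges.append((offset, run))
    return ranges

CHUNK = (300, 180, 360)  # hypothetical chunk shape

# Small AOI at a single time step: 10 latitude rows x 20 longitude columns.
aoi = byte_ranges(CHUNK, [(0, 1), (50, 60), (100, 120)])

# Time series at a single grid cell across the whole chunk.
series = byte_ranges(CHUNK, [(0, 300), (90, 91), (180, 181)])

print(len(aoi), "ranges for the AOI, first:", aoi[0])
print(len(series), "ranges for the point time series")
```

Under these assumptions the AOI costs one range per latitude row, and the point time series is worse: one 4-byte range per time step in every chunk.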

Any ideas? Thanks in advance!

rabernat · June 10, 2021, 12:11pm

Thanks for this interesting and useful question @guigrpa – and welcome to the forum!

I’ll try to write a detailed response within a few days. In the meantime, this other post contains many points that are relevant to your question:
