I am working on a project that would benefit from performant, cloud-native access to some NetCDF4 datasets that are continuously updated. Let's take MERRA-2 as an example, since it is updated daily.
I already have a draft workflow for producing a chunked MERRA-2 data store in Zarr format: 01-create-zarr.py in the ashiklom/veda-data-processing repo on GitHub (main branch).
(That’s intended to run on a NASA internal HPC where the MERRA-2 NetCDFs already exist.)
That creates the initial archive. However, what is the best way to update the data store afterwards? Is there a way to do this with Pangeo Forge tools? Or do I have to fall back on using Xarray's to_zarr, appending along the time dimension? How will that interact with my custom chunking strategy, which combines several days per chunk to improve time-series read performance?
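For concreteness, here is a minimal sketch of the append-based update I have in mind, using a small synthetic dataset in place of MERRA-2. The store path, variable name, grid, and chunk sizes are placeholders, not my real configuration:

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_day(date):
    """Fake one daily granule on a coarse lat/lon grid (stand-in for MERRA-2)."""
    time = pd.date_range(date, periods=24, freq="h")
    data = np.random.rand(len(time), 10, 20).astype("float32")
    return xr.Dataset(
        {"T2M": (("time", "lat", "lon"), data)},
        coords={
            "time": time,
            "lat": np.linspace(-90, 90, 10),
            "lon": np.linspace(-180, 180, 20),
        },
    )

store = "merra2-demo.zarr"  # placeholder path

# Initial write: chunk several days together along time (here 5 days x 24 steps).
days = pd.date_range("2023-01-01", periods=5)
initial = xr.concat([make_day(d) for d in days], dim="time")
initial.chunk({"time": 120, "lat": 10, "lon": 20}).to_zarr(store, mode="w")

# Daily update: append the new granule along the time dimension.
# Each append lands in the partially filled chunk at the end of the time axis,
# so that tail chunk gets rewritten every day until it fills up, which is the
# interaction with multi-day chunking that I am unsure about.
new_day = make_day("2023-01-06")
new_day.to_zarr(store, mode="a", append_dim="time")
```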
Also, a few more general questions:
(1) Does this even make sense as an overall approach? Should I just be grabbing the NetCDFs instead and "cloud-nativizing" them using something like Kerchunk (see the rough sketch after this list)? What would that look like in an operational setting?
(2) Are there any good examples of workflows for producing cloud-native archives quasi-operationally that I could look at for ideas and inspiration? Especially for gridded, multi-dimensional data?
(3) A similar question: Is there any way to resume an interrupted Pangeo Forge recipe execution step (e.g., if a SLURM job gets killed 2/3 of the way through)? Any examples you could point me to?
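To make (1) concrete, here is roughly what I imagine the Kerchunk route looking like. The file glob and output path are made up, and I have not tested this against the actual MERRA-2 granules:

```python
import glob
import json

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Per-file reference sets: map Zarr keys to byte ranges inside each NetCDF4 file.
refs = []
for path in sorted(glob.glob("/data/merra2/*.nc4")):  # placeholder HPC path
    with open(path, "rb") as f:
        refs.append(SingleHdf5ToZarr(f, path, inline_threshold=300).translate())

# Combine the per-file references along time into one virtual Zarr dataset.
combined = MultiZarrToZarr(
    refs,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
).translate()

with open("merra2_combined.json", "w") as f:
    json.dump(combined, f)

# The daily update would then be: kerchunk the new granule and re-run (or
# incrementally extend) the combine step, rather than rewriting any data.
```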
Thanks all! Looking forward to hearing your thoughts.