I am working on a project that would benefit from performant, cloud-native access to some NetCDF4 datasets that are continuously updated. Let’s take MERRA-2 as an example — that is updated daily.
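Here is a simplified sketch of the recipe I have in mind with pangeo-forge-recipes (the file path, date range, and chunk sizes below are placeholders, not my real configuration):

```python
import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# One hourly MERRA-2 file per day, already on the HPC filesystem (path is illustrative only)
def make_path(time):
    return f"/archive/MERRA2/{time:%Y/%m}/MERRA2.tavg1_2d_slv_Nx.{time:%Y%m%d}.nc4"

dates = pd.date_range("1980-01-01", "2023-12-31", freq="D")
pattern = FilePattern(make_path, ConcatDim("time", keys=dates, nitems_per_file=24))

recipe = XarrayZarrRecipe(
    pattern,
    target_chunks={"time": 24 * 5},  # several days per chunk for better time-series reads
)

# Executed on the HPC with one of the built-in executors, e.g.:
# recipe.to_function()()       # serial
# recipe.to_dask().compute()   # Dask, inside a SLURM job
```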
(That’s intended to run on a NASA internal HPC where the MERRA-2 NetCDFs already exist.)
That creates the initial archive. However, what is the best way to now update the data store? Is there a way to do this with Pangeo Forge tools? Or do I have to fall back on using Xarray’s to_zarr, appending along the time dimension? How will that interact with my custom chunking strategy (which combines several days to improve time series performance)?
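(By the fallback I mean roughly this, run once per newly published day; the paths here are placeholders:)

```python
import xarray as xr

# Open the newly published daily NetCDF and append it to the existing store along time
new_day = xr.open_dataset("/archive/MERRA2/2024/01/MERRA2.tavg1_2d_slv_Nx.20240102.nc4")
new_day.to_zarr("merra2.zarr", mode="a", append_dim="time")
```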
Also, a few more general questions:
(1) Does this even make sense as an overall approach? Should I be just grabbing the NetCDFs instead and “cloud-nativizing” them using something like Kerchunk? What would that look like in an operational setting?
(2) Are there any good examples of workflows for producing cloud-native archives quasi-operationally that I could look at for ideas and inspiration? Especially for gridded, multi-dimensional data?
(3) A similar question: Is there any way to resume an interrupted Pangeo Forge recipe execution step (e.g., if a SLURM job gets killed 2/3 of the way through)? Any examples you could point me to?
Thanks all! Looking forward to hearing your thoughts.
Supporting these workloads is definitely a goal of the Pangeo Forge project. We hope to support the sort of “streaming”-style updates you need for MERRA-2 very soon. See
for more context. We are very close.
If a daily append does not line up with your chunking (several days per chunk), you may have to re-process all of the data for the particular chunk the new day falls into.
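To make that concrete (toy numbers: hourly data and 5-day chunks; adjust to your real layout):

```python
# Toy illustration: hourly data with 5-day chunks => 120 time steps per chunk
steps_per_day = 24
chunk_len = 5 * steps_per_day

def append_rewrites_existing_chunk(n_existing_steps: int) -> bool:
    """A daily append avoids touching old data only if it starts on a chunk boundary."""
    return n_existing_steps % chunk_len != 0

print(append_rewrites_existing_chunk(10 * chunk_len))       # False: the new day opens a fresh chunk
print(append_rewrites_existing_chunk(10 * chunk_len + 24))  # True: the new day lands in a partial chunk
```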
Kerchunk could be helpful here. We are still working out the best way to use it in this context. Please experiment and let us know!
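Roughly, an operational Kerchunk setup might look like the sketch below (untested; paths are placeholders). The nice property is that each new NetCDF only needs its own small reference file, and recombining the references is cheap compared with rewriting Zarr chunks:

```python
import glob
import json

import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# 1. One reference set per NetCDF file (a daily update only adds one more of these)
refs = []
for path in sorted(glob.glob("/archive/MERRA2/**/*.nc4", recursive=True)):
    with fsspec.open(path) as f:
        refs.append(SingleHdf5ToZarr(f, path).translate())

# 2. Combine the per-file references along time into one virtual dataset
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()
with open("merra2_refs.json", "w") as f:
    json.dump(combined, f)

# 3. Open lazily as if it were a Zarr store
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "merra2_refs.json"},
    },
)
```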
I do not know of any such examples right now. Hopefully we will have some in Pangeo Forge Cloud soon.
This is executor-dependent. Dask does not support checkpointing; other executors (Prefect and Beam) do, I think.