I am working on a project that would benefit from performant, cloud-native access to some NetCDF4 datasets that are continuously updated. Let's take MERRA-2 as an example, since it is updated daily.
I already have a draft workflow for producing a chunked MERRA-2 data store in Zarr format: 01-create-zarr.py in the ashiklom/veda-data-processing repo on GitHub (main branch).
(That’s intended to run on a NASA internal HPC where the MERRA-2 NetCDFs already exist.)
That creates the initial archive. However, what is the best way to update the data store afterwards? Is there a way to do this with Pangeo Forge tools? Or do I have to fall back on using Xarray's to_zarr, appending along the time dimension? How will that interact with my custom chunking strategy, which combines several days per chunk to improve time-series read performance?
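For concreteness, here is a minimal sketch of the append-based update I have in mind, using a small synthetic dataset in place of MERRA-2. The store path, variable name, grid, and chunk sizes are placeholders, not my real configuration:

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_day(date):
    """Fake one daily granule on a coarse lat/lon grid (stand-in for MERRA-2)."""
    time = pd.date_range(date, periods=24, freq="h")
    data = np.random.rand(len(time), 10, 20).astype("float32")
    return xr.Dataset(
        {"T2M": (("time", "lat", "lon"), data)},
        coords={
            "time": time,
            "lat": np.linspace(-90, 90, 10),
            "lon": np.linspace(-180, 180, 20),
        },
    )

store = "merra2-demo.zarr"  # placeholder path

# Initial write: chunk several days together along time (here 5 days x 24 steps).
days = pd.date_range("2023-01-01", periods=5)
initial = xr.concat([make_day(d) for d in days], dim="time")
initial.chunk({"time": 120, "lat": 10, "lon": 20}).to_zarr(store, mode="w")

# Daily update: append the new granule along the time dimension.
# Each append lands in the partially filled chunk at the end of the time axis,
# so that tail chunk gets rewritten every day until it fills up, which is the
# interaction with multi-day chunking that I am unsure about.
new_day = make_day("2023-01-06")
new_day.to_zarr(store, mode="a", append_dim="time")
```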
Also, a few more general questions:
(1) Does this even make sense as an overall approach? Should I just be grabbing the NetCDFs instead and "cloud-nativizing" them using something like Kerchunk (see the rough sketch after this list)? What would that look like in an operational setting?
(2) Are there any good examples of workflows for producing cloud-native archives quasi-operationally that I could look at for ideas and inspiration? Especially for gridded, multi-dimensional data?
(3) A similar question: Is there any way to resume an interrupted Pangeo Forge recipe execution step (e.g., if a SLURM job gets killed 2/3 of the way through)? Any examples you could point me to?
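To make (1) concrete, here is roughly what I imagine the Kerchunk route looking like. The file glob and output path are made up, and I have not tested this against the actual MERRA-2 granules:

```python
import glob
import json

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Per-file reference sets: map Zarr keys to byte ranges inside each NetCDF4 file.
refs = []
for path in sorted(glob.glob("/data/merra2/*.nc4")):  # placeholder HPC path
    with open(path, "rb") as f:
        refs.append(SingleHdf5ToZarr(f, path, inline_threshold=300).translate())

# Combine the per-file references along time into one virtual Zarr dataset.
combined = MultiZarrToZarr(
    refs,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
).translate()

with open("merra2_combined.json", "w") as f:
    json.dump(combined, f)

# The daily update would then be: kerchunk the new granule and re-run (or
# incrementally extend) the combine step, rather than rewriting any data.
```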
Thanks all! Looking forward to hearing your thoughts.