Title: “Optimizations for Kerchunk aggregation and Zarr I/O at scale for Machine Learning”
Invited Speaker: David Stuebe (ORCID: 0009-0000-2804-7191), Camus Energy
When: Wednesday March 6, 12PM EST
Where: Launch Meeting - Zoom
Abstract: We have recently contributed enhancements that make working with NODD GRIB weather forecasts more efficient at scale. By sharing this work with the Pangeo community we hope that folks will both find benefit and help advocate for these enhancements to be enabled in a more generalized way.
PR: asascience-open:main ← emfdavid:grib_index_aggregation (opened 03:17AM - 02 Feb 24 UTC)
# Grib Index Aggregations
The functions in this module allow building kerchunk aggregations of NODD grib2 weather forecasts quickly.
The module supports a three-step process:
1. Extract and persist metadata directly from a few arbitrary grib files for a given product such as HRRR SUBH
2. Use the metadata mapping to build an index table of every grib message from the idx text files
3. Combine the index data with the metadata to build any FMRC slice (Horizon, RunTime, ValidTime, BestAvailable)
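As an illustration of step 2, here is a minimal sketch (not the PR's actual API) of turning the `.idx` sidecar text that accompanies each NODD GRIB file into a byte-range index table. The `.idx` format is one message per line, colon-separated: message number, byte offset, `d=` date, variable, level, forecast. The function and column names are hypothetical:

```python
# Hypothetical sketch: parse a GRIB ".idx" inventory into an index table.
# Each message's byte range ends where the next message's offset begins.
import pandas as pd

def parse_idx(idx_text: str, grib_uri: str) -> pd.DataFrame:
    rows = []
    for line in idx_text.strip().splitlines():
        msg, offset, date, var, level, forecast = line.split(":")[:6]
        rows.append({
            "uri": grib_uri,            # which grib file the message lives in
            "message": int(msg),
            "offset": int(offset),      # byte offset of the message
            "date": date.removeprefix("d="),
            "varname": var,
            "level": level,
            "forecast": forecast,
        })
    df = pd.DataFrame(rows)
    # Byte length of each message; the last message's length is unknown (NaN)
    # without the file size.
    df["length"] = df["offset"].shift(-1) - df["offset"]
    return df

idx = """\
1:0:d=2023010100:REFC:entire atmosphere:anl:
2:52000:d=2023010100:TMP:2 m above ground:anl:
3:98000:d=2023010100:UGRD:10 m above ground:anl:
"""
table = parse_idx(idx, "s3://noaa-hrrr-bdp-pds/hrrr.20230101/conus/hrrr.t00z.wrfsfcf00.grib2")
print(table[["message", "varname", "offset", "length"]])
```

Because the `.idx` files are tiny text files, scanning a year of them is cheap compared to opening every GRIB file; the per-product metadata from step 1 supplies what the inventory lines alone cannot (dtype, grid, zarr layout).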
Once the metadata is created for the grib files from one complete forecast run (for instance, 48 hourly files from the 00Z HRRR SFC product), it takes less than a minute to index a whole year of forecasts in a single Python process - no parallelism required. This speeds up building the aggregations. It does not speed up reading the data (that is next).
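To make the FMRC slices in step 3 concrete, here is a small hypothetical example (synthetic data, illustrative column names, not the PR's schema) of selecting two of the slices from an index table: a fixed-horizon slice and "best available", which keeps, for each valid time, the message from the most recent model run:

```python
# Hypothetical FMRC slicing over a synthetic index table.
import pandas as pd

index = pd.DataFrame({
    "run_time": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:00", "2023-01-01 00:00",
        "2023-01-01 06:00", "2023-01-01 06:00",
    ]),
    "horizon_h": [0, 6, 12, 0, 6],
})
# valid time = model run time + forecast lead time (horizon)
index["valid_time"] = index["run_time"] + pd.to_timedelta(index["horizon_h"], unit="h")

# Horizon slice: every message at a fixed lead time (here, 6 hours out).
horizon_6h = index[index["horizon_h"] == 6]

# Best available: for each valid time, keep the message from the latest run.
best = (index.sort_values("run_time")
             .groupby("valid_time", as_index=False)
             .last())
print(best)
```

The same grouping idea extends to RunTime and ValidTime slices; since the index is just a table, each slice is a cheap pandas selection rather than a re-scan of the GRIB files.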
A [Jupyter notebook](https://gist.github.com/emfdavid/89516e3f04bd46cf283f27ec0f22eeda) provides a brief demonstration of the capability.
Camus Energy is using this operationally with GEFS, GFS and HRRR grib2 files, available on NODD hosted cloud storage buckets. There is no requirements file or docker file included in this PR. There are extensive tests that can be shared later. To run the code you must install [kerchunk](https://github.com/fsspec/kerchunk) from GitHub, as the [grib_tree](https://github.com/fsspec/kerchunk/pull/399) code is not in the version 2.2 release.
This excerpt of our production code is a prototype for the community discussion that we hope can move into Kerchunk.
PR: emfdavid:r2.17.0 ← emfdavid:parallel_chunk_getitems (opened 01:52PM - 27 Feb 24 UTC)
# Prototype Parallel get_chunkitems
This is a draft implementation that chooses a particular parallel framework suitable for my application. This is not a generalized solution that could be merged into zarr... but it does show the potential of allowing parallelism at this critical point in zarr core.
TODO:
* [ ] Add unit tests and/or doctests in docstrings
* [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
* [ ] New/modified features documented in docs/tutorial.rst
* [ ] Changes documented in docs/release.rst
* [ ] GitHub Actions have all passed
* [ ] Test coverage is 100% (Codecov passes)
20 minutes - Community Showcase
40 minutes - Showcase discussion/Community check-ins
Forgot to mention during that talk that one of the key advantages of the parallel_chunk_getitems implementation is that it is fault tolerant, returning the fill value if there is a corrupted grib file.
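A minimal sketch of the idea, assuming nothing about zarr's real internals: fetch many chunks concurrently, and substitute the array's fill value for any chunk whose read fails (for example, a corrupted GRIB message). The `fetch_chunk` stand-in and chunk keys below are hypothetical:

```python
# Hypothetical sketch: parallel chunk fetching with fill-value fault tolerance.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

FILL_VALUE = np.nan   # the array's declared fill value
CHUNK_SHAPE = (4,)

def fetch_chunk(key: str) -> np.ndarray:
    # Stand-in for a remote byte-range read + GRIB decode; one key fails
    # to simulate a corrupted message.
    if key == "0.2":
        raise IOError("corrupted grib message")
    return np.full(CHUNK_SHAPE, float(key.split(".")[-1]))

def safe_fetch(key: str) -> np.ndarray:
    try:
        return fetch_chunk(key)
    except Exception:
        # Fault tolerance: a bad chunk becomes fill values instead of
        # failing the whole read.
        return np.full(CHUNK_SHAPE, FILL_VALUE)

keys = ["0.0", "0.1", "0.2", "0.3"]
# Threads suit this I/O-bound workload: each fetch waits on object storage.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = dict(zip(keys, pool.map(safe_fetch, keys)))

print({k: v[0] for k, v in chunks.items()})
```

The trade-off is the one noted above: corruption surfaces as fill values in the output rather than an exception, which is desirable for resilient ML training pipelines but should be monitored if silent gaps matter.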
A version of @emfdavid’s approach is now documented in the Kerchunk docs: Aggregation special cases — kerchunk documentation
(I can’t take any credit! I didn’t work on this documentation! I’m just linking it for anyone who stumbles on this Pangeo thread in the future…)