Title: “Optimizations for Kerchunk aggregation and Zarr I/O at scale for Machine Learning”
Invited Speaker: David Stuebe (ORCID: 0009-0000-2804-7191), Camus Energy
When: Wednesday March 6, 12PM EST
Where: Launch Meeting - Zoom
Abstract: We have recently contributed enhancements that make working with NODD GRIB weather forecasts more efficient at scale. By sharing this work with the Pangeo community we hope that folks will both find benefit and help advocate for these enhancements to be enabled in a more generalized way.
PR: asascience-open:main ← emfdavid:grib_index_aggregation (opened 03:17AM - 02 Feb 24 UTC)
# Grib Index Aggregations
The functions in this module allow building kerchunk aggregations of NODD grib2 weather forecasts quickly.
The module supports a three-step process:
1. Extract and persist metadata directly from a few arbitrary grib files for a given product such as HRRR SUBH
2. Use the metadata mapping to build an index table of every grib message from the idx text files
3. Combine the index data with the metadata to build any FMRC slice (Horizon, RunTime, ValidTime, BestAvailable)
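As an illustration of step 2, here is a minimal sketch (not the PR's actual API) of turning the `.idx` sidecar text that accompanies each NODD GRIB file into a byte-range index table. The `.idx` format is one message per line, colon-separated: message number, byte offset, `d=` date, variable, level, forecast. The function and column names are hypothetical:

```python
# Hypothetical sketch: parse a GRIB ".idx" inventory into an index table.
# Each message's byte range ends where the next message's offset begins.
import pandas as pd

def parse_idx(idx_text: str, grib_uri: str) -> pd.DataFrame:
    rows = []
    for line in idx_text.strip().splitlines():
        msg, offset, date, var, level, forecast = line.split(":")[:6]
        rows.append({
            "uri": grib_uri,            # which grib file the message lives in
            "message": int(msg),
            "offset": int(offset),      # byte offset of the message
            "date": date.removeprefix("d="),
            "varname": var,
            "level": level,
            "forecast": forecast,
        })
    df = pd.DataFrame(rows)
    # Byte length of each message; the last message's length is unknown (NaN)
    # without the file size.
    df["length"] = df["offset"].shift(-1) - df["offset"]
    return df

idx = """\
1:0:d=2023010100:REFC:entire atmosphere:anl:
2:52000:d=2023010100:TMP:2 m above ground:anl:
3:98000:d=2023010100:UGRD:10 m above ground:anl:
"""
table = parse_idx(idx, "s3://noaa-hrrr-bdp-pds/hrrr.20230101/conus/hrrr.t00z.wrfsfcf00.grib2")
print(table[["message", "varname", "offset", "length"]])
```

Because the `.idx` files are tiny text files, scanning a year of them is cheap compared to opening every GRIB file; the per-product metadata from step 1 supplies what the inventory lines alone cannot (dtype, grid, zarr layout).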
Once the metadata is created for the grib files from one complete forecast run (for instance, 48 hourly files from the 00Z HRRR SFC product), it takes less than a minute to index a whole year of forecasts in a single Python process - no parallelism required. This speeds up building the aggregations. It does not speed up reading the data (that is next).
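To make the FMRC slices in step 3 concrete, here is a small hypothetical example (synthetic data, illustrative column names, not the PR's schema) of selecting two of the slices from an index table: a fixed-horizon slice and "best available", which keeps, for each valid time, the message from the most recent model run:

```python
# Hypothetical FMRC slicing over a synthetic index table.
import pandas as pd

index = pd.DataFrame({
    "run_time": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:00", "2023-01-01 00:00",
        "2023-01-01 06:00", "2023-01-01 06:00",
    ]),
    "horizon_h": [0, 6, 12, 0, 6],
})
# valid time = model run time + forecast lead time (horizon)
index["valid_time"] = index["run_time"] + pd.to_timedelta(index["horizon_h"], unit="h")

# Horizon slice: every message at a fixed lead time (here, 6 hours out).
horizon_6h = index[index["horizon_h"] == 6]

# Best available: for each valid time, keep the message from the latest run.
best = (index.sort_values("run_time")
             .groupby("valid_time", as_index=False)
             .last())
print(best)
```

The same grouping idea extends to RunTime and ValidTime slices; since the index is just a table, each slice is a cheap pandas selection rather than a re-scan of the GRIB files.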
A [Jupyter notebook](https://gist.github.com/emfdavid/89516e3f04bd46cf283f27ec0f22eeda) provides a brief demonstration of the capability.
Camus Energy is using this operationally with GEFS, GFS and HRRR grib2 files, available on NODD hosted cloud storage buckets. There is no requirements file or docker file included in this PR. There are extensive tests that can be shared later. To run the code you must install [kerchunk](https://github.com/fsspec/kerchunk) from GitHub, as the [grib_tree](https://github.com/fsspec/kerchunk/pull/399) code is not in the version 2.2 release.
This excerpt of our production code is a prototype for the community discussion that we hope can move into Kerchunk.
PR: emfdavid:r2.17.0 ← emfdavid:parallel_chunk_getitems (opened 01:52PM - 27 Feb 24 UTC)
# Prototype Parallel get_chunkitems
This is a draft implementation that chooses a particular parallel framework suitable for my application. This is not a generalized solution that could be merged into zarr... but it does show the potential of allowing parallelism at this critical point in zarr core.
TODO:
* [ ] Add unit tests and/or doctests in docstrings
* [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
* [ ] New/modified features documented in docs/tutorial.rst
* [ ] Changes documented in docs/release.rst
* [ ] GitHub Actions have all passed
* [ ] Test coverage is 100% (Codecov passes)
20 minutes - Community Showcase
40 minutes - Showcase discussion/Community check-ins
Forgot to mention during that talk that one of the key advantages of the parallel_chunk_getitems implementation is that it is fault tolerant, returning the fill value if there is a corrupted grib file.
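A minimal sketch of the idea, assuming nothing about zarr's real internals: fetch many chunks concurrently, and substitute the array's fill value for any chunk whose read fails (for example, a corrupted GRIB message). The `fetch_chunk` stand-in and chunk keys below are hypothetical:

```python
# Hypothetical sketch: parallel chunk fetching with fill-value fault tolerance.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

FILL_VALUE = np.nan   # the array's declared fill value
CHUNK_SHAPE = (4,)

def fetch_chunk(key: str) -> np.ndarray:
    # Stand-in for a remote byte-range read + GRIB decode; one key fails
    # to simulate a corrupted message.
    if key == "0.2":
        raise IOError("corrupted grib message")
    return np.full(CHUNK_SHAPE, float(key.split(".")[-1]))

def safe_fetch(key: str) -> np.ndarray:
    try:
        return fetch_chunk(key)
    except Exception:
        # Fault tolerance: a bad chunk becomes fill values instead of
        # failing the whole read.
        return np.full(CHUNK_SHAPE, FILL_VALUE)

keys = ["0.0", "0.1", "0.2", "0.3"]
# Threads suit this I/O-bound workload: each fetch waits on object storage.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = dict(zip(keys, pool.map(safe_fetch, keys)))

print({k: v[0] for k, v in chunks.items()})
```

The trade-off is the one noted above: corruption surfaces as fill values in the output rather than an exception, which is desirable for resilient ML training pipelines but should be monitored if silent gaps matter.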
A version of @emfdavid’s approach is now documented in the Kerchunk docs: Aggregation special cases — kerchunk documentation
(I can’t take any credit! I didn’t work on this documentation! I’m just linking it for anyone who stumbles on this Pangeo thread in the future…)