I’ve been trying to learn how forecast data is stored and queried (xref Webinar: Analysis Ready Weather Forecast Data Cubes with Zarr).
Many of the examples on this forum describe kerchunk references to the underlying GRIB files. This is cool, but won’t work well for point-wise timeseries queries.
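For context, here is roughly what the read side of those kerchunk references looks like (a minimal sketch; the reference file, variable name, and storage options are placeholders, not from any specific example). The point-wise problem comes from the chunk layout: each GRIB message becomes one chunk covering the full grid at a single time, so a timeseries at one point has to fetch and decode a whole 2D field per timestep.

```python
import xarray as xr

# Minimal sketch of opening a kerchunk reference set as a Zarr store
# (reference file name, variable name, and storage options are placeholders).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",            # kerchunk reference JSON (hypothetical)
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)

# Each GRIB message is one chunk spanning the full horizontal grid at one time,
# so a point timeseries reads and decodes an entire 2D field per timestep.
print(ds["t2m"].encoding.get("chunks"))  # typically (1, nlat, nlon)
point_ts = ds["t2m"].sel(latitude=40.0, longitude=255.0, method="nearest")
```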
So… how are people organizing / chunking their copies of forecast data today?
Thanks for sharing @dcherian , really cool work across these threads. Can you say a little more about why those clever kerchunk reference tricks don’t work well for point-wise timeseries?
I don’t have a good answer here, but I’m very interested in this area and happy to share what we are doing at present. I help run an ongoing ecological forecast challenge (see NEON Ecological Forecast Challenge - Forecasting Challenge) predicting a handful of ecological variables measured at ~80 sites around the US, where we have simply downscaled the GEFS ensemble forecasts to (geo)parquet. We’ve found this works reasonably well for timeseries queries (forecasts of course have two notions of time: the time being predicted and the forecast horizon) as long as we avoid creating too many small parquet partitions. Obviously, reducing to sites means we’re not dealing with a continuous notion of space, so this doesn’t really address the issue here!
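For anyone curious what that layout looks like in practice, here is a minimal sketch (column names and partitioning choices are illustrative, not the challenge’s actual schema): each row carries both notions of time, and partitioning only on the coarse reference time keeps the number of parquet files manageable.

```python
import pandas as pd

# Illustrative site-level forecast table; column names are hypothetical.
df = pd.DataFrame({
    "reference_datetime": ["2024-06-01"] * 4,                      # when the forecast was issued
    "datetime": pd.to_datetime(["2024-06-02", "2024-06-03"] * 2),  # the time being predicted
    "site_id": ["HARV", "HARV", "OSBS", "OSBS"],
    "ensemble": [1, 1, 2, 2],
    "prediction": [18.2, 19.1, 24.5, 25.0],
})

# Partition only on the coarse issue time: partitioning additionally on
# site_id or ensemble would create lots of tiny files.
df.to_parquet("gefs_downscaled/", partition_cols=["reference_datetime"], engine="pyarrow")

# A point timeseries for one site is then a simple predicate pushdown.
ts = pd.read_parquet("gefs_downscaled/", filters=[("site_id", "==", "HARV")])
```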
Our more spatially explicit ecological forecasts have been small scale, where we just use a typical COG + STAC pattern. Simple, but obviously this doesn’t scale particularly well for large ensembles…
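Something like this is what a minimal version of that pattern looks like on the read side (catalog URL, collection, and search parameters are placeholders, not our actual setup):

```python
import pystac_client
import stackstac

# Hypothetical STAC catalog and collection, purely for illustration.
catalog = pystac_client.Client.open("https://example.com/stac")
search = catalog.search(
    collections=["ecological-forecast-cogs"],
    bbox=[-82.0, 29.6, -81.9, 29.7],
    datetime="2024-06-01/2024-06-30",
)

# Lazily stack the matching COGs into a (time, band, y, x) DataArray.
da = stackstac.stack(search.item_collection())
print(da.sizes)
```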
Not saying these are good, scalable solutions, just what we’re doing at present. If they have a virtue, it is their relative simplicity and familiarity, but that may only be because I’m still a total novice with kerchunk & zarr…
This sounds like it’ll be an interesting thread!
At Open Climate Fix, we forecast solar PV power (and a little bit of wind power). We generally give our ML models a patch of NWP and satellite data centered over the site of interest; for example, a 24 x 24 square of NWP data (in the x and y dimensions).
We generally convert all our data to Zarr first.
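As a concrete sketch of that patch extraction, assuming a Zarr store with projected x/y coordinates (store path, coordinate names, and edge handling are all simplified and not our actual code):

```python
import xarray as xr

nwp = xr.open_zarr("nwp.zarr")  # hypothetical store

def patch_around(ds, x_site, y_site, half=12):
    """Return a 24 x 24 patch centered on the grid point nearest the site.

    Edge handling is omitted for brevity.
    """
    ix = int(abs(ds["x"] - x_site).argmin())
    iy = int(abs(ds["y"] - y_site).argmin())
    return ds.isel(x=slice(ix - half, ix + half), y=slice(iy - half, iy + half))

sample = patch_around(nwp, x_site=123_000.0, y_site=456_000.0)
assert sample.sizes["x"] == 24 and sample.sizes["y"] == 24
```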
Before ML training, we create the exact batches that we’ll train our ML model on, and save these as NetCDF files. But this process sucks for a range of reasons (which I go into in detail in this blog post: Helping to speed up Zarr).
If you haven’t read it already, this discussion might be of interest: Please share your use-cases for Zarr (to help inform benchmarking) · zarr-developers/zarr-python · Discussion #1486 · GitHub
I’m late to the party, but two examples for structuring weather data come to mind.
First is weatherbench2: WeatherBench 2 Data Guide — WeatherBench 2 documentation
Second, which I only discovered today, is this project from NVIDIA: Data Movement — Earth2Studio 0.2.0a0 documentation
Both follow very similar, if not identical, conventions.
On chunking: yeah, there’s an inherent tradeoff between organizing the data by space vs time. We did some experiments on a related project only to find that you can’t support both use cases well at the same time (chunking schema for analysis-ready? · Issue #12 · google-research/arco-era5 · GitHub). We took a different approach to support the timeseries-style query use case.
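To make the tradeoff concrete, here is a minimal sketch (synthetic data, illustrative dimension names and chunk sizes) of the same cube written with the two layouts:

```python
import numpy as np
import xarray as xr

# Synthetic (time, latitude, longitude) cube, purely for illustration.
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"),
             np.zeros((240, 180, 360), dtype="float32"))}
)

# Map-style layout: one chunk holds the full grid for a single timestep.
# A snapshot is one read; a long point timeseries touches every chunk.
ds.chunk({"time": 1, "latitude": -1, "longitude": -1}).to_zarr("space_optimized.zarr", mode="w")

# Timeseries layout: one chunk holds the full time range for a small tile.
# A point timeseries is one read; a global snapshot touches every chunk.
ds.chunk({"time": -1, "latitude": 24, "longitude": 24}).to_zarr("time_optimized.zarr", mode="w")
```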
If data is in Zarr, I expect the consumer to be an ML algorithm. The model specifics determine the spatial chunking particulars; a nowcast will use a local patch whereas a midrange forecast would probably use the whole globe. Though, if the users are handling analytics queries, then being able to scan across long ranges of time (and small areas) will be common.
I like Jack’s approach of 24 x 24 tiles as a happy middle. FWIW, Google Earth Engine seems to use tiles of 256 x 256 pixels (https://www.sciencedirect.com/science/article/pii/S0034425717302900#:~:text=Images%20ingested%20into,issuing%20additional%20reads).
Thanks all.
It’s interesting to me that everyone is choosing the “data cube” model rather than the group-centric model proposed in Xarray and collections of forecasts. Indeed, that’s what I chose too. It works well at write time, but clearly has issues for some time-centric read queries.
To that end, I wrote a custom Xarray index to help with that “FMRC”-style indexing (though it doesn’t solve the performance challenges). Followup post coming soon!
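Ahead of that post, here is a minimal sketch of the kind of query FMRC-style indexing is meant to make easy, assuming a cube with init_time and lead_time dimensions (names and data are illustrative, not the custom index’s API): building the “best estimate” series, where each valid time comes from the most recent initialization.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative forecast cube with the two notions of time as dimensions.
init = pd.date_range("2024-06-01", periods=4, freq="6h")
lead = pd.timedelta_range("0h", "18h", freq="6h")
cube = xr.DataArray(
    np.arange(len(init) * len(lead), dtype="float32").reshape(len(init), len(lead)),
    coords={"init_time": init, "lead_time": lead},
    dims=("init_time", "lead_time"),
)

# Valid time is initialization time plus lead time.
valid_time = cube["init_time"] + cube["lead_time"]

# "Best estimate" series: for each valid time, keep the most recent init.
# Stacking is init-major, so .last() within each group picks the latest init.
best = (
    cube.assign_coords(valid_time=valid_time)
    .stack(forecast=("init_time", "lead_time"))
    .groupby("valid_time")
    .last()
)
print(best)
```

Other FMRC-style views (all forecasts valid at one time, or a constant-lead slice) fall out of the same valid_time coordinate.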