Kerchunk planning

This is an invite for anyone interested in the kerchunk project! What are you hoping we will work on this year, what do you think you might be able to contribute, and which archives/datasets do you think we can target for kerchunking?

Let’s get some thoughts down in this thread (please invite anyone you think might be interested), and we can then set up a live discussion to map out where the project can go next.

10 Likes

@rsignell @maxrjones @jhamman please invite others

I’m interested; I think Kerchunk is super important, and I have only relatively recently started using it myself.

Let’s get some thoughts down in this thread

I’m interested in improving the UI of kerchunk as an abstraction for concatenating files. I’ve written my thoughts about refactoring MultiZarrToZarr in this issue (also see Dataclass for "VirtualZarrStore" · Issue #375 · fsspec/kerchunk · GitHub). This also relates to Zarr and Pangeo-forge.

2 Likes

I’d like to have more patterns to combine references from different files. Currently, MultiZarrToZarr is the only way to do the combination, and it is always a (multi-dimensional?) concat.

Some ocean models use separate files (e.g. grid files) for information that is constant in time, which would need a merge instead of a concat to be included.

(this is already possible using custom code on the raw references, so I guess I’m requesting a higher-level API for these)
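
For illustration, here is roughly what that custom code looks like on the raw reference dicts today (a minimal sketch; `merge_refs` is a made-up helper, not part of kerchunk, and the file layout is assumed):

```python
import copy

def merge_refs(data_refs: dict, grid_refs: dict) -> dict:
    """Merge two kerchunk reference sets (version 1 format): keep the
    time-varying dataset and add the static grid variables alongside it.
    merge_refs is a hypothetical helper, not an existing kerchunk function."""
    merged = copy.deepcopy(data_refs)
    for key, ref in grid_refs["refs"].items():
        # the root group metadata is already provided by data_refs
        if key in (".zgroup", ".zattrs"):
            continue
        merged["refs"][key] = ref
    return merged
```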

Edit: this may be a slippery slope, though, since we don’t actually want to reimplement xarray in kerchunk. So I realize we’ll need to draw a line somewhere.

Regarding the current API, I think MultiZarrToZarr does too much at once, leading to a somewhat unsatisfactory interface. Instead, I believe it would benefit from being split into smaller functions / classes, like concat, merge, or creating 0d variables from attributes / metadata (filepath?).

What I would love is to let xarray do the combine and somehow persist that information, but it doesn’t seem possible.

We do have some functions for different combine styles, but there would be nothing wrong with providing more patterns. This is particularly important for datasets which are not netCDF-compliant.
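
For instance, alongside MultiZarrToZarr there is already kerchunk.combine.merge_vars; a minimal sketch of the usual pattern (the reference-set variables here are placeholders):

```python
from kerchunk.combine import MultiZarrToZarr, merge_vars

# concatenate along time across per-file reference sets
# (single_file_refs: e.g. the outputs of SingleHdf5ToZarr, one per file)
combined = MultiZarrToZarr(
    single_file_refs,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
).translate()

# merge variables from reference sets that share a grid
merged = merge_vars([combined, grid_refs])
```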

This is exactly what the issue I linked above is suggesting. Refactor MultiZarrToZarr into multiple functions · Issue #377 · fsspec/kerchunk · GitHub

If we followed Ryan’s suggestion in that issue to make Zarr arrays concat-able, would that be a path towards this? Treat Zarr like a duck-array inside xarray…

(I think we should continue this discussion on that kerchunk issue so everyone else who commented there sees it)

1 Like

I am interested in prototyping kerchunk for CMIP7.
Specifically, I would like to come up with a recommendation for how to postprocess/rechunk netCDF files as part of the ESGF publishing requirements, so that a kerchunk index of netCDF files on cloud storage provides “good enough” performance compared to converting to Zarr.

Maybe this is something that has already been done though?

Either way, if there are specific recommendations we can derive, my hope is that we will not have to duplicate all of CMIP7 (an estimated 100+ PB) as Zarr.
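
To make that concrete, the pattern I would want to benchmark looks roughly like this (the bucket and paths are placeholders, and the chunking of the source files is the variable under test):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3", anon=True)
urls = ["s3://" + p for p in fs.glob("s3://some-esgf-bucket/cmip/*.nc")]  # placeholder

# build a reference set per netCDF file, without copying any data
refs = []
for u in urls:
    with fs.open(u, "rb") as f:
        refs.append(SingleHdf5ToZarr(f, u, inline_threshold=300).translate())

# combine along time and open the whole thing as one virtual Zarr store
combined = MultiZarrToZarr(refs, concat_dims=["time"],
                           identical_dims=["lat", "lon"]).translate()
ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": combined, "remote_protocol": "s3",
                            "remote_options": {"anon": True}},
    },
)
```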

1 Like

I’m definitely curious to see if there are ways to enable kerchunk to use the Rust IO backend I’m hacking away on at the moment (the ultimate aim being to enable kerchunk to go even faster!). But it’s still a few months before my Rust stuff is anywhere close to being ready for use 🙂. And I haven’t looked in detail at the kerchunk code, so I have no idea if this avenue is even vaguely sensible (no worries if it’s not!)

1 Like

I was half-thinking about rebooting the rust-rs project specifically for handling kerchunk. Of course, rfsspec already exists, and it showed that IO for remote data could be improved only marginally by using tokio instead of asyncio. Your interests are more in super-high bandwidth on local storage, though; but I am prepared to be proved wrong about remote bytes too.

1 Like

The presentation/advertisement I made for kerchunk’ing GRIB files at AMS this week was pretty well received. I’m following up and gathering feedback from folks to see where there may be opportunities to further improve the GRIB utilities in the library.

I also plan on spending a bit of time this quarter working through some of the MultiZarrToZarr helpers/wrappers I’ve built to see if there are common patterns which would make sense to bring upstream into kerchunk in some way.

The GRIB access seems to be a killer use case for the weather folks at the moment - a way to create ARCO-like and “good enough” cloud datasets for end users without expensive data engineering commitments.
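
For anyone who hasn’t tried the GRIB path yet, the entry point is kerchunk.grib2.scan_grib; a minimal sketch against a public NODD HRRR file (the exact path is just an example):

```python
from kerchunk.grib2 import scan_grib

# scan_grib returns one reference set per GRIB message;
# the filter= argument can restrict which messages get indexed
messages = scan_grib(
    "s3://noaa-hrrr-bdp-pds/hrrr.20230119/conus/hrrr.t00z.wrfsfcf01.grib2",
    storage_options={"anon": True},
)
print(f"indexed {len(messages)} messages")
```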

4 Likes

@martindurant has agreed to give the Pangeo Showcase talk on Feb 14, “What’s next for Kerchunk” (Meta / Pangeo Showcase - Pangeo), discussing the current and future state of Kerchunk. Hope you can make it!

8 Likes

Thanks for starting this discussion! From a feature development standpoint, I am most interested in a way to validate that the data have not changed from when kerchunk was used to create a reference file (i.e., checksums) and simplifying/standardizing the API. I also think it would be really valuable for the community to agree on a roadmap for which features will get upstreamed once Zarr V3 is finalized and the “chunk manifest” idea comes to fruition, to avoid the same work happening in multiple places.
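
To sketch the checksum idea (`validate_refs` is a hypothetical helper, and storing digests alongside the references is an assumption, not an existing kerchunk feature):

```python
import hashlib
import fsspec

def validate_refs(refs: dict, expected: dict, **storage_options) -> list[str]:
    """Re-hash every referenced byte range and return the chunk keys whose
    bytes no longer match the digests recorded when the references were made.
    `expected` maps chunk key -> sha256 hex digest (a hypothetical sidecar)."""
    stale = []
    for key, ref in refs["refs"].items():
        if not (isinstance(ref, list) and len(ref) == 3):
            continue  # skip inline data and whole-file references
        url, offset, length = ref
        fs = fsspec.filesystem(fsspec.utils.get_protocol(url), **storage_options)
        data = fs.cat_file(url, start=offset, end=offset + length)
        if hashlib.sha256(data).hexdigest() != expected.get(key):
            stale.append(key)
    return stale
```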

I’d also be excited to contribute to a Pangeo data commons for Kerchunk references (or at least a catalog of existing references).

5 Likes

Hello everyone!

I am interested in using Kerchunk for the main project at my work, which practically means working with massive time series (i.e. 20+ years of half-hourly values stored in daily NetCDF files, at first just reading data locally). There is quite a lot to learn, so I decided to create a CLI for it (and a bit more on chunks; see also recent messages in Feature/simple cli for chunking local or remote NetCDF files by steph-ben · Pull Request #319 · fsspec/kerchunk · GitHub).

I would also like to understand how you envision the future of the project. Will it stay relevant, or is this a transitional period until much of its functionality ends up in Xarray, Zarr, etc.?

Kind regards, Nikos

I’m interested in having kerchunk as an on-the-fly caching mechanism for earthaccess, especially for those datasets with deeply nested data (ICESat-2). I think variable-length chunking is a no-go for Zarr at the moment, but it would be cool if we could be ahead of the curve for when it gets implemented.

1 Like

I’d like to engage on the subject of a zarr-v3 spec compatible manifest spec (one kerchunk could use, but which may be separable). If such a spec could also include a way to represent concatenation of multiple arrays with varying codecs, that would be a killer thing to support.

3 Likes

zarr-v3 spec compatible manifest spec

Does this “chunk manifest” proposal exist anywhere in writing yet? I don’t see it under the draft ZEPs.

Does this “chunk manifest” proposal exist anywhere

The aspiration has been mentioned in a number of places, but no. It would be roughly equivalent to the internal state of an xarray dataset following concat/merge.

Note that kerchunk could do all this entirely by itself: present virtual chunks to zarr which have already been extracted from a non-matching grid, potentially with different codecs/params for each constituent. It has repeatedly been argued to me that this belongs in zarr (not kerchunk, not dask, not xarray), but given that I have already made one implementation of at least var-chunks, I remain to be convinced.
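
For concreteness, the shape people seem to have in mind is something like this (a pure sketch, since no agreed spec exists; the codec override illustrates the “different codec/params per constituent” case):

```python
# hypothetical per-array chunk manifest: chunk index -> location of its bytes
manifest = {
    "0.0": {"path": "s3://bucket/file_a.nc", "offset": 4096, "length": 32768},
    "1.0": {"path": "s3://bucket/file_b.nc", "offset": 8192, "length": 32768,
            # per-source codec override for the non-matching case above
            "codecs": [{"id": "zlib", "level": 4}]},
}
```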

Very excited to join this group. Kerchunk has been pivotal to forecasting the electric grid using NODD weather forecasts at Camus Energy. With Martin’s help I have been able to merge some of my work on building data trees from GRIB files into kerchunk already. Other pieces are operational within Camus Energy, but only shared as a prototype for the community because they are pretty use-case-specific.

By narrowing the use case to NODD GRIB files I was able to build FMRC aggregations much faster and more flexibly: reading the small .idx files rather than the big GRIB files to index the dataset, then creating the Zarr data from that index at request time. Not sure where this goes, but reading the .idx files for GRIB is a huge win.
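
For the curious, the .idx sidecars are tiny colon-delimited text files, so building an index is cheap; a minimal sketch (`read_idx` is a made-up helper, and the field layout follows the NODD convention):

```python
import fsspec

def read_idx(url: str, **storage_options) -> list[dict]:
    """Parse a GRIB .idx sidecar into one byte range per message.
    Lines look like '1:0:d=2023011900:TMP:2 m above ground:anl:'."""
    with fsspec.open(url, "rt", **storage_options) as f:
        rows = [line.strip().split(":") for line in f if line.strip()]
    out = []
    for i, row in enumerate(rows):
        start = int(row[1])
        # the last message runs to the end of the file
        stop = int(rows[i + 1][1]) if i + 1 < len(rows) else None
        out.append({"varname": row[3], "level": row[4], "step": row[5],
                    "start": start, "stop": stop})
    return out

# e.g. read_idx("s3://noaa-hrrr-bdp-pds/hrrr.20230119/conus/"
#               "hrrr.t00z.wrfsfcf01.grib2.idx", anon=True)
```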

What I would love is to let xarray do the combine and somehow persist that information, but it doesn’t seem possible.

FYI I just had a go at doing that, and got fairly far (notebook here).

1 Like

What I would love is to let xarray do the combine and somehow persist that information, but it doesn’t seem possible.

FYI I just had a go at doing that, and got fairly far (notebook here).

For an update on this see VirtualiZarr (there will be a showcase talk about it soon).