This is an invite for anyone interested in the kerchunk project! What are you hoping we will work on this year, what do you think you might be able to contribute, and which archives/datasets do you think we can target for kerchunking?
Let’s get some thoughts down in this thread (please invite anyone you think might be interested), and we can then schedule a live discussion to plan out where the project can go next.
I’d like to have more patterns to combine references from different files. Currently, MultiZarrToZarr is the only way to do the combination, and it is always a (multi-dimensional?) concat.
Some ocean models use separate files (e.g. grid files) for information that is constant in time, which would need a merge instead of a concat to be included.
(this is already possible using custom code on the raw references, so I guess I’m requesting a higher-level API for these)
Edit: this may be a slippery slope, though, since we don’t actually want to reimplement xarray in kerchunk. So I realize we’ll need to draw a line somewhere.
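To make the request concrete, here is a minimal sketch of the kind of "custom code on the raw references" mentioned above: merging references from a static grid file into a time-varying dataset's reference set, at the version-1 reference-dict level. The filenames, variable names, and byte ranges are made up for illustration; only the reference format itself follows the kerchunk spec.

```python
# Hypothetical sketch: merge raw kerchunk references (version-1 format)
# from a static grid file into a time-varying ocean dataset's references.
# All URLs, keys, and byte ranges below are illustrative.

def merge_refs(base, extra):
    """Copy every reference from `extra` into a copy of `base`.
    No concat dimension is involved -- this is a plain merge."""
    merged = {"version": 1, "refs": dict(base["refs"])}
    merged["refs"].update(extra["refs"])
    return merged

ocean = {"version": 1, "refs": {
    ".zgroup": '{"zarr_format": 2}',
    # chunk key -> [url, byte offset, byte length]
    "temp/0.0": ["s3://bucket/ocean_2020.nc", 8192, 4096],
}}
grid = {"version": 1, "refs": {
    "lat_bnds/0.0": ["s3://bucket/grid.nc", 512, 2048],
}}

combined = merge_refs(ocean, grid)
```

A higher-level API would presumably also reconcile conflicting `.zattrs`/`.zarray` metadata, which is where it starts to resemble xarray's merge logic.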
Regarding the current API, I think MultiZarrToZarr currently does too much at once, leading to a somewhat unsatisfactory API. Instead, I believe it would benefit from being split into smaller functions / classes, like concat, merge, or creating 0d variables from attributes / metadata (filepath?).
I am interested in prototyping kerchunk for CMIP7.
Specifically, I would like to come up with a recommendation for how to postprocess/rechunk netcdfs as part of the ESGF publishing requirements, so that a kerchunk index of netcdfs on cloud storage provides “good enough” performance compared to converting to zarrs.
Maybe this is something that has already been done though?
Either way, if there are specific recommendations we can derive, my hope is that we will not have to duplicate all of CMIP7 (estimated 100+ PB) as zarrs.
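The reason kerchunk avoids that duplication is worth spelling out: a reference file is just a small JSON mapping of zarr chunk keys to `[url, offset, length]` triples into the original netCDF bytes. A toy version-1 reference for a single (made-up) CMIP-style file, to show the scale difference:

```python
import json

# Illustrative version-1 kerchunk reference set for one netCDF on object
# storage. The bucket path, variable, and byte ranges are invented; the
# point is that only this small JSON needs storing -- the netCDF bytes
# themselves stay where they are.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "tas/.zarray": json.dumps({
            "shape": [12, 180, 360], "chunks": [1, 180, 360],
            "dtype": "<f4", "compressor": None, "fill_value": None,
            "filters": None, "order": "C", "zarr_format": 2,
        }),
        # chunk key -> [url, byte offset, byte length]
        "tas/0.0.0": ["s3://esgf-bucket/tas_mon_2020.nc", 20480, 259200],
    },
}

index_size = len(json.dumps(refs))  # hundreds of bytes, not petabytes
```

The open question for CMIP7 is then only whether the chunking inside the published netcdfs is good enough for cloud reads, since kerchunk can only point at chunks that already exist.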
I’m definitely curious to see if there are ways to enable kerchunk to use the Rust IO backend I’m hacking away on at the moment. (The ultimate aim would be to enable kerchunk to go even faster!) But it’s still a few months before my Rust stuff is anywhere close to being ready for use, and I haven’t looked in detail at the kerchunk code, so I have no idea if this avenue is even vaguely sensible (no worries if it’s not!)
I was half-thinking about rebooting the rust-rs project specifically for handling kerchunk. Of course, rfsspec already exists, and it showed that IO for remote data could be improved only marginally by using tokio instead of asyncio. Your interests are more in super high bandwidth on local storage, though; but I am prepared to be proved wrong on remote bytes too.
The presentation/advertisement I made for kerchunk’ing GRIB files at AMS this week was pretty well received. I’m following up and gathering feedback from folks to see where there may be opportunities to further improve the GRIB utilities in the library.
I also plan on spending a bit of time this quarter working through some of the MultiZarrToZarr helpers/wrappers I’ve built to see if there are common patterns which would make sense to bring upstream into kerchunk in some way.
The GRIB access seems to be a killer use case for the weather folks at the moment - a way to create ARCO-like and “good enough” cloud datasets for end users without expensive data engineering commitments.
Thanks for starting this discussion! From a feature development standpoint, I am most interested in a way to validate that the data have not changed from when kerchunk was used to create a reference file (i.e., checksums) and simplifying/standardizing the API. I also think it would be really valuable for the community to agree on a roadmap for which features will get upstreamed once Zarr V3 is finalized and the “chunk manifest” idea comes to fruition, to avoid the same work happening in multiple places.
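One possible shape for the checksum idea above: record a digest of each source file at reference-creation time, then verify it before trusting the references. A minimal sketch, assuming a hypothetical `"checksums"` field that is not part of the current kerchunk spec:

```python
import hashlib

# Sketch only: store a SHA-256 digest per source URL alongside the
# references, and check it before use. The "checksums" field is a
# hypothetical extension, not an existing kerchunk feature.

def add_checksum(refs, url, data: bytes):
    """Record the digest of a source file's bytes at creation time."""
    refs.setdefault("checksums", {})[url] = hashlib.sha256(data).hexdigest()

def verify(refs, url, data: bytes) -> bool:
    """Return True if the bytes still match the recorded digest."""
    recorded = refs.get("checksums", {}).get(url)
    return recorded == hashlib.sha256(data).hexdigest()

refs = {"version": 1, "refs": {}}
payload = b"netcdf bytes at creation time"
add_checksum(refs, "s3://bucket/file.nc", payload)

still_valid = verify(refs, "s3://bucket/file.nc", payload)
```

In practice one might checksum only the referenced byte ranges rather than whole files, since a netCDF can legitimately be appended to without invalidating existing references.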
I’d also be excited to contribute to a Pangeo data commons for Kerchunk references (or at least a catalog of existing references).
I’m interested in having kerchunk as an on-the-fly caching mechanism for earthaccess, especially for those datasets with very nested data (ICESat-2). I think variable-length chunking is a no-go for Zarr at the moment, but it would be cool if we could be ahead of the curve for when this gets implemented.
I’d like to engage on the subject of a zarr-v3 spec compatible manifest spec (one kerchunk could use but may be separable). If such a spec could also include a way to represent concatenation of multiple arrays with varying codecs, that would also be a killer thing to support.
Does this “chunk manifest” proposal exist anywhere?
The aspiration has been mentioned in a number of places, but no. It would be roughly equivalent to the internal state of an xarray dataset following concat/merge.
Note that kerchunk could do all this entirely by itself: present virtual chunks to zarr which have already been extracted from a non-matching grid, potentially with different codec/params for each constituent. It has been argued to me repeatedly that this belongs in zarr (not kerchunk, not dask, not xarray), but given that I already made one implementation of at least var-chunks, I remain to be convinced.
Very excited to join this group. Kerchunk has been pivotal to forecasting the electric grid at Camus Energy using NODD weather forecasts. With Martin’s help I have already been able to merge some of my work on building data trees from GRIB files into kerchunk. Other pieces are operational within Camus Energy, but only shared as a prototype for the community because they are pretty use-case specific. By narrowing the use case to NODD GRIB files, I was able to build FMRC aggregations much faster and more flexibly: reading the small idx files rather than the big GRIB files to index the dataset, then creating the zarr data from that index at request time. Not sure where this goes, but reading the idx files for GRIB is a huge win.
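For anyone unfamiliar with the idx trick: NOAA publishes a small `.idx` sidecar next to each GRIB file listing the starting byte offset of every message, so byte ranges can be computed from the tiny sidecar without ever scanning the large GRIB itself. A sketch of that computation (the sample lines follow the real NODD/wgrib inventory format, but the date and offsets here are invented):

```python
# Sketch: derive (offset, length) byte ranges for GRIB messages from a
# .idx inventory. Each line is "msg_num:offset:date:var:level:forecast:";
# a message's length is the next message's offset minus its own.
# The sample content below is illustrative, not from a real file.

SAMPLE_IDX = """\
1:0:d=2023010100:PRMSL:mean sea level:anl:
2:502150:d=2023010100:TMP:2 m above ground:anl:
3:1000000:d=2023010100:UGRD:10 m above ground:anl:
"""

def parse_idx(text, total_size=None):
    """Return {"VAR:level": (offset, length)} for each GRIB message.
    `total_size` (size of the GRIB file) bounds the last message."""
    rows = [line.split(":") for line in text.strip().splitlines()]
    out = {}
    for i, row in enumerate(rows):
        offset = int(row[1])
        end = int(rows[i + 1][1]) if i + 1 < len(rows) else total_size
        length = None if end is None else end - offset
        out[f"{row[3]}:{row[4]}"] = (offset, length)
    return out

index = parse_idx(SAMPLE_IDX, total_size=1500000)
```

Those (offset, length) pairs are exactly what a kerchunk reference needs per chunk, which is why indexing from the idx files alone is such a speedup for FMRC-style aggregations.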