Making kerchunk as simple as a toggle?

As someone who recently discovered kerchunk and has to constantly reference the Kerchunk cookbook, I'm wondering whether it's a good idea (or even possible) to have kerchunk as a simple toggle kwarg in xr.open_dataset. Right now there are a lot of steps to remember: 1. generate the reference files, 2. wrap fsspec around those reference files, 3. pass that to xr.open_dataset (if these steps are even accurate).
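For reference, here's roughly what I mean, pieced together from the cookbook (file names are made up, and this assumes a local NetCDF4/HDF5 file):

import json
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# 1. Generate the reference file from the original NetCDF4/HDF5 file
with fsspec.open("unoptimized_file.nc") as f:
    refs = SingleHdf5ToZarr(f, "unoptimized_file.nc").translate()
with open("references.json", "w") as out:
    json.dump(refs, out)

# 2. Wrap an fsspec "reference" filesystem around the references
fs = fsspec.filesystem("reference", fo="references.json")

# 3. Pass the resulting mapper to xr.open_dataset via the zarr engine
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)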

I feel like it could be done, but I’ve only used kerchunk a teeny bit for examples so I don’t have too much context.

I imagine it could be used like xr.open_dataset("unoptimized_file.nc", kerchunk=True), which would generate the reference files in the current directory if they don't exist, or reuse the existing generated reference files. And, depending on the engine used, it would use the appropriate kerchunk backend, like xr.open_dataset("unoptimized_file.grib", engine="cfgrib", kerchunk=True).
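To make that concrete, here's a rough sketch of what such a toggle might do under the hood, building on the steps above (open_kerchunked and the ".kerchunk.json" naming are made up for illustration):

import os
import json
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

def open_kerchunked(path):
    # Hypothetical wrapper: cache the references next to the source file
    refs_path = path + ".kerchunk.json"
    if not os.path.exists(refs_path):
        # Generate the references on first use, reuse them afterwards
        with fsspec.open(path) as f:
            refs = SingleHdf5ToZarr(f, path).translate()
        with open(refs_path, "w") as out:
            json.dump(refs, out)
    fs = fsspec.filesystem("reference", fo=refs_path)
    return xr.open_dataset(
        fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
    )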

As an analogy, I'm thinking of how datashader can be used with hvplot by setting df.hvplot(datashade=True), and I'm hoping kerchunk can be that simple, but again I haven't used kerchunk extensively.


Funny timing, I was just looking into this last night, for the same reasons you mention 🙂

This can be done with an xarray backend, which would be usable like

ds = xr.open_dataset(refs, engine="kerchunk")
ds

where refs is the URL of the JSON or Parquet reference file, or the in-memory references.

If you need any additional keywords (like remote_protocol or remote_storage_options), those would still need to be passed in.
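For example, something like this (just a guess at how the keywords might be forwarded; the exact parameter names and the backend_kwargs routing are illustrative):

ds = xr.open_dataset(
    "references.json",
    engine="kerchunk",
    backend_kwargs={
        "remote_protocol": "s3",  # where the original bytes live
        "remote_storage_options": {"anon": True},
    },
)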

I opened Add `xarray.open_dataset` backend · Issue #360 · fsspec/kerchunk · GitHub to discuss adding this to kerchunk.


This is a cool idea!

It sounds like what you're suggesting, Tom, is simpler than what you originally suggested, Andrew: just eliminating the fsspec step, but not actually running kerchunk to generate the references automatically. Are you thinking that automatically running kerchunk from an xarray backend would be "too auto-magical"?

Ohh, I missed

it would generate the reference files in the current directory if it doesn’t exist–or use the existing generated reference files.

entirely. I was just thinking about the case where you already have references.

Doing that automatically does feel pretty magical… Lots of potential complications around things like reading files from remote filesystems, but maybe still worth doing.

I have noticed that cfgrib outputs a .idx file in the local directory automatically.

I don't see how this could work in general, for all the reasons that coo_map exists: when combining references, kerchunk needs to be told how to derive the coordinate values for the new dimension (from a data variable, a filename pattern, an attribute, etc.).
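For context, a sketch of what that looks like with MultiZarrToZarr today (the reference file names are hypothetical; "cf:time" is one of the documented coo_map selector forms):

from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["refs_0.json", "refs_1.json"],
    concat_dims=["time"],
    coo_map={"time": "cf:time"},  # decode the CF-encoded time variable to get coordinates
)
combined = mzz.translate()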

Maybe setting concat_dim like in xr.open_mfdataset?

Doing that automatically does feel pretty magical… Lots of potential complications around things like reading files from remote filesystems, but maybe still worth doing.

I believe we should identify the most common use-case and support that. Then for other cases, the user can drop down to the lower level.

Again, analogous to how hvplot covers most use cases → holoviews → hooks into bokeh/matplotlib → renders as bokeh/matplotlib figures.

Making the references locally seems like a form of caching, and the references don't store too much data. It seems doable for common cases, especially if the correct set of arguments for making the references is stored somewhere, say in a catalog.


If you made a backend that understood concat_dim via open_mfdataset (which would already require changes to xarray's backend entrypoint base class, I think), then you would also find that open_mfdataset(engine='kerchunk') could only deal with a subset of the cases open_mfdataset can normally handle: those with regular chunking. It would be another motivation for Zarr to support irregular chunking.

EDIT: It seems there are other scenarios that xarray.open_mfdataset's combining algorithms can handle which kerchunk currently cannot.

We'll be presenting our approach to making kerchunk usage simpler at next week's Pangeo Showcase.

The approach requires the use of a backend database to store the references, so it might not meet every use case. But it certainly improves the user experience and solves some consistency challenges!


@ahuang11 I’ve had a go at making something similar work in this notebook, see Refactor MultiZarrToZarr into multiple functions · Issue #377 · fsspec/kerchunk · GitHub.

The aim is to make generating references use xarray syntax instead:

ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',  # 'by_coords' would require actually reading coordinate data
    parallel=True,  # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects directly, not numpy/dask arrays

ds.kerchunk.to_json('newstore.zarr')  # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them

You would then still need to open the data the normal way from your new references, but the actual generation of the references becomes much more intuitive.
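For completeness, opening the data from the new references would then look the same as earlier in the thread (the reference filesystem reads the JSON written by to_json, whatever the file extension):

fs = fsspec.filesystem("reference", fo="newstore.zarr")  # path from the to_json call above
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)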
