Making kerchunk as simple as a toggle?

As someone who recently discovered kerchunk and has to constantly reference the Kerchunk cookbook, I am wondering whether it’s a good idea (or even possible) to have kerchunk as a simple toggle kwarg in xr.open_dataset. Right now I feel like there are a lot of steps to remember (1. generate the reference files, 2. wrap fsspec around those reference files, 3. pass that to xr.open_dataset), if these steps are even accurate.
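For concreteness, here is roughly what those steps look like for a single local netCDF4/HDF5 file, as far as I understand them (just a sketch; the filename is made up):

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# 1. generate the reference set for one file
with fsspec.open("unoptimized_file.nc", "rb") as f:
    refs = SingleHdf5ToZarr(f, "unoptimized_file.nc").translate()

# 2. wrap an fsspec "reference" filesystem around those references
fs = fsspec.filesystem("reference", fo=refs)

# 3. pass the resulting mapping to xr.open_dataset via the zarr engine
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                     backend_kwargs={"consolidated": False})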

I feel like it could be done, but I’ve only used kerchunk a teeny bit for examples so I don’t have too much context.

I imagine it could be used like xr.open_dataset("unoptimized_file.nc", kerchunk=True), and it would generate the reference files in the current directory if they don’t exist, or use the existing generated reference files. And, depending on the engine used, it would use the appropriate kerchunk backend, like xr.open_dataset("unoptimized_file.grib", engine="cfgrib", kerchunk=True).

As an analogy, I’m thinking of how datashader can be used with hvplot by setting df.hvplot(datashade=True) and I am hoping that kerchunk can be that simple, but again I haven’t used kerchunk extensively.


Funny timing, I was just looking into this last night, for the same reasons you mention :slight_smile:

This can be done with an xarray backend, which would be usable like

ds = xr.open_dataset(refs, engine="kerchunk")
ds

where refs is the URL of the JSON or Parquet reference file, or the in-memory references.

If you need any additional keywords (like remote_protocol, remote_storage_options), those would still need to be passed in.
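For comparison, today you would pass those through the fsspec reference filesystem yourself, something like this (a sketch; the bucket and file names are made up):

import fsspec
import xarray as xr

# the references describe data that actually lives on S3, so the reference
# filesystem needs to know how to reach the original bytes
fs = fsspec.filesystem(
    "reference",
    fo="s3://my-bucket/combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                     backend_kwargs={"consolidated": False})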

I opened Add `xarray.open_dataset` backend · Issue #360 · fsspec/kerchunk · GitHub to discuss adding this to kerchunk.


This is a cool idea!

It sounds like what you’re suggesting, Tom, is simpler than what you originally suggested, Andrew: just eliminating the fsspec step, but not actually running kerchunk to generate the references automatically as Andrew suggests. Are you thinking that running kerchunk automatically from an xarray backend would be “too auto-magical”?

Ohh, I missed

it would generate the reference files in the current directory if it doesn’t exist–or use the existing generated reference files.

entirely. I was just thinking about the case where you already have references.

Doing that automatically does feel pretty magical… Lots of potential complications around things like reading files from remote filesystems, but maybe still worth doing.

I have noticed that cfgrib outputs a .idx file in the local directory automatically.

I don’t see how this could work in general for all the reasons that coo_map exists.

Maybe setting concat_dim like in xr.open_mfdataset?
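For context, the lower-level combine step today looks something like this (a sketch; single_refs and the coordinate names are placeholders):

from kerchunk.combine import MultiZarrToZarr

# single_refs: a list of per-file reference dicts (e.g. SingleHdf5ToZarr outputs)
mzz = MultiZarrToZarr(
    single_refs,
    concat_dims=["time"],            # dimension to concatenate along
    coo_map={"time": "cf:time"},     # how to build the coordinate values
    identical_dims=["lat", "lon"],   # dims assumed identical across files
)
combined_refs = mzz.translate()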

Doing that automatically does feel pretty magical… Lots of potential complications around things like reading files from remote filesystems, but maybe still worth doing.

I believe we should identify the most common use case and support that. Then for other cases, the user can drop down to the lower level.

Again, this is analogous to hvplot covering most use cases → holoviews → hooks into bokeh/matplotlib → render as bokeh/matplotlib figures.

Making the references locally seems like a form of caching, and it doesn’t store too much data. It seems like it should be doable for common cases, and more generally whenever the correct set of arguments to make the references is stored somewhere, say in a catalog.
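A rough sketch of what that cached-toggle behaviour could look like under the hood (the helper name and sidecar-file layout are entirely hypothetical):

import json
import os

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

def open_with_cached_refs(path):
    # hypothetical helper: keep the references in a sidecar JSON next to the file
    ref_path = path + ".kerchunk.json"
    if not os.path.exists(ref_path):
        with fsspec.open(path, "rb") as f:
            refs = SingleHdf5ToZarr(f, path).translate()
        with open(ref_path, "w") as f:
            json.dump(refs, f)
    # reuse the cached references on subsequent opens
    fs = fsspec.filesystem("reference", fo=ref_path)
    return xr.open_dataset(fs.get_mapper(""), engine="zarr",
                           backend_kwargs={"consolidated": False})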


If you made a backend that understood concat_dim via open_mfdataset (which would already require changes to xarray’s backend entrypoint base class, I think), then you would also find that open_mfdataset(engine='kerchunk') could only deal with a subset of the cases open_mfdataset can normally handle: those with regular chunking. It would be another motivation to have Zarr support irregular chunking.

EDIT: It seems there are other scenarios that xarray.open_mfdataset’s combining algorithms can deal with but kerchunk currently cannot.

We’ll be presenting our approach to making kerchunk usage simpler at next week’s Pangeo Showcase.

The approach requires the use of a backend database to store the references, so it might not meet every use case. But it certainly improves the user experience and solves some consistency challenges!


@ahuang11 I’ve had a go at making something similar work in this notebook, see Refactor MultiZarrToZarr into multiple functions · Issue #377 · fsspec/kerchunk · GitHub.

The aim is to make generating references use xarray syntax instead:

ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',  # 'by_coords' would require actually reading coordinate data
    parallel=True,  # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects directly, not numpy/dask arrays

ds.kerchunk.to_json('newstore.zarr')  # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them

You would then still need to open the data the normal way from your new references, but the actual generation of the references becomes much more intuitive.


Awesome, this discussion is what I’ve been looking for for a long time without knowing what to ask for!

Creating individual JSONs and combining them is fine for me for now while I learn more about kerchunk, but I’m following along and keen to discuss prospects. Thanks!

@Michael_Sumner my previous comment has now developed into the VirtualiZarr package (see Pangeo Showcase: "VirtualiZarr: Create virtual Zarr stores using xarray syntax")

I have also suggested some ideas for how this might all be integrated upstream and made more automatic in Splitting out lazy indexing layer and backends layer as zarr-python features · Issue #9281 · pydata/xarray · GitHub and Zarr as a “universal reader” for netCDF etc., via new CF decoding codecs · Issue #303 · zarr-developers/zarr-specs · GitHub.


Excellent, thanks for the guidance. I didn’t realize that VirtualiZarr was the next stage(s).

I asked a question on gdal-dev (that was entirely off-base in terms of how to proceed), but the reply by dev Even was very helpful and I think I can probably contribute that on the GDAL side:

https://lists.osgeo.org/pipermail/gdal-dev/2024-July/059256.html

I’ll offer more specific feedback as I pivot in from various directions. :pray:

Just to add, this seems to me like the panacea for netCDF generally: providers would simply point us to their maintained virtual Zarr, and that could be used directly or as a way to sync “locally” the subset required, rather than “us” generating the JSON. Certainly we’ll be recasting our disk and object storage to include this mechanism now.

Also, is there any kerchunk effort on HDF4? That’s the real legacy format I had no way to access remotely, while everything else seems well covered now by various protocols. Kerchunk makes netCDF faster but it’s not actually needed for access per se, whereas with HDF4 there’s no remote access at all, so I don’t get why it’s not on the kerchunk list (??). Maybe I have a terminology or other confusion here :grinning:

Thanks for sharing this! I think the exact format of the references is still in flux here: kerchunk’s JSON/Parquet format exists, but we’re having active discussions about how exactly we could take this to its logical conclusion and represent the manifest in Zarr itself upstream, i.e. making Zarr a “SuperFormat”. This is the issue to follow: Manifest storage transformer · Issue #287 · zarr-developers/zarr-specs · GitHub.

HDF4

I don’t know much about HDF4, but there is some mention of it on the VirtualiZarr tracker, so I raised an issue to discuss it here: Support HDF4? · Issue #216 · zarr-developers/VirtualiZarr · GitHub


Also @ahuang11, on your original question, see this comment from today:

Basically, I think that once we have storage manifest transformers upstream in zarr-python, we could turn virtualizarr.ManifestArrays directly into zarr.Arrays. Then we could set it up so that you could use an engine='virtualizarr' kwarg to xr.open_dataset to basically achieve what you’re asking for above.
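Purely as an illustration of that end state (nothing here exists yet; the engine name is just the proposal above):

# hypothetical future usage, once ManifestArrays can become zarr.Arrays
ds = xr.open_dataset("unoptimized_file.nc", engine="virtualizarr")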


Kerchunk could probably do HDF4 if there were demand. We do do netCDF3, another legacy format.
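For example, the netCDF3 scanner works like this (a sketch; the filename is made up):

from kerchunk.netCDF3 import NetCDF3ToZarr

# scan a classic netCDF3 file and produce the same style of reference set
# as the HDF5 scanner
refs = NetCDF3ToZarr("legacy_file.nc").translate()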


Thanks! I just find it funny that I couldn’t find any mention of it, like maybe it was not possible at all… The NASA stores for HDF4 are immense, but probably it’s just an important data source in my circles (and probably not as important as it once was, for L1 sea ice and ocean colour).

I am having trouble finding a byte-by-byte spec for HDF4. Is this a “spec by code” case? I’d rather not read C and FORTRAN routines…