Hi all,
I’ve been impressed by kerchunk for datasets where the JSON references file is of manageable size.
I recently had a use-case, however, that required processing a massive 100 TB+ dataset (SS1330 CHIRP AIRS L1C radiances) spread across a very large number of files (1 million+, since the data are archived at 6-minute granularity), and the resulting references JSON was many GB.
For this particular use-case the reference files are especially large even with inlining and vlen encoding turned off, simply because there are so many URLs and variables that need to be referenced.
Others have already noted how this can create problems not only with memory usage but also with longer data access times.
(I also noticed this when looking at one of the case-study notebooks mentioned in the documentation, the one for the MUR SST dataset, which I would imagine has a larger references file than most: the benchmark numbers provided there show that although opening the full dataset was still a big improvement over pure netCDF, it still lags far behind the native zarr versions.)
That’s when the idea hit me: Rather than passing in entire JSON files, maybe it would make sense to store the kerchunk references themselves as another zarr store.
I don’t know if others have already tried something like this, but I used this approach to dramatically improve performance for use-cases with large numbers of files and variables, which is often the case for satellite datasets available from NASA Earthdata cloud.
I have written up the details, along with some quick benchmarks, in this notebook.
The approach is to generate a directory structure that mimics a zarr store, with small JSON files holding the byte ranges in place of the actual chunked array files, and then use an fsspec-supported filesystem mapping over that directory (wrapped by a preprocessor that formats the data correctly) as the input to the fsspec reference filesystem, in place of the JSON references file. The end result looks something like this:
import fsspec
import xarray as xr

local_fs = fsspec.filesystem('file')  # can also be s3fs, gcsfs, etc.
mapper = ReferencesPreprocessor(local_fs.get_mapper(store_name))  # store_name = path to the pseudo-zarr reference store
ref_fs = fsspec.filesystem('reference', fo=mapper, **storage_options)  # e.g. remote_protocol/remote_options for the data files
ds = xr.open_zarr(ref_fs.get_mapper(''), consolidated=True)
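To give the gist without opening the notebook: the store is just a directory laid out like a zarr store, except that each chunk file holds a tiny JSON reference (e.g. a key like rad/0.0.0 containing something like ["s3://bucket/granule.nc", 30127, 48214], where the variable name, path and numbers are only placeholders), alongside the usual .zarray/.zattrs/.zmetadata files. The preprocessor is then a thin read-only mapping over that directory which decodes each file into the object the reference filesystem expects. Roughly along these lines (a simplified sketch of the idea, not the exact code from the notebook):

import json
from collections.abc import Mapping

class ReferencesPreprocessor(Mapping):
    # Wraps an fsspec mapper over the pseudo-zarr store directory and lazily
    # converts each stored file into what the reference filesystem expects:
    # zarr metadata keys (.zarray, .zattrs, .zgroup, .zmetadata) pass through
    # as inline data, while chunk keys (e.g. 'rad/0.0.0') are decoded from
    # JSON into [url, offset, length] lists.
    def __init__(self, mapper):
        self._mapper = mapper
    def __getitem__(self, key):
        value = self._mapper[key]
        if key.rsplit('/', 1)[-1].startswith('.z'):
            return value  # metadata file: return the raw bytes as-is
        return json.loads(value)  # chunk reference: [url, offset, length]
    def __iter__(self):
        return iter(self._mapper)
    def __len__(self):
        return len(self._mapper)

The point is that references get read on demand as each key is requested, rather than everything being parsed up front from one giant JSON file.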
I have not tested this extensively yet, so I don't know how robust it is in a production setting, but the results so far are very promising. I would like to hear any thoughts and comments from others. If this proves useful, I think it would be great if some of these changes could make their way upstream into fsspec and/or kerchunk, since some aspects of the trick (particularly the preprocessor) feel kind of hacky to implement.