Recently I have been spending a lot of time experimenting with kerchunk, and I have been quite impressed by the performance so far. However, most of the datasets I have tried until now were spanned by a relatively small number of files for their size, which in turn means the JSON reference files were also of manageable size. I am now facing a use case that requires processing a massive 100 TB+ dataset (SS1330 CHIRP AIRS L1C radiances) spanned by a very large number of files (1 million+, since the data is archived at 6-minute granularity). I have quickly come to realize that for this use case my reference files are especially large even when turning off inlining and vlen encoding, simply because there are a huge number of URLs and variables that need to be referenced. Others have already noted how this can create problems with not only memory usage but also longer data access times. I also noticed this in one of the case-study notebooks mentioned in the documentation, for the MUR SST dataset, which I imagine also has a larger reference file than most: the provided benchmark numbers show that although opening the full dataset was still a big improvement over pure netCDF, it still lags far behind the native Zarr version. That is when the idea hit me: rather than passing in an entire JSON file, maybe it would make sense to re-adapt the kerchunk references directly as just another Zarr store, laid out as it would look on a real filesystem. I do not know if others have already tried something like this, but I have managed to develop a trick that dramatically improves performance for these use cases with a relatively large number of files and variables, which is more often than not the case for the satellite datasets available from the NASA Earthdata cloud.
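For context, this is roughly what a kerchunk reference file looks like: a single JSON document where every chunk of every variable in every source file gets its own entry, either inline data or a `[url, offset, length]` triple. The bucket and file names below are made up for illustration; the point is that a million-granule dataset multiplies these entries into an enormous document.

```python
# Hypothetical excerpt of a version-1 kerchunk reference document.
# Values are either inline strings (metadata) or [url, offset, length]
# byte-range references into the original files.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "rad/.zarray": '{"shape": [90, 135]}',  # variable metadata, abridged
        "rad/0.0": ["s3://bucket/airs_l1c_000001.nc", 20480, 65536],
        "rad/0.1": ["s3://bucket/airs_l1c_000001.nc", 86016, 65536],
        # ...one entry per chunk, per variable, per granule...
    },
}
```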
I have provided the details, with some quick benchmarks, in this notebook, and thought I would share them here knowing that many other users of zarr/kerchunk are members of this community. Without going into full detail, the basic gist of the trick is to generate a pseudo-Zarr-store directory structure, with the actual JSON byte-range references taking the place of the chunked array files, and then use an fsspec-supported filesystem mapping (wrapped by a preprocessor to format the data correctly) as the input to the fsspec reference filesystem, in place of the JSON references file. The end result looks something like this:
```python
local_fs = fsspec.filesystem('file')  # can also be s3fs, gcsfs, etc.
mapper = ReferencesPreprocessor(local_fs.get_mapper(store_name))
ref_fs = fsspec.filesystem('reference', fo=mapper, **storage_options)
ds = xr.open_zarr(ref_fs.get_mapper(''), consolidated=True)
```
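For concreteness, here is a minimal sketch of what a wrapper like `ReferencesPreprocessor` might look like; this is my guess at the shape of the idea, not the implementation from the notebook. It presents a lazy dict-like view over the pseudo-store, decoding each per-chunk JSON file into the form the reference filesystem expects (an inline string, or a `[url, offset, length]` list) only when that key is actually accessed.

```python
import json
from collections.abc import Mapping


class ReferencesPreprocessor(Mapping):
    """Lazily adapt a key/value store whose values are per-chunk JSON
    reference entries into the mapping shape fsspec's reference
    filesystem expects (hypothetical sketch)."""

    def __init__(self, store):
        # store: any dict-like mapper, e.g. fsspec's get_mapper() result
        self._store = store

    def __getitem__(self, key):
        raw = self._store[key]
        if isinstance(raw, bytes):
            raw = raw.decode()
        # Each pseudo-store file holds one JSON value: either inline
        # data (a string) or a [url, offset, length] byte range.
        return json.loads(raw)

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)
```

Because lookups go through `__getitem__` on demand, the full set of references never has to be parsed into memory at once, which is the whole point for million-file datasets.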
I have not tested this extensively yet, so I do not know how robust it is in a production setting, but I have gotten very promising results so far. I would like to hear any thoughts and comments from others. If this proves useful, I think it would be great if some of these changes could make their way upstream into fsspec and/or kerchunk, since some aspects of the trick (particularly the preprocessor) feel kind of hacky to implement.