Combine multiple grib messages into one file for reading in xarray

Hi,

One issue with the xarray open_mfdataset function is that it’s slow when loading a large number of grib files. Files can also contain additional grib messages that are not required. I’m aware that I could use kerchunk to scan the grib files and create a json representation of a zarr array, allowing one to combine multiple files/messages with filtering. However, I already have a large number of grib files scanned with the following information:

  • file
  • message_start_bytes
  • message_length_bytes

along with additional attributes that allow me to filter by variable, etc.

My thinking is that this information should be sufficient to create a fsspec ReferenceFileSystem so that I have one file with many grib messages instead of many files. I have a simple example for a local file system below that fails:

import xarray as xr
from fsspec.implementations.reference import ReferenceFileSystem

fs = ReferenceFileSystem(
    {
        "key1": [
            "file1.grib2",
            0,
            1000,
        ],
        "key2": [
            "file2.grib2",
            1001,
            1000,
        ],
    }
)

m = fs.get_mapper("")
ds = xr.open_mfdataset(m, engine="cfgrib", indexpath="")

Any thoughts on what I might be doing wrong here are welcome. Alternatively, please do let me know if you think there’s a better approach to combining the filename and grib message byte range information into a single file.

> I’m aware that I could use kerchunk to scan the grib files and create a json representation of a zarr array allowing one to combine multiple files/messages with filtering.

> My thinking is that this information should be sufficient to create a fsspec ReferenceFileSystem

If I understand correctly, you’re right. You can indeed construct an fsspec ReferenceFileSystem without using Kerchunk, as long as the references follow the spec.
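
As a rough sketch of what that could look like with the scan records described in the question: the snippet below builds a version 1 reference set as a plain dict, one key per grib message, keeping only the wanted variable. The record layout, the "variable" attribute, the key naming scheme and the helper name build_references are all assumptions made for illustration; none of them come from fsspec or kerchunk.

```python
import json

# Hypothetical scan records, one per grib message, as described in the question.
# The "variable" field stands in for whatever filter attributes were scanned.
records = [
    {"file": "file1.grib2", "message_start_bytes": 0,
     "message_length_bytes": 1000, "variable": "t2m"},
    {"file": "file1.grib2", "message_start_bytes": 1000,
     "message_length_bytes": 800, "variable": "u10"},
    {"file": "file2.grib2", "message_start_bytes": 1001,
     "message_length_bytes": 1000, "variable": "t2m"},
]


def build_references(records, variable):
    """Return a version 1 reference dict: key -> [url, offset, length]."""
    refs = {}
    selected = (r for r in records if r["variable"] == variable)
    for i, rec in enumerate(selected):
        # Key naming is arbitrary here; each key maps to one raw grib message.
        key = f"{variable}/{i}"
        refs[key] = [
            rec["file"],
            rec["message_start_bytes"],
            rec["message_length_bytes"],
        ]
    return {"version": 1, "refs": refs}


references = build_references(records, "t2m")
print(json.dumps(references, indent=2))
```

A dict like this can be passed as the first argument of ReferenceFileSystem. Each key then resolves to the raw bytes of one grib message rather than to a decoded array chunk, so whatever reads it still has to decode grib.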

In TomAugspurger/cogrib (on GitHub) I tried to do something similar, building the references from the .index files that some providers publish (which I think give the byte ranges for each variable), but I didn’t get too far with that because the format didn’t seem to be standardized.
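
For context, some of these sidecar index files follow a wgrib2-style inventory, where each line holds colon-separated fields beginning with the message number and its starting byte offset; a message's length then has to be inferred from the next message's offset. Below is a minimal sketch of deriving byte ranges under that assumption. The sample lines are made up, and the field positions and the helper name parse_index are illustrative; other providers may use a different layout.

```python
def parse_index(text, file_size=None):
    """Parse wgrib2-style inventory lines into byte-range records.

    Assumed line layout: number:start_byte:date:variable:level:forecast:
    Lines are assumed to be sorted by start_byte.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(":")
        rows.append(
            {"start": int(fields[1]), "variable": fields[3], "level": fields[4]}
        )

    # The inventory gives start offsets only, so each length is the gap
    # to the next message's start.
    for row, nxt in zip(rows, rows[1:]):
        row["length"] = nxt["start"] - row["start"]

    # The last message has no successor; its length needs the file size.
    if rows:
        last = rows[-1]
        last["length"] = None if file_size is None else file_size - last["start"]
    return rows


# Made-up sample inventory for illustration only.
sample = """1:0:d=2023010100:TMP:2 m above ground:anl:
2:1000:d=2023010100:UGRD:10 m above ground:anl:
3:1800:d=2023010100:VGRD:10 m above ground:anl:
"""

rows = parse_index(sample, file_size=2500)
for row in rows:
    print(row)
```

The start and length values recovered this way map directly onto the [file, offset, length] triples used in a reference set.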

Thanks Tom, both these suggestions have been very useful in helping me understand how I should approach the problem.