Combine multiple grib messages into one file for reading in xarray

james · May 17, 2023, 9:00am

Hi,

One issue with the xarray open_mfdataset function is that it’s slow when loading a large number of grib files. Files can also contain additional grib messages that are not required. I’m aware that I could use kerchunk to scan the grib files and create a json representation of a zarr array allowing one to combine multiple files/messages with filtering. However, I already have a large number of grib files scanned, with the following information recorded for each message:

file
message_start_bytes
message_length_bytes

along with additional attributes that allow me to filter by variable etc.

My thinking is that this information should be sufficient to create a fsspec ReferenceFileSystem so that I have one file with many grib messages instead of many files. I have a simple example for a local file system below that fails:

import xarray as xr
from fsspec.implementations.reference import ReferenceFileSystem

fs = ReferenceFileSystem(
    {
        "key1": [
            "file1.grib2",
            0,
            1000,
        ],
        "key2": [
            "file2.grib2",
            1001,
            1000,
        ],
    }
)

m = fs.get_mapper("")
ds = xr.open_mfdataset(m, engine="cfgrib", indexpath="")

Any thoughts on what I might be doing wrong here are welcome. Alternatively, please do let me know if you think there’s a better approach to combining the filename and grib message byte range information into a single file.

TomAugspurger · May 18, 2023, 6:19pm

I’m aware that I could use kerchunk to scan the grib files and create a json representation of a zarr array allowing one to combine multiple files/messages with filtering.
…
My thinking is that this information should be sufficient to create a fsspec ReferenceFileSystem

If I understand correctly, you’re right. You can indeed construct an fsspec ReferenceFileSystem without using Kerchunk, as long as the references follow the spec.
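
As a rough illustration of what "following the spec" can look like, here is a minimal sketch that turns scanned records (file, message_start_bytes, message_length_bytes) into a version 1 reference set, where each key maps to a [url, offset, length] triple. The `build_references` helper, the `message_{i}` key names and the `variable` attribute are hypothetical and only there for illustration.

```python
import json


def build_references(records, key_template="message_{i}"):
    """Map each scanned GRIB message to a [url, offset, length] reference.

    `records` is an iterable of dicts with the keys `file`,
    `message_start_bytes` and `message_length_bytes`.
    The key naming scheme is an assumption, not part of the spec.
    """
    refs = {}
    for i, rec in enumerate(records):
        refs[key_template.format(i=i)] = [
            rec["file"],
            int(rec["message_start_bytes"]),
            int(rec["message_length_bytes"]),
        ]
    # Version 1 of the reference spec wraps the mapping under "refs".
    return {"version": 1, "refs": refs}


# Hypothetical scan results; `variable` stands in for the extra attributes.
records = [
    {"file": "file1.grib2", "message_start_bytes": 0,
     "message_length_bytes": 1000, "variable": "t2m"},
    {"file": "file2.grib2", "message_start_bytes": 1000,
     "message_length_bytes": 1000, "variable": "u10"},
]

# Filter before building, so unwanted messages never get a reference.
wanted = [r for r in records if r["variable"] == "t2m"]
references = build_references(wanted)
print(json.dumps(references, indent=2))
```

Passing a dictionary like this as `fo` to `fsspec.filesystem("reference", fo=references, remote_protocol="file")` should make `fs.cat("message_0")` return the raw bytes of that message. Note that this only exposes byte ranges: as far as I know, opening the result as a dataset through xarray's zarr engine also needs Zarr metadata keys and a codec that decodes GRIB, which is the part Kerchunk's GRIB scanner generates.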

In TomAugspurger/cogrib (on GitHub) I tried to do something similar, building the references from the .index files that some providers publish alongside their data (which I think give the byte ranges for each variable), but I didn’t get far because that format didn’t seem to be standardized.
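
To make the index-file idea concrete, here is a rough sketch that assumes the colon-separated inventory layout written by wgrib2 (message number, start byte, date, variable, level, forecast), which is only one of the formats in use. Those lines record start offsets only, so each length is taken as the distance to the next message, and the last one needs the file size. The `parse_idx` helper and the sample lines are hypothetical, and submessages that share an offset are not handled.

```python
def parse_idx(text, file_size=None):
    """Parse wgrib2-style inventory lines into byte-range records.

    Assumed line layout: number:start_byte:d=date:variable:level:forecast:
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split(":")
        rows.append({
            "message": parts[0],
            "start": int(parts[1]),
            "variable": parts[3],
            "level": parts[4],
        })

    records = []
    for cur, nxt in zip(rows, rows[1:] + [None]):
        # A message ends where the next one starts; the last message
        # ends at the end of the file, if the size is known.
        end = nxt["start"] if nxt is not None else file_size
        length = None if end is None else end - cur["start"]
        records.append({**cur, "length": length})
    return records


# Made-up inventory lines for illustration.
sample = """\
1:0:d=2023051700:TMP:2 m above ground:anl:
2:1000:d=2023051700:UGRD:10 m above ground:anl:
3:2500:d=2023051700:VGRD:10 m above ground:anl:
"""

records = parse_idx(sample, file_size=4000)
for rec in records:
    print(rec["variable"], rec["start"], rec["length"])
```

If the file size is not supplied, the last record's length is left as `None` rather than guessed.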

james · May 24, 2023, 11:47am

Thanks Tom, both these suggestions have been very useful in helping me understand how I should approach the problem.
