Spatially un-chunked grib2 use case: can I do something with/before Kerchunk?

Hi all,

Here’s the reference notebook for the below.

I have a set of 28 grib2 files for a single time step in a model run of the High Resolution Deterministic Prediction System, available as public data here. When I get to the combined kerchunked zarr stage, I get 28 pressure levels, but each level is a single chunk, i.e. it is not chunked along lat/lon.

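import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr

# tmp_list is assumed to hold the per-file kerchunk reference sets built earlier in the notebook
mzz_tmp = MultiZarrToZarr(tmp_list,
                        concat_dims=['isobaricInhPa'],
                        identical_dims=['latitude', 'longitude', 'valid_time', 'step'])
d_tmp = mzz_tmp.translate()
# Expose the combined references as a filesystem whose chunks are fetched over HTTPS
fs_tmp = fsspec.filesystem("reference", fo=d_tmp, remote_protocol='https')
m_tmp = fs_tmp.get_mapper("")
ds_tmp = xr.open_dataset(m_tmp, engine="zarr", backend_kwargs=dict(consolidated=False), chunks={})
ds_tmp  # See image below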

So if I want to get the t variable at each pressure level for a specific location I don’t get much (any?) access speed benefit from kerchunking the files compared to opening them one by one.

Questions:

  • Do I get this right?
  • Is there anything I can do about that? I would guess lat/lon have specific dtypes/byte sizes that would allow one to chunk them in reference files through byte offsets?

Thanks for any pointers,

Yves

When you kerchunk-index an original data file, each data portion in that file becomes one “chunk”. With the sole exception of completely uncompressed data, there is no way for kerchunk to divide these inherent chunks. grib2 uses a complex encoding to make the on-disk data size as small as possible, so I don’t think that any algorithm could access the data without reading a whole chunk.
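
To make the "one data portion, one chunk" point concrete, here is a minimal sketch of what a reference set for this kind of dataset might look like, and of how many bytes a single-point read ends up fetching. The URLs, byte lengths, grid shape and helper names are all invented for illustration; only the general layout (a chunk key mapped to a `[url, offset, length]` triple) follows kerchunk's reference format.

```python
# Hypothetical reference set: one grib2 message per pressure level, so one
# zarr chunk per level. All URLs, offsets and lengths below are made up.
GRID_SHAPE = (1000, 2000)  # (y, x) points per level, assumed for illustration

refs = {
    "version": 1,
    "refs": {
        # chunk key "<var>/<level>.<y>.<x>" -> [url, byte offset, byte length]
        "t/0.0.0": ["https://example.com/level_1000.grib2", 0, 1_500_000],
        "t/1.0.0": ["https://example.com/level_0950.grib2", 0, 1_480_000],
        "t/2.0.0": ["https://example.com/level_0900.grib2", 0, 1_460_000],
    },
}


def chunk_key(var, level, y, x, chunk_shape):
    """Return the key of the chunk holding grid point (y, x) at one level."""
    cy, cx = chunk_shape
    return f"{var}/{level}.{y // cy}.{x // cx}"


def bytes_for_point(refs, var, y, x, n_levels, chunk_shape):
    """Total bytes fetched to read one grid point at every level."""
    total = 0
    for level in range(n_levels):
        key = chunk_key(var, level, y, x, chunk_shape)
        _url, _offset, length = refs["refs"][key]
        total += length
    return total


# Each chunk spans the whole grid, so any point lands in chunk 0.0 of its
# level and the whole message has to be fetched and decoded.
point_cost = bytes_for_point(refs, "t", 512, 1024, 3, GRID_SHAPE)
whole_cost = sum(length for _, _, length in refs["refs"].values())
print(point_cost, whole_cost)
```

In this sketch the cost of reading one point equals the cost of reading every level in full, which is the situation described in the question.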

kerchunk for grib2 is particularly good at accessing the “grib messages” within a file separately. It is pretty common for many messages to be concatenated into one file. It seems that the files here have one “message” per input file.

In short, this is the best you can do with kerchunk here. So what does it get you? It allows you to have a single logical zarr handle to the whole dataset instead of having xarray do a runtime merge every time you open it; you don’t need to download it all if you only access some of the chunks. It also allows for concurrent read of multiple chunks, which can be a big accelerator in some cases. However, reading one chunk or part of one chunk will involve reading just as many bytes as if you had downloaded the file in question.
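
The concurrency benefit can be sketched with the standard library alone. Below, a local temporary file stands in for the remote grib2 data, the 28 fake chunks are placeholders, and `fetch_range` is a hypothetical helper playing the role of an HTTP range request; in real use the fsspec/zarr stack would issue the requests rather than hand-written code like this.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch_range(path, offset, length):
    """Read `length` bytes starting at `offset` (stand-in for an HTTP range GET)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Build a stand-in "remote" file: 28 fake chunks of 1000 bytes each.
n_chunks, chunk_len = 28, 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(n_chunks):
        tmp.write(bytes([i]) * chunk_len)
    path = tmp.name

# One (offset, length) reference per chunk, as a reference set would hold.
ranges = [(i * chunk_len, chunk_len) for i in range(n_chunks)]

# Issue all reads at once; results come back in the same order as `ranges`.
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(lambda r: fetch_range(path, *r), ranges))

os.remove(path)
print(len(chunks), sum(len(c) for c in chunks))
```

The total number of bytes read is unchanged; what can improve is wall-clock time, because the requests overlap instead of running one after another.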

Indeed. I definitely see the benefit of a logical zarr handle.

Does anybody know of examples of spatially chunked grib2 files that would perform well, e.g. when slicing across all forecast hours in a model run? Would a spatial chunking be visible as another message in the grib?

Thank you for this great project!

I don’t think that messages can refer to each other. Actually the best way to do it might be to have independent grib tiles and have kerchunk be the thing to tie them together.
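
A rough sketch of what "independent grib tiles tied together" could mean at the reference level. Everything here is hypothetical: the tile file naming, the 2 x 2 tiling, the grid size, the byte counts and the `tiled_refs` helper are assumptions made for illustration, and the code only builds the chunk-key layout such a combined reference set would need.

```python
from itertools import product

GRID_SHAPE = (1000, 2000)   # full (y, x) grid, assumed
TILE_SHAPE = (500, 1000)    # each tile stored as its own grib2 message, assumed


def tiled_refs(var, n_levels, grid_shape, tile_shape, url_template, tile_bytes):
    """Build chunk-key -> [url, offset, length] references for tiled messages."""
    ny = -(-grid_shape[0] // tile_shape[0])  # ceiling division
    nx = -(-grid_shape[1] // tile_shape[1])
    refs = {}
    for level, ty, tx in product(range(n_levels), range(ny), range(nx)):
        url = url_template.format(level=level, ty=ty, tx=tx)
        refs[f"{var}/{level}.{ty}.{tx}"] = [url, 0, tile_bytes]
    return refs


refs = tiled_refs(
    "t", 28, GRID_SHAPE, TILE_SHAPE,
    "https://example.com/t_L{level}_tile{ty}{tx}.grib2",  # made-up naming
    tile_bytes=400_000,                                    # made-up size
)

# A single-point read now touches one tile per level instead of the full grid.
y, x = 512, 1024
keys = [f"t/{lev}.{y // TILE_SHAPE[0]}.{x // TILE_SHAPE[1]}" for lev in range(28)]
point_cost = sum(refs[k][2] for k in keys)
print(len(refs), point_cost)
```

With tiles as the unit of storage, each chunk key carries non-zero spatial indices, so a point query reads one tile per level rather than the full field.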

As you say, maybe others have thought about this too.

One can cat gribs in append mode to create bigger gribs. I would think this results in a new message in that bigger grib, which kerchunk could process?

Yes, grib files are cat-ed messages, and kerchunk can process them. I don’t think it’s normal to have the messages have different coordinates, but I don’t see why not.
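
Because a grib2 file is just messages written back to back, per-message byte ranges can be found by walking the file. The sketch below relies on the GRIB edition 2 indicator section (the `GRIB` marker, the edition number in octet 8, and the total message length in octets 9 to 16) and on the closing `7777`. The messages it builds are placeholders rather than decodable grib, `fake_message` and `message_ranges` are hypothetical helpers, it assumes there is no padding between messages, and kerchunk's own scanner may work differently.

```python
import struct


def fake_message(payload):
    """Build a placeholder with a GRIB2-style indicator section and end marker."""
    total_len = 16 + len(payload) + 4
    # 'GRIB' + 2 reserved bytes + discipline + edition (2) + 8-byte total length
    header = b"GRIB" + b"\x00\x00" + bytes([0, 2]) + struct.pack(">Q", total_len)
    return header + payload + b"7777"


def message_ranges(data):
    """Return (offset, length) for each edition-2 message in `data`."""
    ranges, pos = [], 0
    while pos + 16 <= len(data):
        if data[pos:pos + 4] != b"GRIB" or data[pos + 7] != 2:
            raise ValueError(f"no GRIB2 message at byte {pos}")
        (length,) = struct.unpack(">Q", data[pos + 8:pos + 16])
        if data[pos + length - 4:pos + length] != b"7777":
            raise ValueError(f"message at byte {pos} has no end marker")
        ranges.append((pos, length))
        pos += length  # jump straight to the next message
    return ranges


# Equivalent of `cat a.grib2 b.grib2 c.grib2 > all.grib2`, with placeholders.
combined = fake_message(b"a" * 10) + fake_message(b"b" * 25) + fake_message(b"c" * 5)
print(message_ranges(combined))
```

Each `(offset, length)` pair is the kind of byte range a reference entry stores, so a cat-ed file yields one addressable chunk per message.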