Using kerchunk's MultiZarrToZarr on a dataset with multiple vertical coordinates

Hello community,
I have a dataset of historical runs of the ICON-D2 configuration of the ICON model from German Weather Forecasting Agency (DWD).
The dataset has dimensions (atmospheric quantity, latitude,longitude, vertical level, forecast init time, forecast lead time).
The dataset is stored in grib2 files. Each grib2 file contains data for the whole latlon grid, but only one quantity, vertical level, init time, and lead time ([https://opendata.dwd.de/weather/nwp/icon-d2/grib/] it is stored in the same fashion as DWD provides it on their open data site).

I have used kerchunk’s scan_grib to generate a zarr reference for each file in the dataset. Now I want to combine all of those json references into one zarr parquet reference. However, different atmospheric quantities have different dimensions for their vertical level.
e.g. the quantity short-named tke (turbulent kinetic energy) has generalVertical as its vertical level type:

grib_ls icon-d2_germany_regular-lat-lon_model-level_2024060312_001_25_tke.grib2 
icon-d2_germany_regular-lat-lon_model-level_2024060312_001_25_tke.grib2
edition      centre       date         dataType     gridType     stepRange    typeOfLevel  level        shortName    packingType  
2            edzw         20240603     fc           regular_ll   1            generalVertical  25           tke          grid_simple 
1 of 1 messages in icon-d2_germany_regular-lat-lon_model-level_2024060312_001_25_tke.grib2

1 of 1 total messages in 1 files

whereas the quantity u for zonal wind has generalVerticalLayer as its vertical level type:

grib_ls icon-d2_germany_regular-lat-lon_model-level_2024060312_005_25_u.grib2 
icon-d2_germany_regular-lat-lon_model-level_2024060312_005_25_u.grib2
edition      centre       date         dataType     gridType     stepRange    typeOfLevel  level        shortName    packingType  
2            edzw         20240603     fc           regular_ll   5            generalVerticalLayer  25           u            grid_simple 
1 of 1 messages in icon-d2_germany_regular-lat-lon_model-level_2024060312_005_25_u.grib2

1 of 1 total messages in 1 files

So when I load my jsons generated by scan_grib and try to combine them using MultiZarrToZarr the following will obviously only work for quantities that have “generalVerticalLayer” as their vertical dimension (time is the forecast init time, step is the forecast lead time).

fs = fsspec.filesystem("file")
out = LazyReferenceMapper.create(record_size=1000, root=output_dir, fs=fs)
all_jsons = glob.glob(str(input_root/ "*.json"))
out_dict = combine.MultiZarrToZarr(
        all_jsons,
        concat_dims=["time", "generalVerticalLayer", "step"],
        identical_dims=["latitude", "longitude"],
        out=out,
    )

However, what I would ideally like is to get a combined reference that contains both quantities that are only defined on the surface, quantities that have “generalVertical” as their vertical dimension, and quantities that have “generalVerticalLayer” as their vertical dimension. I do not know how to get this done using MultiZarrToZarr.
Right now my workaround is to have separate reference archives for each possible vertical coordinate (there are about 10 different verticals in the dataset) and only combine them when loading them into xarray.

My versions are:

  • kerchunk: 0.2.5
  • zarr: 2.18.7
  • fsspec: 2025.3.2

Thanks for taking your time to read my question, any input welcome.
Best,
Christian

This is one of the reasons I wrote VirtualiZarr. (See my original issue here - I had exactly the same problem as you do - it’s not really possible to express what you want succinctly using the MultiZarrToZarr API.) Using VirtualiZarr combining these references is easier (as easy as doing it with xarray).

However we don’t yet have grib support in VirtualiZarr. However you might be able to get around this by using scan_grib and then calling vz.open_virtual_dataset on the resultant JSON like this.

2 Likes