Hello community,
I have a dataset of historical runs of the ICON-D2 configuration of the ICON model from German Weather Forecasting Agency (DWD).
The dataset has dimensions (atmospheric quantity, latitude,longitude, vertical level, forecast init time, forecast lead time).
The dataset is stored in grib2 files. Each grib2 file contains data for the whole latlon grid, but only one quantity, vertical level, init time, and lead time ([https://opendata.dwd.de/weather/nwp/icon-d2/grib/] it is stored in the same fashion as DWD provides it on their open data site).
I have used kerchunk’s scan_grib
to generate a zarr reference for each file in the dataset. Now I want to combine all of those json references into one zarr parquet reference. However, different atmospheric quantities have different dimensions for their vertical level.
e.g. the quantity short-named tke
(turbulent kinetic energy) has generalVertical
as its vertical level type:
grib_ls icon-d2_germany_regular-lat-lon_model-level_2024060312_001_25_tke.grib2
icon-d2_germany_regular-lat-lon_model-level_2024060312_001_25_tke.grib2
edition centre date dataType gridType stepRange typeOfLevel level shortName packingType
2 edzw 20240603 fc regular_ll 1 generalVertical 25 tke grid_simple
1 of 1 messages in icon-d2_germany_regular-lat-lon_model-level_2024060312_001_25_tke.grib2
1 of 1 total messages in 1 files
whereas the quantity u
for zonal wind has generalVerticalLayer
as its vertical level type:
grib_ls icon-d2_germany_regular-lat-lon_model-level_2024060312_005_25_u.grib2
icon-d2_germany_regular-lat-lon_model-level_2024060312_005_25_u.grib2
edition centre date dataType gridType stepRange typeOfLevel level shortName packingType
2 edzw 20240603 fc regular_ll 5 generalVerticalLayer 25 u grid_simple
1 of 1 messages in icon-d2_germany_regular-lat-lon_model-level_2024060312_005_25_u.grib2
1 of 1 total messages in 1 files
So when I load my jsons generated by scan_grib
and try to combine them using MultiZarrToZarr the following will obviously only work for quantities that have “generalVerticalLayer” as their vertical dimension (time is the forecast init time, step is the forecast lead time).
fs = fsspec.filesystem("file")
out = LazyReferenceMapper.create(record_size=1000, root=output_dir, fs=fs)
all_jsons = glob.glob(str(input_root/ "*.json"))
out_dict = combine.MultiZarrToZarr(
all_jsons,
concat_dims=["time", "generalVerticalLayer", "step"],
identical_dims=["latitude", "longitude"],
out=out,
)
However, what I would ideally like is to get a combined reference that contains both quantities that are only defined on the surface, quantities that have “generalVertical” as their vertical dimension, and quantities that have “generalVerticalLayer” as their vertical dimension. I do not know how to get this done using MultiZarrToZarr.
Right now my workaround is to have separate reference archives for each possible vertical coordinate (there are about 10 different verticals in the dataset) and only combine them when loading them into xarray.
My versions are:
- kerchunk: 0.2.5
- zarr: 2.18.7
- fsspec: 2025.3.2
Thanks for taking your time to read my question, any input welcome.
Best,
Christian