Issue accessing cloud GFS data using kerchunk


I recently started developing a test case for generating Regional Ocean Modeling System (ROMS) meteorological forcing files from cloud-based datasets, specifically HRRR and GFS. My approach to the S3 buckets is based on Rich Signell and Peter Marsh’s work: use kerchunk to generate Zarr reference JSONs, then access the GRIB files through the zarr engine. This worked well for the HRRR dataset but fails for GFS.

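Example code:

import datetime as dt

import dask
import fsspec
import ujson
from kerchunk.combine import MultiZarrToZarr
from kerchunk.grib2 import scan_grib

today = dt.datetime.utcnow().strftime('%Y%m%d')
json_dir = './jsons/'
fs = fsspec.filesystem('s3', anon=True, skip_instance_cache=True)
urls = ['s3://' + f for f in fs.glob(f's3://noaa-gfs-bdp-pds/gfs.{today}/00/atmos/gfs.t00z.pgrb2.0p25.f0*')]
urls = [f for f in urls if not f.endswith('.idx')]

# Keep only the messages at 2 m height above ground
afilter = {'typeOfLevel': 'heightAboveGround', 'level': 2}
so = {"anon": True}

def gen_json_grib(u):
    name = u.split('/')[-1]
    outfname = f'{json_dir}{name}.json'
    out = scan_grib(u, common=None, storage_options=so, inline_threshold=200, filter=afilter)
    # Write the scan_grib output for this file to JSON
    with open(outfname, "wb") as f:
        f.write(ujson.dumps(out).encode())

dask.compute([dask.delayed(gen_json_grib)(u) for u in urls])

# Paths of the JSON files written above
jsonfiles = [f'{json_dir}{u.split("/")[-1]}.json' for u in urls]
mzz = MultiZarrToZarr(jsonfiles, concat_dims=['time'],
                      remote_options={'anon': True})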

The script fails on the last line.

It looks like the problem is that the ujson.load() call on the Zarr reference JSON returns a list instead of a dictionary, which then fails when it is used in fsspec. Could this be due to a detail of the GFS GRIB files? Has anyone else run into this? Is there a sensible way to restructure the JSON as a dictionary, or am I missing something completely?


It is expected that scanning each GRIB2 file returns a list of reference sets; this reflects how the native file format is structured.
If you only have one output in each list, you can simply replace
f.write(ujson.dumps(out).encode()) with f.write(ujson.dumps(out[0]).encode()). Otherwise, you will need to write a separate file for each piece.
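
The "separate file for each piece" option could look roughly like the sketch below. It assumes `out` is the list of reference sets (plain dictionaries) returned for one GRIB2 file. The helper name `write_reference_sets` and the `_00`, `_01`, ... filename suffix are hypothetical choices, and the standard-library `json` module stands in for `ujson` so the snippet has no extra dependencies.

```python
import json
import os


def write_reference_sets(out, json_dir, name):
    """Write each reference set in `out` to its own JSON file.

    `out` is assumed to be a list of dicts (one per reference set).
    Returns the list of file paths written, in order.
    """
    os.makedirs(json_dir, exist_ok=True)
    paths = []
    for i, refs in enumerate(out):
        # Hypothetical naming scheme: <name>_00.json, <name>_01.json, ...
        path = os.path.join(json_dir, f"{name}_{i:02d}.json")
        with open(path, "w") as f:
            json.dump(refs, f)  # each file now holds a single dictionary
        paths.append(path)
    return paths
```

Each file written this way contains one dictionary at the top level, so loading it gives a dict rather than a list. The returned paths can be collected across all GRIB files to build the list of JSON files for the combine step.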

Alternatively, if you have enough memory, you can return the lists of reference sets from the function (return scan_grib(...)) instead of writing them to JSON, and concatenate the lists before handing them to MultiZarrToZarr.
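
The concatenation step in the in-memory variant might be sketched as follows. It assumes each call to the scanning function returned a list of reference-set dictionaries, so the overall result is a list (or tuple) of lists. `flatten_reference_sets` is a hypothetical helper name.

```python
from itertools import chain


def flatten_reference_sets(per_file_results):
    """Concatenate per-file lists of reference sets into one flat list.

    `per_file_results` is assumed to be an iterable of lists, one list
    per GRIB2 file, with the original file order preserved.
    """
    return list(chain.from_iterable(per_file_results))
```

For example, if `results` holds the per-file lists gathered from the dask tasks, `flatten_reference_sets(results)` gives a single list of dictionaries that can be passed to MultiZarrToZarr in place of the list of JSON filenames. Depending on the kerchunk version, in-memory references may also need the remote protocol to be stated explicitly, so check the MultiZarrToZarr documentation for the exact arguments.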


I think I see where I’ve gone wrong.