Hello,
I recently started developing a test case for generating Regional Ocean Modeling System (ROMS) meteorological forcing files from cloud-based datasets, specifically HRRR and GFS. My approach to the S3 buckets is based on Rich Signell and Peter Marsh's work using kerchunk to generate reference JSONs and then reading the GRIB files with xarray's zarr engine. This works well for the HRRR dataset but fails for GFS.
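For context, once the combined reference JSON exists, I read it through fsspec's reference filesystem with xarray's zarr engine, roughly like this (file names here are just placeholders); this step works fine for HRRR:

```python
import xarray as xr

# Open a kerchunk-combined reference JSON via fsspec's "reference://" filesystem
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "tmp.json",                # combined reference JSON (illustrative name)
            "remote_protocol": "s3",         # the original GRIB files live on S3
            "remote_options": {"anon": True},
        },
    },
)
```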
Here is the GFS code that fails:

```python
import datetime as dt
from glob import glob

import dask
import fsspec
import ujson
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr

today = dt.datetime.utcnow().strftime('%Y%m%d')
json_dir = './jsons/'

# List today's 00z GFS 0.25-degree pgrb2 files, dropping the .idx sidecars
fs = fsspec.filesystem('s3', anon=True, skip_instance_cache=True)
urls = ['s3://' + f for f in fs.glob(f's3://noaa-gfs-bdp-pds/gfs.{today}/00/atmos/gfs.t00z.pgrb2.0p25.f0*')]
urls = [f for f in urls if not f.endswith('.idx')]
urls = urls[0:2]

afilter = {'typeOfLevel': 'heightAboveGround', 'level': 2}
so = {'anon': True}

def gen_json_grib(u):
    # Scan one GRIB2 file and write its kerchunk references to a local JSON
    name = u.split('/')[-1]
    outfname = f'{json_dir}{name}.json'
    out = scan_grib(u, common=None, storage_options=so, inline_threshold=200, filter=afilter)
    with open(outfname, 'wb') as f:
        f.write(ujson.dumps(out).encode())

dask.compute([dask.delayed(gen_json_grib)(u) for u in urls])

jsonfiles = sorted(glob(json_dir + 'gfs.t00z.pgrb2*.json'))
mzz = MultiZarrToZarr(jsonfiles, concat_dims=['time'],
                      remote_protocol='s3',
                      remote_options={'anon': True})
mzz.translate('tmp.json')
```
The script fails on the last line, mzz.translate('tmp.json').
It looks like the problem is that the ujson.load() call on the reference JSON returns a list instead of a dictionary, which then fails when used in fsspec. Maybe this is due to some detail of the GFS GRIB files? Has anyone else run into this? Is there a sensible way to restructure the JSON as a dictionary, or am I missing something completely?
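One idea I've been considering, but haven't verified: if scan_grib is returning a list of per-message reference dicts for these files, maybe I should write each element to its own JSON so that every file MultiZarrToZarr reads is a single dict, something like this (untested sketch, reusing the names from the code above):

```python
# Untested sketch: write each element of the scan_grib output to its own JSON,
# so that every reference file is a dict rather than a list.
def gen_json_grib(u):
    name = u.split('/')[-1]
    out = scan_grib(u, common=None, storage_options=so, inline_threshold=200, filter=afilter)
    for i, msg in enumerate(out):   # assuming one reference dict per GRIB message group
        with open(f'{json_dir}{name}_{i}.json', 'wb') as f:
            f.write(ujson.dumps(msg).encode())
```

Would something along those lines be reasonable, or is there a cleaner way?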
Thanks,
Eli