Hello,
I recently started developing a test case for generating Regional Ocean Modeling System (ROMS) meteorological forcing files from cloud-based datasets, specifically HRRR and GFS. My approach to the S3 buckets is based on Rich Signell and Peter Marsh's work using kerchunk to generate reference JSONs and then reading the GRIB files with xarray's zarr engine. This works well for the HRRR dataset but fails for GFS.
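For context, once the combined reference JSON exists, I read it through fsspec's reference filesystem with xarray's zarr engine, roughly like this (file names here are just placeholders); this step works fine for HRRR:

```python
import xarray as xr

# Open a kerchunk-combined reference JSON via fsspec's "reference://" filesystem
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "tmp.json",                # combined reference JSON (illustrative name)
            "remote_protocol": "s3",         # the original GRIB files live on S3
            "remote_options": {"anon": True},
        },
    },
)
```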
Here is the GFS code that fails:

```python
import datetime as dt
from glob import glob

import dask
import fsspec
import ujson
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr

today = dt.datetime.utcnow().strftime('%Y%m%d')
json_dir = './jsons/'

# List today's 00z GFS 0.25-degree pgrb2 files, dropping the .idx sidecars
fs = fsspec.filesystem('s3', anon=True, skip_instance_cache=True)
urls = ['s3://' + f for f in fs.glob(f's3://noaa-gfs-bdp-pds/gfs.{today}/00/atmos/gfs.t00z.pgrb2.0p25.f0*')]
urls = [f for f in urls if not f.endswith('.idx')]
urls = urls[0:2]

afilter = {'typeOfLevel': 'heightAboveGround', 'level': 2}
so = {'anon': True}

def gen_json_grib(u):
    # Scan one GRIB2 file and write its kerchunk references to a local JSON
    name = u.split('/')[-1]
    outfname = f'{json_dir}{name}.json'
    out = scan_grib(u, common=None, storage_options=so, inline_threshold=200, filter=afilter)
    with open(outfname, 'wb') as f:
        f.write(ujson.dumps(out).encode())

dask.compute([dask.delayed(gen_json_grib)(u) for u in urls])

jsonfiles = sorted(glob(json_dir + 'gfs.t00z.pgrb2*.json'))
mzz = MultiZarrToZarr(jsonfiles, concat_dims=['time'],
                      remote_protocol='s3',
                      remote_options={'anon': True})
mzz.translate('tmp.json')
```
The script fails on the last line, mzz.translate('tmp.json').
It looks like the problem is that the ujson.load() call on the reference JSON returns a list instead of a dictionary, which then fails when used in fsspec. Maybe this is due to some detail of the GFS GRIB files? Has anyone else run into this? Is there a sensible way to restructure the JSON as a dictionary, or am I missing something completely?
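One idea I've been considering, but haven't verified: if scan_grib is returning a list of per-message reference dicts for these files, maybe I should write each element to its own JSON so that every file MultiZarrToZarr reads is a single dict, something like this (untested sketch, reusing the names from the code above):

```python
# Untested sketch: write each element of the scan_grib output to its own JSON,
# so that every reference file is a dict rather than a list.
def gen_json_grib(u):
    name = u.split('/')[-1]
    out = scan_grib(u, common=None, storage_options=so, inline_threshold=200, filter=afilter)
    for i, msg in enumerate(out):   # assuming one reference dict per GRIB message group
        with open(f'{json_dir}{name}_{i}.json', 'wb') as f:
            f.write(ujson.dumps(msg).encode())
```

Would something along those lines be reasonable, or is there a cleaner way?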
Thanks,
Eli