Accessing GRIB2 files as a single cloud-friendly dataset in xarray through kerchunk

chiaral · April 13, 2022, 10:03pm

I successfully loaded the one file following your suggestion, thanks!
I am now using the GCS deployment, which has
fsspec.__version__ '2021.11.1'
but it didn't quite work:

TLDR:
the json file is not created correctly

import fsspec
import ujson
import xarray as xr

# local path to the kerchunk reference file
rpath = 'jsonfiles1/acpcp_sfc_2000010100_c00.json'

s_opts = {'requester_pays': True, 'skip_instance_cache': True}
r_opts = {'anon': True}  # anonymous access to the public S3 bucket

# load the references into a dict
with fsspec.open(rpath) as f:
    references = ujson.loads(f.read())

ds = xr.open_dataset("reference://", engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": dict(fo=references, ref_storage_args=s_opts, remote_protocol="s3",
                                remote_options=r_opts, skip_instance_cache=True)
    }
)
ds

But the dataset is wrong because, I believe, the JSON file is wrong.

Note how step, time, and valid_time have become variables rather than coordinates and, more importantly, only one value was loaded along the step dimension. There should be 80.

This is how the actual file looks if I download it:

!wget https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2

ds1 = xr.open_dataset('acpcp_sfc_2000010100_c00.grib2', engine = 'cfgrib')

When I create the JSON files, I pass step and the other coordinate variables as common_vars, like:

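import json
from kerchunk.grib2 import scan_grib

# `files` is the list of GRIB2 URLs (not shown here)
so = {"anon": True, "default_cache_type": "readahead"}
out = scan_grib(files[0], common_vars=['time', 'step', 'latitude', 'longitude', 'valid_time'], storage_options=so)
# strip the ".grib2" suffix and write the references next to the other JSON files
outfname = 'jsonfiles1/' + files[0].split('/')[-1][:-6] + '.json'
with open(outfname, "w") as f:
    f.write(json.dumps(out))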

However, when I inspect the JSON file, I do indeed have empty chunks and _ARRAY_DIMENSIONS under refs for time and step (see blue squares) compared to latitude (see green squares).

step/0 is also empty, as are other relevant fields.
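
A quick way to see which variables lack chunk entries is to group the keys under refs by variable name. This is only a minimal sketch: the `summarize_refs` helper and the hand-written `example` dict are hypothetical, and it assumes the usual version 1 layout with a top-level "refs" mapping and JSON-encoded .zattrs strings.

```python
import json

def summarize_refs(references):
    """Collect _ARRAY_DIMENSIONS and chunk keys for each variable under "refs"."""
    summary = {}
    for key, value in references["refs"].items():
        if "/" not in key:
            continue  # skip group-level entries such as ".zgroup"
        var, leaf = key.split("/", 1)
        entry = summary.setdefault(var, {"dims": None, "chunks": []})
        if leaf == ".zattrs":
            entry["dims"] = json.loads(value).get("_ARRAY_DIMENSIONS")
        elif not leaf.startswith(".z"):
            entry["chunks"].append(leaf)  # e.g. "0" or "0.0"
    return summary

# Hypothetical reference set that mimics the symptom described above
example = {
    "version": 1,
    "templates": {"u": "s3://some-bucket/some-file.grib2"},
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "latitude/.zarray": '{"chunks": [721], "shape": [721]}',
        "latitude/.zattrs": '{"_ARRAY_DIMENSIONS": ["latitude"]}',
        "latitude/0": ["{{u}}", 0, 325317],
        "step/.zarray": '{"chunks": [], "shape": []}',
        "step/.zattrs": '{"_ARRAY_DIMENSIONS": []}',
    },
}

summary = summarize_refs(example)
missing = sorted(name for name, info in summary.items() if not info["chunks"])
print(missing)  # ['step']
```

Running the same function on the dict loaded from the real JSON file should list every variable that has metadata but no data reference.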

Does this probably have to do with how this GRIB file is defined?

In the example linked above (the HRRR file) I didn't pay attention to this, because it concatenates the files along another dimension, and it looks like the step variable has only one value, so you don't notice the problem.
In fact, I downloaded one HRRR file, and when I loaded it with cfgrib the step variable had only one entry, which is very common for real-time GRIB files.

I thought it was worth mentioning, FYI, and I would like to know if there are ways to create the JSON correctly.
I tried to hack it, but I am missing an entry like
"step/0": "\u0000\u0000\u0000\u0000\u0000\u0000\b@"
whereas latitude has something like:
"latitude/0": ["{{u}}", 0, 325317],

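For reference, an inline value like the "step/0" string quoted above can be decoded by hand. This is a minimal sketch, assuming the string holds the raw bytes of a single little-endian float64 (dtype "<f8"), which is what the matching step/.zarray would need to declare; that metadata is not shown in this post.

```python
import struct

# Inline reference value as it appears in the JSON (after JSON unescaping)
inline_value = "\u0000\u0000\u0000\u0000\u0000\u0000\b@"

# Assumption: each character maps to one byte, so latin-1 recovers the raw bytes
raw = inline_value.encode("latin-1")

# Assumption: the array holds one little-endian float64 ("<f8")
(step_value,) = struct.unpack("<d", raw)

print(len(raw), step_value)  # 8 3.0
```

So that entry stores a single scalar step. A file with 80 steps would need either 80 values in the step array or one reference set per step.
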
I should add that real-time forecast products are usually organized like HRRR, which means they have one step per GRIB file. These are reforecast products (retrospective runs, used to create a reanalysis to calibrate real-time forecasts), which are always a different beast, and the files can be ad hoc. So scan_grib probably doesn't know what to do with the way the step variable is presented in this GRIB file.

thanks for your time!
