My question might sound confused and perhaps not well defined, because I am confused and a bit lost!
I report all the details below, but essentially TLDR:
Q1: How do I adapt the kerchunk example for grib2 files to my case, and specifically, to a simple case?
In particular: the syntax seems to have changed since the linked example, and I am not sure where to find the details needed to update the kerchunk and fsspec calls.
I know these two packages are very much in flux and constantly moving along, and I am very thankful for their development! I just really want to try to use this trick, but I am lost.
Long Version:
I found this notebook, which seems to be a previous/further iteration of the example on the kerchunk documentation page.
In this notebook - based on my limited understanding - we create a json file for each grib file, then concatenate the json files using kerchunk.combine.MultiZarrToZarr, and then we magically read them in xarray.
This is where I list all the previously created json files and then concatenate them:
flist2 = fs2.ls(json_dir)
furls = sorted(['s3://'+f for f in flist2])

mzz = MultiZarrToZarr(furls,
                      storage_options={'anon':False},
                      remote_protocol='s3',
                      remote_options={'anon': True},
                      xarray_concat_args={'dim': 'valid_time'})

mzz.translate('hrrr_best.json')
This code doesn't work straight away because of changes in the fsspec syntax, which I have somewhat pinned down.
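For what it's worth, here is roughly what I think the updated combine step looks like in recent kerchunk versions. This is only my guess from reading the docstrings, so please correct me: the concatenation dimension seems to go in concat_dims instead of xarray_concat_args, and translate() returns the combined reference dict, which I then write out myself.

import json  # for writing the combined reference dict

# same furls list as above; concat_dims is my assumption for the newer API
mzz = MultiZarrToZarr(furls,
                      remote_protocol='s3',
                      remote_options={'anon': True},
                      concat_dims=['valid_time'])

combined = mzz.translate()
with open('hrrr_best.json', 'w') as f:
    f.write(json.dumps(combined))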
Then the cloud-friendly dataset is created as:
rpath = 's3://esip-qhub-public/noaa/hrrr/hrrr_best.json'
s_opts = {'requester_pays':True, 'skip_instance_cache':True}
r_opts = {'anon':True}

fs = fsspec.filesystem("reference", fo=rpath, ref_storage_args=s_opts,
                       remote_protocol='s3', remote_options=r_opts)
m = fs.get_mapper("")

ds2 = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False),
                      chunks={'valid_time':1})
Now, I follow the overall process, but I am completely lost when it comes to the details (and some of the syntax has changed in ways I don't even know where to begin looking up).
My example
I want to extract some files from another AWS dataset, the GEFSv12 reforecast.
My case is simpler: I don't have to keep updating the time/day of the files, since they are retrospective.
I am operating on the Pangeo AWS deployment (aws-uswest2.pangeo).
!pip install kerchunk
then
import xarray as xr
import hvplot.xarray
import datetime as dt
import pandas as pd
import dask
import panel as pn
import json
import fsspec
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr
import os
fs = fsspec.filesystem('s3', anon=True, skip_instance_cache=True)
dates = pd.date_range(start='2000-01-01',end='2001-01-2', freq='1D')
files = [date.strftime('s3://noaa-gefs-retrospective/GEFSv12/reforecast/%Y/%Y%m%d00/c00/Days:1-10/acpcp_sfc_%Y%m%d00_c00.grib2') for date in dates]
I create just two json files, for the first and second items of the list.
so = {"anon": True, "default_cache_type": "readahead"}
outfname = 'jsonfiles/' + files[0].split('/')[-1][:-6] + '.json'
out = scan_grib(files[0], common_vars=['time', 'step', 'latitude', 'longitude', 'valid_time'], storage_options=so)
with open(outfname, "w") as f:
    f.write(json.dumps(out))
and
so = {"anon": True, "default_cache_type": "readahead"}
outfname = 'jsonfiles/' + files[1].split('/')[-1][:-6] + '.json'
out = scan_grib(files[1], common_vars=['time', 'step', 'latitude', 'longitude', 'valid_time'], storage_options=so)
with open(outfname, "w") as f:
    f.write(json.dumps(out))
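One thing I am unsure about (again, just my guess from the current docstrings): in newer kerchunk versions, scan_grib appears to return a list of reference dicts, one per GRIB message, and common_vars may no longer be a keyword. If that is the case, I imagine the loop would look something like this sketch, though I don't know whether keeping only the first message is the right thing to do for this dataset:

so = {"anon": True, "default_cache_type": "readahead"}

for fname in files[:2]:
    outfname = 'jsonfiles/' + fname.split('/')[-1][:-6] + '.json'
    # newer scan_grib (my assumption): no common_vars, returns one reference
    # set per GRIB message in the file
    out = scan_grib(fname, storage_options=so)
    if isinstance(out, list):
        out = out[0]  # keep only the first message for now -- probably not right here
    with open(outfname, "w") as f:
        f.write(json.dumps(out))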
NOTE: Unlike Rich's example, I don't write my json files to a bucket; I simply save them in my local directory (is this the problem? I don't think it should be, because each json file contains the location of the grib file).
I then locate/list the files, mimicking the notebook linked above:
flist2 = os.listdir('jsonfiles/')
furls = sorted(['jsonfiles/'+f for f in flist2])
print(flist2)
print(furls)
I then get to the last part, and there I am completely lost.
The way to pass the keyword arguments has changed, and I am not sure how to pass the various values to fsspec.
Note that I am not using kerchunk.combine.MultiZarrToZarr here; I am simply passing one json file.
The syntax directly from the example notebook looks like:
rpath = 's3://esip-qhub-public/noaa/hrrr/hrrr_best.json'
s_opts = {'requester_pays':True, 'skip_instance_cache':True}
r_opts = {'anon':True}

fs = fsspec.filesystem("reference", fo=rpath, ref_storage_args=s_opts,
                       remote_protocol='s3', remote_options=r_opts)
m = fs.get_mapper("")

ds2 = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False))
The merged json file in that example is in a bucket, while mine is in a local directory, and several of the arguments and keyword arguments are no longer recognized.
When I try to add this part to my notebook, I get to the following step:
rpath = furls[1]
s_opts = {'requester_pays':True, 'skip_instance_cache':True}
r_opts = {'anon':True}

fs = fsspec.filesystem('s3', fo=rpath, ref_storage_args=s_opts,
                       remote_protocol='s3', remote_options=r_opts)
m = fs.get_mapper("")

ds2 = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False),
                      chunks={'valid_time':1})
that yells at me (among other things):
TypeError: __init__() got an unexpected keyword argument 'fo'
which tells me that the keyword arguments have changed in syntax.
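From my (possibly wrong) reading of the fsspec docs, the filesystem type should still be "reference" rather than 's3' (fo is a keyword of the reference filesystem, not of s3fs), and since my json is local I suspect the s_opts are not needed at all. This is the sketch I would try next, but I am not confident the keyword names are right:

rpath = furls[1]        # local json created above
r_opts = {'anon': True}

# "reference" filesystem: fo points at the (local) reference json,
# remote_protocol/remote_options describe where the grib2 bytes actually live
fs = fsspec.filesystem("reference", fo=rpath,
                       remote_protocol='s3', remote_options=r_opts)
m = fs.get_mapper("")

ds2 = xr.open_dataset(m, engine="zarr",
                      backend_kwargs=dict(consolidated=False),
                      chunks={'valid_time': 1})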
I guess my overall goal is to do what Rich does in his notebook, but without having to write the json files to a bucket, and of course with the necessary syntax updated.
Does it make any sense?
Can anyone help with ironing out the details?