Efficient access of ensemble data on AWS

mabaxter · June 16, 2022, 1:17am

Hi everyone,

I am a new user who is not a data professional. Thanks in advance for bearing with me.

I have made use of @rsignell’s script to download real-time NWM streamflow data from AWS. Thanks Rich!

Is there a way to efficiently access all the ensemble members? Right now I create new JSON files each time I download a new model run. For each of the 7 members, I run two sets of processes using dask to generate all the JSON files. I then make 7 xarray DataArrays that I concatenate into a single DataArray.

This approach works in a reasonable amount of time, but I’m hoping someone can help me find a better way. There are many useful ensemble datasets available in the cloud, and I haven’t found any examples on best practices for accessing them.

ktyle · June 17, 2022, 2:14pm

@mabaxter can you share the script/notebook you are using?

mabaxter · June 17, 2022, 2:24pm

@rsignell's original code is here: https://gist.github.com/rsignell-usgs/e381a55f41b87ac74c91e669afabb3cc

My code that I described above is here: https://github.com/mabaxter/NWM-AWS/blob/main/nwm_aws_realtime-ensemble.ipynb

Peter_Marsh · June 23, 2022, 4:00pm

Hi @mabaxter,

I had a quick go at this by creating a virtual dataset for each ensemble member, then appending the ensemble member number to each variable name before making a virtual dataset that contains all variables across the ensemble.

https://nbviewer.org/gist/peterm790/f50179e19ca54de63896d7340bd5c878

This obviously only deals with creating a virtual dataset across the ensemble, not with updating it when new runs become available.

Peter_Marsh · June 28, 2022, 8:39am

Just to update this with a neater workflow, it is possible to specify a new dimension in Kerchunk, using the file names to populate it:
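
A minimal sketch of that idea, assuming one kerchunk reference file per member named 1.json through 7.json (the naming used later in this thread) and a kerchunk build recent enough to handle regex selectors. The helper names `member_pattern`, `combine_members`, and `open_virtual` are made up for illustration, and the `identical_dims` list is taken from the call quoted further down rather than checked against the NWM files:

```python
import re


def member_pattern():
    # Capture the digits directly before ".json"; kerchunk applies the
    # pattern to each file name and uses the first group as the coordinate.
    return re.compile(r'.*?(\d+)\.json$')


def combine_members(flist):
    """Combine per-member reference files along a new 'ensemble' dimension."""
    # Imported lazily so the pattern helper can be tried without kerchunk.
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        flist,
        remote_protocol='s3',            # where the referenced data lives
        remote_options={'anon': True},   # public bucket, no credentials
        coo_map={'ensemble': member_pattern()},
        concat_dims=['ensemble'],
        identical_dims=['feature_id', 'reference_time', 'time'],
    )
    return mzz.translate()  # dict of combined references


def open_virtual(refs):
    """Open the combined references lazily with xarray (assumes zarr + fsspec)."""
    import xarray as xr

    return xr.open_dataset(
        'reference://',
        engine='zarr',
        backend_kwargs={
            'consolidated': False,
            'storage_options': {
                'fo': refs,
                'remote_protocol': 's3',
                'remote_options': {'anon': True},
            },
        },
    )
```

Typical use would be `ds = open_virtual(combine_members(flist))`, where `flist` is the list of per-member JSON paths.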

rsignell · June 28, 2022, 11:38am

@Peter_Marsh, this is awesome! (Peter is a Google Summer of Code student working on Kerchunk this summer with @martindurant and me – great to see his work helping to solve real use cases!)

mabaxter · June 28, 2022, 8:58pm

Hi @Peter_Marsh - thanks for your help! This is very useful and exactly what I was looking for. And thank you for your work on the kerchunk project as a whole.

I did get the first version of your code to work for me with no problems. Creating the virtual dataset is faster than concatenating the xarray DataArrays.

I tried to get the second code you put up working, but ran into a problem here:

mzz = MultiZarrToZarr(flist, 
                    remote_protocol='s3',
                    remote_options={'anon':True},
                    coo_map={'ensemble' : ex},
                    concat_dims = ['ensemble'],
                    identical_dims = ['feature_id', 'reference_time', 'time'],
                     )
out = mzz.translate()

I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [40], in <cell line: 8>()
      1 mzz = MultiZarrToZarr(flist, 
      2                     remote_protocol='s3',
      3                     remote_options={'anon':True},
   (...)
      6                     identical_dims = ['feature_id', 'reference_time', 'time'],
      7                      )
----> 8 out = mzz.translate()

File ~/python3/miniconda3/envs/rain2/lib/python3.10/site-packages/kerchunk/combine.py:394, in MultiZarrToZarr.translate(self, filename, storage_options)
    392 """Perform all stages and return the resultant references dict"""
    393 if 1 not in self.done:
--> 394     self.first_pass()
    395 if 2 not in self.done:
    396     self.store_coords()

File ~/python3/miniconda3/envs/rain2/lib/python3.10/site-packages/kerchunk/combine.py:200, in MultiZarrToZarr.first_pass(self)
    198 z = zarr.open_group(fs.get_mapper(""))
    199 for var in self.concat_dims:
--> 200     value = self._get_value(i, z, var, fn=self._paths[i])
    201     if isinstance(value, np.ndarray):
    202         value = value.ravel()

File ~/python3/miniconda3/envs/rain2/lib/python3.10/site-packages/kerchunk/combine.py:150, in MultiZarrToZarr._get_value(self, index, z, var, fn)
    148     o = selector[index]
    149 elif isinstance(selector, re.Pattern):
--> 150     o = selector.match(fn).groups[0]  # may raise
    151 elif not isinstance(selector, str):
    152     # constant, should be int or float
    153     o = selector

TypeError: 'builtin_function_or_method' object is not subscriptable

I ran this in a separate env where I made sure I had the latest packages, and everything looks to be working prior to this point. Can you help me out again?

Peter_Marsh · June 29, 2022, 8:03am

Hi @mabaxter, this is most likely because your directory path differs from mine, so the regex applied to the file names no longer works. Will you share the full paths where your intermediate JSONs are saved?

  import re
  ex = re.compile(r'.*?(\d+).json')
  ex.match(filename).groups()[0]  # should return just the ensemble member number from the file name, e.g. '1'

mabaxter · June 29, 2022, 1:13pm

Thanks for your quick reply @Peter_Marsh.

This works:

flist

['/home/baxte1ma/notebooks/Real-time NWM/1.json',
 '/home/baxte1ma/notebooks/Real-time NWM/2.json',
 '/home/baxte1ma/notebooks/Real-time NWM/3.json',
 '/home/baxte1ma/notebooks/Real-time NWM/4.json',
 '/home/baxte1ma/notebooks/Real-time NWM/5.json',
 '/home/baxte1ma/notebooks/Real-time NWM/6.json',
 '/home/baxte1ma/notebooks/Real-time NWM/7.json']

ex.match(flist[0]).groups()[0]

'1'

This gives the error I was seeing (not calling the method, no parentheses following groups):

ex.match(flist[0]).groups[0]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 ex.match('/home/baxte1ma/notebooks/Real-time NWM/1.json').groups[0]

TypeError: 'builtin_function_or_method' object is not subscriptable
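
A small sketch of the same check wrapped in a helper, which also makes it easy to order the reference files numerically before combining them. The helper name `member_number` and the `/tmp/refs/...` paths are made up for illustration; the pattern is the one from this thread:

```python
import re

# Same pattern as in the thread: the digits just before ".json".
ex = re.compile(r'.*?(\d+).json')


def member_number(path):
    """Return the ensemble member number encoded in a reference file name."""
    m = ex.match(path)
    if m is None:
        raise ValueError(f'no member number found in {path!r}')
    # groups is a method: call it to get the tuple, then index the tuple.
    return int(m.groups()[0])


# Hypothetical paths, deliberately out of order.
paths = ['/tmp/refs/10.json', '/tmp/refs/2.json', '/tmp/refs/1.json']

# Sort by the numeric member rather than alphabetically by path.
ordered = sorted(paths, key=member_number)
```

Sorting on the integer keeps a member 10 after member 2, which a plain string sort of the paths would not.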

Peter_Marsh · June 29, 2022, 3:32pm

Oh right, I also get that. I think this is a bug @martindurant has already fixed: "fix regex" (fsspec/kerchunk@5cfa887 on GitHub).

The kerchunk version on PyPI seems to be a bit behind the latest on GitHub. I think it is best to install kerchunk from the latest code available on GitHub instead.

rsignell · June 29, 2022, 3:55pm

pip install git+https://github.com/fsspec/kerchunk
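
After reinstalling, it may be worth confirming which kerchunk build the notebook environment actually picks up, since a stale kernel can still point at the older PyPI release. A minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if kerchunk is not in this environment.
print(installed_version('kerchunk'))
```

If the printed version does not change after the install, restarting the notebook kernel is the usual next step.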

mabaxter · June 30, 2022, 1:49am

That did the trick. Thanks guys! @Peter_Marsh this approach is very useful, and I look forward to using it for other datasets, like the ensemble GFS (GEFS).

rsignell · April 26, 2023, 6:43pm

Here’s a notebook that uses kerchunk to create a virtual GEFS reforecast dataset with step, ensemble, and time dimensions.

It results in a dataset for each group of grib variables at the same levels that looks like this (for 3 forecast cycles):

And here’s a snapshot just to show it’s working:
