Efficient access of ensemble data on AWS

Hi everyone,

I am a new user who is not a data professional. Thanks in advance for bearing with me.

I have made use of @rsignell’s script to download real-time NWM streamflow data from AWS. Thanks Rich!

Is there a way to efficiently access all the ensemble members? Right now I create new JSON files each time I download a new model run. For each of the 7 members, I run two sets of processes using dask to generate all the JSON files. I then make 7 xarray DataArrays and concatenate them into a single xarray DataArray.

This approach works in a reasonable amount of time, but I’m hoping someone can help me find a better way. There are many useful ensemble datasets available in the cloud, and I haven’t found any examples of best practices for accessing them.

@mabaxter can you share the script/notebook you are using?

@rsignell's original code is here: https://gist.github.com/rsignell-usgs/e381a55f41b87ac74c91e669afabb3cc

My code that I described above is here: https://github.com/mabaxter/NWM-AWS/blob/main/nwm_aws_realtime-ensemble.ipynb

Hi @mabaxter,

I had a quick go at this by creating a virtual dataset for each ensemble member, then appending the ensemble member number to each variable name before making a virtual dataset that contains all variables across the ensemble.


This obviously only deals with creating a virtual dataset across the ensemble; it does not handle updating the dataset when new runs become available.

Just to update this with a neater workflow, it is possible to specify a new dimension in Kerchunk, using the file names to populate it:
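
A minimal sketch of what this can look like (not necessarily the exact code from this post). It assumes the per-member reference files are named 1.json to 7.json, as in the paths shown later in this thread, and that the underlying data is read anonymously from S3. The helper names `member_number` and `combine_ensemble` are hypothetical; the `MultiZarrToZarr` arguments are the ones used further down in the thread.

```python
import re

# Pattern that captures the ensemble member number from names like ".../3.json".
# The naming scheme (1.json ... 7.json) is an assumption based on this thread.
ex = re.compile(r".*?(\d+)\.json$")


def member_number(path):
    """Return the ensemble member number encoded in a reference file name."""
    return int(ex.match(path).groups()[0])


def combine_ensemble(flist):
    """Combine per-member reference files along a new 'ensemble' dimension."""
    # Imported here so the regex helper above can be tried without kerchunk.
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        flist,
        remote_protocol="s3",
        remote_options={"anon": True},
        coo_map={"ensemble": ex},      # fill the new coordinate from file names
        concat_dims=["ensemble"],      # the new dimension to concatenate along
        identical_dims=["feature_id", "reference_time", "time"],
    )
    return mzz.translate()  # combined references as a dict
```

It is worth checking the pattern against your own paths with `member_number(...)` before calling `combine_ensemble(flist)`, since a pattern that does not match the file names will fail inside kerchunk.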

@Peter_Marsh, this is awesome! (Peter is a Google Summer of Code student working on Kerchunk this summer with @martindurant and me – great to see his work helping to solve real use cases!)


Hi @Peter_Marsh - thanks for your help! This is very useful and exactly what I was looking for. And thank you for your work on the kerchunk project as a whole.

I got the first version of your code to work with no problems. Creating the virtual dataset is faster than concatenating the xarray DataArrays.

I tried to get your second code example working, but ran into a problem here:

mzz = MultiZarrToZarr(flist,
                    remote_protocol='s3',
                    remote_options={'anon':True},
                    coo_map={'ensemble' : ex},
                    concat_dims = ['ensemble'],
                    identical_dims = ['feature_id', 'reference_time', 'time'],
                     )
out = mzz.translate()

I get:

TypeError                                 Traceback (most recent call last)
Input In [40], in <cell line: 8>()
      1 mzz = MultiZarrToZarr(flist, 
      2                     remote_protocol='s3',
      3                     remote_options={'anon':True},
      4                     coo_map={'ensemble' : ex},
      5                     concat_dims = ['ensemble'],
      6                     identical_dims = ['feature_id', 'reference_time', 'time'],
      7                      )
----> 8 out = mzz.translate()

File ~/python3/miniconda3/envs/rain2/lib/python3.10/site-packages/kerchunk/combine.py:394, in MultiZarrToZarr.translate(self, filename, storage_options)
    392 """Perform all stages and return the resultant references dict"""
    393 if 1 not in self.done:
--> 394     self.first_pass()
    395 if 2 not in self.done:
    396     self.store_coords()

File ~/python3/miniconda3/envs/rain2/lib/python3.10/site-packages/kerchunk/combine.py:200, in MultiZarrToZarr.first_pass(self)
    198 z = zarr.open_group(fs.get_mapper(""))
    199 for var in self.concat_dims:
--> 200     value = self._get_value(i, z, var, fn=self._paths[i])
    201     if isinstance(value, np.ndarray):
    202         value = value.ravel()

File ~/python3/miniconda3/envs/rain2/lib/python3.10/site-packages/kerchunk/combine.py:150, in MultiZarrToZarr._get_value(self, index, z, var, fn)
    148     o = selector[index]
    149 elif isinstance(selector, re.Pattern):
--> 150     o = selector.match(fn).groups[0]  # may raise
    151 elif not isinstance(selector, str):
    152     # constant, should be int or float
    153     o = selector

TypeError: 'builtin_function_or_method' object is not subscriptable

I ran this in a separate env where I made sure I had the latest packages, and everything looks to be working prior to this point. Can you help me out again?

Hi @mabaxter, this is most likely because your directory path differs from mine, so the regex applied to the file names no longer matches. Could you share the full paths where your intermediate JSON files are saved?

  import re
  ex = re.compile(r'.*?(\d+).json')
  ex.match(filename).groups[0]  # should return just the number in the file name (i.e. 1, the ensemble member number)

Thanks for your quick reply @Peter_Marsh.

This works:


['/home/baxte1ma/notebooks/Real-time NWM/1.json',
 '/home/baxte1ma/notebooks/Real-time NWM/2.json',
 '/home/baxte1ma/notebooks/Real-time NWM/3.json',
 '/home/baxte1ma/notebooks/Real-time NWM/4.json',
 '/home/baxte1ma/notebooks/Real-time NWM/5.json',
 '/home/baxte1ma/notebooks/Real-time NWM/6.json',
 '/home/baxte1ma/notebooks/Real-time NWM/7.json']



This gives the error I was seeing (not calling the method, no parentheses following groups):


TypeError                                 Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 ex.match('/home/baxte1ma/notebooks/Real-time NWM/1.json').groups[0]

TypeError: 'builtin_function_or_method' object is not subscriptable
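
The difference can be reproduced with the standard library alone. This is a small sketch using the same pattern and one of the paths above; the variable names `member` and `failed` are just for illustration.

```python
import re

ex = re.compile(r".*?(\d+).json")
m = ex.match("/home/baxte1ma/notebooks/Real-time NWM/1.json")

# groups is a method: calling it returns a tuple of the captured strings.
member = m.groups()[0]

# Indexing the method object itself (no parentheses) raises a TypeError.
try:
    m.groups[0]
    failed = False
except TypeError:
    failed = True

print(member, failed)
```

`m.group(1)` gives the same captured string as `m.groups()[0]`.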

Oh right, I also get that. I think this is a bug @martindurant has already fixed (commit "fix regex", fsspec/kerchunk@5cfa887 on GitHub).

The kerchunk version on PyPI seems to be a bit behind the latest on GitHub. I think it is best to install kerchunk from the latest code available on GitHub instead:

pip install git+https://github.com/fsspec/kerchunk
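
To check which kerchunk version ended up in the active environment, something like the following should work. It uses only the standard library; the helper name `installed_version` is hypothetical, and the exact version string reported for a GitHub install is not something this sketch assumes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if kerchunk is not installed here.
print(installed_version("kerchunk"))
```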

That did the trick. Thanks guys! @Peter_Marsh this approach is very useful, and I look forward to using it for other datasets, like the ensemble GFS (GEFS).
