Reading ERA5 data on Planetary Computer

I’ve started reading ERA5 data on the Planetary Computer. I first tried using `stac_load`, as I’ve done with other collection items, but kept getting an empty dataset back. I then followed the example notebook explicitly and did replicate its output. I note it uses the now-deprecated `.get_all_items()` method, and I like to use updated API calls where possible, but the updated method returns a generator, so you can’t access it with `item = items[0]`, and I wanted to repeat the example notebook as faithfully as possible in the first instance. Anyway, I did get it to work.

The example picks the first item with item = items[0] and then iterates through the assets to then combine them into a Dataset of multiple variables. This feels kinda clunky. Is this really the best approach?

Noting that this was just for one month/item, for multiple months would you have to iterate over each item and then each asset and then .combine_by_coords?

The `**asset.extra_fields["xarray:open_kwargs"]` parameter seems crucial to success. Looking at the item, I can see this attribute for each asset thus:

```python
{'xarray:open_kwargs': {'chunks': {},
                        'engine': 'zarr',
                        'consolidated': True,
                        'storage_options': {'account_name': 'cpdataeuwest',
                                            'credential': 'blahblahblah'}}}
```

I did wonder how it knew to automatically return dask arrays. But this feels like a “not very stac like” undocumented means of access that you have to kinda know about?

Aside from any specific questions above, I’m left with general musings such as:

  • Does the example use xarray’s open_dataset specifically because of the presence of this magic field?
  • Is this magic field there just to support direct open_dataset access?
  • Does stac_load not support zarr?
  • Can the data be loaded using stac_load, or is it “Nope, you’ve got to use xarray’s open_dataset directly”?
  • Any recommended reading to learn more about this open_dataset access pattern that makes use of this extra_fields parameter that’s stored in data?

Pinging @TomAugspurger (sorry :wink: ) for much valued MSPC viewpoint!

Huh, I really thought I had updated all those notebooks, but apparently not. The equivalent method call would be either `next(search.items())` or `search.item_collection()[0]`. I’ll get to that sooner or later.
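To illustrate why the switch matters, here’s a minimal stand-in (not real pystac-client objects; the class and item names are made up) showing why positional indexing fails on a generator and what the two replacement patterns look like:

```python
# Hypothetical stand-in for a pystac-client ItemSearch, purely to illustrate
# the two access patterns; this is not the real pystac-client API surface.
class FakeSearch:
    def items(self):
        # items() yields lazily, generator-style
        yield from ["era5-2020-01", "era5-2020-02"]

    def item_collection(self):
        # item_collection() materializes everything up front
        return list(self.items())


search = FakeSearch()
first_via_next = next(search.items())          # works on a generator
first_via_index = search.item_collection()[0]  # works on a materialized collection

try:
    search.items()[0]                          # generators don't support indexing
except TypeError:
    indexing_failed = True
```

Either replacement gives you the first item; `next(search.items())` avoids fetching every page of results just to look at one.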

The example picks the first item with item = items[0] and then iterates through the assets to then combine them into a Dataset of multiple variables. This feels kinda clunky. Is this really the best approach?

For now, yes. But see stac-utils/pystac#846 (“Convenience methods for converting STAC objects / linked assets to data containers”) and the linked issues / repos.
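To make the pattern concrete, here’s a sketch with synthetic single-variable datasets standing in for the per-asset Zarr opens. The `fake_asset_open` helper is hypothetical; in the real notebook that slot is filled by `xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])`:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for opening one asset: each ERA5 asset is a
# single-variable Zarr dataset sharing the same coordinates.
def fake_asset_open(varname):
    return xr.Dataset(
        {varname: ("time", np.zeros(3))},
        coords={"time": np.arange(3)},
    )

# One dataset per asset, then combine them into one multi-variable Dataset.
datasets = [fake_asset_open(v) for v in ["t2m", "tp"]]
ds = xr.combine_by_coords(datasets, join="exact")
```

`join="exact"` makes the combine fail loudly if the assets’ coordinates don’t actually line up, rather than silently aligning them.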

Noting that this was just for one month/item, for multiple months would you have to iterate over each item and then each asset and then .combine_by_coords?

I think so, yes. These are separate Zarr datasets, and I think that’s the preferred way to combine them.
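A sketch of that combination step with synthetic monthly datasets (in practice each would come from opening one item’s assets; `fake_month` is a made-up helper):

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for two consecutive monthly items: same variable,
# adjacent, non-overlapping time coordinates.
def fake_month(start_hour):
    time = np.arange(start_hour, start_hour + 3)
    return xr.Dataset(
        {"t2m": ("time", np.full(3, 280.0))},
        coords={"time": time},
    )

monthly = [fake_month(0), fake_month(3)]

# combine_by_coords concatenates along the monotonic "time" coordinate.
combined = xr.combine_by_coords(monthly)
```

Because the time coordinates are monotonic and non-overlapping, `combine_by_coords` infers the concatenation order from the coordinate values themselves.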

But this feels like a “not very stac like” undocumented means of access that you have to kinda know about?

@jsignell’s xpystac library helps with this, I think. It’ll be something like `xr.open_dataset(stac_asset)`, and it’ll take care of inferring the right engine (based on the media type in the STAC metadata). As for the `storage_options`, it’s kind of unavoidable: adlfs needs an account name, and this is a private storage container, so you need a credential. That either goes in the STAC metadata (like `account_name`), gets injected by another library (like `planetary_computer.sign` injecting `credential`), or is specified by the user.

So I wouldn’t really call it “magic”. It’s simply the keyword arguments you would need to manually supply to read the data.
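Concretely, using the sample `extra_fields` value quoted earlier (credential shortened to a placeholder):

```python
# The asset's extra_fields simply carry the keyword arguments you'd
# otherwise write out by hand. The credential value is a placeholder.
extra_fields = {
    "xarray:open_kwargs": {
        "chunks": {},  # an empty chunks dict is what triggers lazy dask arrays
        "engine": "zarr",
        "consolidated": True,
        "storage_options": {
            "account_name": "cpdataeuwest",
            "credential": "blahblahblah",
        },
    }
}

open_kwargs = extra_fields["xarray:open_kwargs"]
# The manual equivalent of the notebook's call would then be:
#   xr.open_dataset(asset.href, **open_kwargs)
```

The `chunks={}` entry also answers the earlier question about dask: passing `chunks` to `xr.open_dataset` is what makes it return dask-backed arrays instead of eagerly loading everything.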

  • Does stac_load not support zarr?

Is this from odc.stac? I don’t think it supports Zarr.

  • Any recommended reading to learn more about this open_dataset access pattern that makes use of this extra_fields parameter that’s stored in data?

The STAC extension is stac-extensions/xarray-assets on GitHub: it helps users open STAC Assets with xarray, and gives catalog maintainers a place to specify various required or recommended options.


And just a general FYI about Planetary Computer, that dataset is currently not updating. I have a task item to get it moved to a new pipeline, but ran into some issues and haven’t gotten back to it.


Super helpful reply, thanks @TomAugspurger ! I found the stac-extensions link particularly helpful.

I’d noticed the dataset only seems to run up to the end of 2020. I thought I’d seen another source suggest it only ran up to 2020 (despite ERA5 supposedly being up to present day). Is there a timescale for fixing this?

Does anyone know of ERA5 hosted on a STAC API anywhere else?
(Also I’ll take recommendations for such sources of weather/climate data for UK, with emphasis on precipitation, but also insolation, wind, and temperature).

Just here to ask the same question: what is it with MPC only going up to 2020? It’s the same with GHRSST.

As I mentioned, I have a task to update the data pipeline but (still) haven’t gotten to it. There were some complications with small floating-point differences from what we already have that I haven’t tracked down.

Gosh, I missed that, apologies. There’s so much celebration about this being the ultimate way to do things; I’m sorry to hear it really comes down to one person with a todo list for years-out-of-date standard products. I hope there is support coming? Is Microsoft actually contributing to this in a real way?

I have a pipeline for reprocessing GHRSST into COG and, with help from a cloud expert, we are exploring pushing that up to source.coop. To me that’s a better solution than Zarr (which doesn’t have overviews, afaict), so I wonder if we should discuss the issues a bit more directly. I was a bit surprised to find ERA5 also not up to date, and I had diverted to accessing it on our HPC system rather than MPC.

(edit: oh, I see you work for Microsoft; OK, I need to get across the issues here more)