Availability of AWS S3 CMIP6 data?

Hello all,

I’m finally getting back to some more climate model data analysis on the cloud! I was hoping to get some insight into what, if anything, is happening with the Zarr-format CMIP6 data under the s3://cmip6-pds bucket. I’ve been exploring transitioning some analysis to use that data, and it had been working fine for a couple of weeks. However, just this week, queries against that bucket have been lagging and files that were previously there seem to be missing. Either that, or the responses are timing out before anything comes back and I get “.zmetadata” key errors (not found).

Simple AWS CLI commands like:
aws s3 ls s3://cmip6-pds
sometimes return, but when they do, it often takes several minutes. Trying to list any subkeys is also very hit-or-miss.

I browsed through the Pangeo Discourse board and didn’t see anything specifically about any work being done on that bucket. I emailed the AWS Sustainability Data Initiative Team, since it seemed like this behavior was more of an S3 issue. Their response was that it looked like someone was doing a massive reorganization of data in that bucket, and I may be “running into that”.

Upon closer inspection, it does look like the data is being moved around to include a new “version” subkey in the dataset paths, to be more consistent with how ESGF stores the data. I was curious whether anyone on the Pangeo team could provide any insight into what’s going on with that bucket. For the moment, the bucket seems to be unusable for me, since I can’t get any stable queries against files in it.

This is an amazing resource that this group is providing, and it’s emblematic of everything I enjoy about the Pangeo community. I really hope to be able to re-engage with the community in the coming months as I get back into things again!

Thanks!

Luke Madaus

Hi Luke! Some things have moved around, but the data should be there. Everything is documented here:

https://pangeo-data.github.io/pangeo-cmip6-cloud/

In general, we do not recommend you list the bucket directly, as it is so huge that listing is very slow. Instead, use the catalogs (CSV files) to find the data you need.
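
For example, here is a minimal sketch of that workflow (the catalog URL is assumed from the docs linked above, so adjust if it differs, and the filter values are just placeholders):

import pandas as pd

# Load the catalog over HTTPS instead of listing the bucket.
# URL assumed from the pangeo-cmip6-cloud docs; adjust if it differs.
cat = pd.read_csv('https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.csv')

# Filter on the catalog columns to find your datasets; the 'zstore'
# column holds the path of each Zarr store.
subset = cat[(cat.source_id == 'CESM2')
             & (cat.experiment_id == 'ssp585')
             & (cat.table_id == 'Amon')
             & (cat.variable_id == 'ta')]
print(subset.zstore.tolist())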

edit: you can find some background about the restructuring here:

Hi Ryan! Thanks for the response. That GitHub thread is just what I was looking for to explain what was going on.

I typically do use the CSV catalogs…the attempts to list things were just a way to confirm what was happening when I tried to access the data. But I’m still having difficulty getting datasets listed in the CSV catalogs to actually load. Based on that cmip6-pipeline thread and @naomi-henderson’s last message there (which I’m assuming also applies to what’s happening in the AWS S3 cmip6-pds bucket), the CSV file I should be querying is one of two:

  1. The pangeo-cmip6.csv file, or
  2. The pangeo-cmip6-testing.csv file, which is updated as the new subkeys are written for the data.

I just re-ran and re-tested these queries. As an example, querying the pangeo-cmip6.csv file for the NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta dataset gives a zstore value of:

s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/v20200528/

The same query against the pangeo-cmip6-testing.csv file gives a zstore value that doesn’t include the version subkey on the end (which ran counter to my expectations from that GitHub thread):

s3://cmip6-pds/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/

Regardless, when I try to open either path directly with xarray/zarr, I get a KeyError saying it can’t find .zmetadata:

import s3fs
import xarray

fs = s3fs.S3FileSystem(anon=True)
fs.invalidate_cache()  # Clear any cached listings so we get fresh results
fmap = s3fs.S3Map('s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/v20200528/', s3=fs)
dset = xarray.open_zarr(fmap, consolidated=True)  # raises KeyError: '.zmetadata'
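
As a sanity check, I’ve also been probing for the consolidated-metadata key directly before opening (a minimal sketch reusing the fs object from above):

# Probe for the consolidated-metadata key that open_zarr is failing on.
store = 'cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/v20200528'
print(fs.exists(store + '/.zmetadata'))  # currently False, matching the KeyError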

So I’m unsure if this is just a quirk of the S3 bucket’s listing not keeping up, or if the CSVs are being updated/rewritten somehow before the transformation is complete.

Just curious if others are also having similar problems querying and accessing the CMIP6 data from these catalogs.

Thanks, all!

Sorry you are having trouble, Luke! We just finished a complete restructuring of the datasets in the Google Cloud bucket. Although the datasets and the CSV catalogs are now in sync on GC, it will take a while for the rclone scripts, which clone our GC collection to S3, to propagate the changes to the 425,000 datasets. So, for example, s3://cmip6-pds/ScenarioMIP/NCAR/CESM2/ssp126/* now exists on S3, but there are no s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/* datasets right now. This will change, of course, as the rclone scripts keep working; it may take another week or so to complete the clone. Please be patient - I promise that this reorganization will be worth the pain. After the datasets have been updated, the CSV file on S3 should only point to existing datasets.
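
If you want to check the clone’s progress yourself, probing a prefix with s3fs should work (a minimal sketch; note that existence checks against large prefixes can still be slow while the copy is in flight):

import s3fs

fs = s3fs.S3FileSystem(anon=True)
# Already cloned under the new layout:
print(fs.exists('cmip6-pds/ScenarioMIP/NCAR/CESM2/ssp126'))  # True
# Not yet propagated by the rclone scripts:
print(fs.exists('cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585'))  # False for now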

I would like to add that the *-testing* catalogs are obsolete and have been deleted on GC. Please use pangeo-cmip6.csv or pangeo-cmip6-noQC.csv in the future.

Ah, thanks @naomi! Your explanation makes a lot of sense and clears up my confusion. I totally agree with you that this restructuring will be worth some bumps in the meantime. I’ll test my workflow pointing to the ssp126 data, and then I’m happy to wait for the process to complete.

Thanks for the tip on the *-testing* catalogs being obsolete…I’ll just use the normal catalogs from here on out.

Thanks again!

The AWS S3 CMIP6 bucket restructuring is now complete (finally!) and the catalogs should reflect all of the currently available data. Thanks for your patience! If you find any discrepancies or have suggestions, please open an issue here: pangeo-cmip6-cloud

Thank you so much, @naomi! My test queries against the data are looking good so far. Thank you again for shepherding this whole process…
