Availability of AWS S3 CMIP6 data?

Luke_Madaus · February 12, 2021, 4:40pm

Hello all,

I’m finally getting back to some more climate model data analysis on the cloud! I was hoping to get some insight into what, if anything, is happening with the Zarr-format CMIP6 data under the s3://cmip6-pds bucket. I’ve been exploring transitioning some analysis to use that data, and it had been working fine for a couple of weeks. However, just this week, queries against that bucket have been lagging, and files that were previously there seem to be missing. Either that, or the response is timing out before getting anything back and I get “.zmetadata” key errors (not found).

Simple AWS cli commands like:
aws s3 ls s3://cmip6-pds
sometimes return…but when they do, it often takes several minutes. Trying to list any subkeys also is very hit-or-miss.

I browsed through the Pangeo discourse board and didn’t see anything specifically about any work being done on that bucket. I emailed the AWS Sustainability Data Initiative Team, since it seemed like this behavior was more of an S3 issue. Their response was that it looked like someone was doing a massive reorganization of data in that bucket, and I may be “running into that”.

Upon closer inspection, it does look like the data is being moved around to include a new “version” subkey in the dataset paths to be more consistent with how ESGF stores the data. I was curious if anyone on the Pangeo team could provide any insight or description of what was going on with that bucket? For the moment, the bucket seems to be unusable for me, since I can’t get any stable queries against files in it.

This is an amazing resource that this group is providing, and it’s emblematic of everything I enjoy about the Pangeo community. I really hope to be able to re-engage with the community in the coming months as I get back into things again!

Thanks!

Luke Madaus

rabernat · February 12, 2021, 9:20pm

Hi Luke! Some things have moved around, but the data should be there. Everything is documented here:

https://pangeo-data.github.io/pangeo-cmip6-cloud/

In general, we do not recommend you list the bucket directly, as it is so huge that listing is very slow. Instead, use the catalogs (CSV files) to find the data you need.
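As a minimal sketch of the catalog-based lookup (not the project's official tooling), the snippet below filters a catalog DataFrame by facet columns and returns the matching zstore paths. The helper name find_zstores is hypothetical, the column names are assumed to match those used in the pangeo-cmip6.csv catalog, and the tiny in-memory CSV (including its placeholder second row) stands in for the real file.

```python
import io

import pandas as pd


def find_zstores(catalog: pd.DataFrame, **facets: str) -> list[str]:
    """Return the zstore paths of catalog rows matching every given facet."""
    mask = pd.Series(True, index=catalog.index)
    for column, value in facets.items():
        mask &= catalog[column] == value
    return catalog.loc[mask, "zstore"].tolist()


# Small in-memory stand-in for the real catalog; the column names are an
# assumption, and the second row is a made-up placeholder.
SAMPLE_CSV = """activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore
ScenarioMIP,NCAR,CESM2,ssp585,r10i1p1f1,Amon,ta,gn,s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/v20200528/
ScenarioMIP,NCAR,CESM2,ssp126,r10i1p1f1,Amon,ta,gn,s3://example-bucket/hypothetical/store/
"""

catalog = pd.read_csv(io.StringIO(SAMPLE_CSV))
matches = find_zstores(
    catalog, source_id="CESM2", experiment_id="ssp585", variable_id="ta"
)
print(matches)
```

For real use, pd.read_csv would be pointed at the catalog CSV location given in the documentation linked above instead of the in-memory sample.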

edit: you can find some background about the restructuring here:

Luke_Madaus · February 12, 2021, 10:11pm

Hi Ryan! Thanks for the response. That github thread is just what I was looking for to explain what was going on.

I typically do use the CSV catalogs…the attempts to list things were a way to try and confirm what was happening when I tried to access the data. So, I’m still having difficulty getting datasets listed in the CSV catalogs to actually load. Based on @naomi-henderson’s last message in that cmip6-pipeline thread (which I’m assuming also applies to what’s happening in the AWS S3 cmip6-pds bucket), the CSV file I should be querying is one of two:

The pangeo-cmip6.csv file, or
The pangeo-cmip6-testing.csv file, which is updated as the new subkeys are written for the data.

I just re-ran and re-tested these queries. As an example, querying the pangeo-cmip6.csv file for NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta dataset gives a zstore value of:

s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/v20200528/

The same query against the pangeo-cmip6-testing.csv file gives a zstore value that doesn’t include the version subkey on the end (and also lacks the leading CMIP6/ prefix), which runs counter to what I expected from that github thread:

s3://cmip6-pds/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/
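A rough way to tell the two layouts apart is to check whether the last path segment looks like a version subkey. This is only a sketch: the helper name version_subkey is hypothetical, and the pattern (a "v" followed by eight digits) is an assumption based solely on the two paths quoted above.

```python
import re

# Assumed shape of a version subkey, inferred from "v20200528" above.
VERSION_SEGMENT = re.compile(r"v\d{8}")


def version_subkey(zstore: str):
    """Return the trailing version segment of a zstore path, or None."""
    segments = [s for s in zstore.split("/") if s]
    if segments and VERSION_SEGMENT.fullmatch(segments[-1]):
        return segments[-1]
    return None


with_version = "s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/v20200528/"
without_version = "s3://cmip6-pds/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/"

print(version_subkey(with_version))
print(version_subkey(without_version))
```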

Regardless, when I try to open either path directly with xarray/zarr, I get the KeyError message that it can’t find .zmetadata:

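import s3fs
import xarray

fs = s3fs.S3FileSystem(anon=True)  # Anonymous access to the public bucket
fs.invalidate_cache()  # Ensure we're refreshing our object cache
# Wrap the zstore path from the catalog as a key-value store
fmap = s3fs.S3Map('s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/r10i1p1f1/Amon/ta/gn/v20200528/', s3=fs)
# consolidated=True reads the store's .zmetadata key, which is where the KeyError arises
dset = xarray.open_zarr(fmap, consolidated=True)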

So I’m unsure if this is just a quirk of the S3 bucket’s internal catalog/listing of files not keeping up, or whether these CSVs are being updated/rewritten somehow before the transformation is complete?
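One way to separate "the store is not there yet" from other failures is to check for the .zmetadata key before calling open_zarr. The sketch below does this for any dict-like store; the helper name openable_paths is hypothetical, and plain dicts stand in for real stores so that it runs without network access.

```python
from collections.abc import Mapping


def openable_paths(stores: dict[str, Mapping]) -> list[str]:
    """Return the paths whose store exposes a consolidated-metadata key."""
    return [path for path, store in stores.items() if ".zmetadata" in store]


# Plain dicts standing in for real stores (hypothetical contents).
demo_stores = {
    "new/layout/": {".zmetadata": b"{}", "ta/.zarray": b"{}"},
    "old/layout/": {},  # nothing copied here yet
}

print(openable_paths(demo_stores))
```

With s3fs, each store would be an s3fs.S3Map built as in the snippet above; this assumes that membership tests on it behave like those on a dict.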

Just curious if others are also having similar problems querying and accessing the CMIP6 data from these catalogs.

Thanks, all!

naomi · February 12, 2021, 10:42pm

Sorry you are having trouble, Luke! We just finished a complete restructuring of the datasets in the Google Cloud bucket. Although the datasets and the CSV catalogs are now in sync in GC, it will take a while for the rclone scripts, which clone our GC collection on S3, to propagate the changes to the 425,000 datasets. So, for example, s3://cmip6-pds/ScenarioMIP/NCAR/CESM2/ssp126/* now exists on S3, but there are no s3://cmip6-pds/CMIP6/ScenarioMIP/NCAR/CESM2/ssp585/* datasets right now. This will change, of course, as the rclone scripts keep working. It may take another week or so to complete the clone. Please be patient - I promise that this reorganization will be worth the pain. After the datasets have been updated, the CSV file on S3 should only point to existing datasets.

I would like to add that the *-testing* catalogs are obsolete, and have been deleted on GC. Please use pangeo-cmip6.csv or pangeo-cmip6-noQC.csv in the future.

Luke_Madaus · February 12, 2021, 10:59pm

Ah, thanks @naomi ! Your explanation makes a lot of sense, and my confusion has been clarified. I totally agree with you that this restructuring is definitely worth some bumps in the meantime. I’ll test my workflow pointing to the ssp126 data, but then I’m happy to wait for the process to complete.

Thanks for the tip on the *-testing* catalogs being obsolete…I’ll just use the normal catalogs from here on out.

Thanks again!

naomi · February 24, 2021, 8:37pm

The AWS S3 CMIP6 bucket restructuring is now complete (finally!) and the catalogs should reflect all of the currently available data. Thanks for your patience! If you find any discrepancies or have suggestions, please open an issue here: pangeo-cmip6-cloud

Luke_Madaus · February 24, 2021, 9:19pm

Thank you so much, @naomi ! My test queries against the data are looking good so far. Thank you again for shepherding this whole process…
