Access to Pangeo GCS Bucket to push model output from pre-CMIP6 experiments?

Making pre-CMIP6 data easily accessible via the Pangeo cloud

Request: Putting 300+ GB of pre-CMIP6 model output for key 2D (surface or TOA) variables at monthly resolution into the Pangeo cloud

From my back-of-the-envelope calculations, ~300-500 GB should be enough for (see the rough sketch below the list):

  • tas, uas, vas, psl, pr, sic, rlut, rsdt, and rsut at
  • monthly resolution for the
  • historical, 1pctCO2, piControl, and either 2xCO2 or 4xCO2 experiments for
  • the IPCC FAR, IPCC SAR, IPCC TAR, CMIP3, and CMIP5 ensembles
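
For concreteness, here is a rough sketch of where the ~300-500 GB figure comes from. The grid size, model count, and record length below are assumptions for illustration, not an inventory of the actual archives.

```python
# Rough storage estimate -- all counts below are illustrative assumptions.
n_vars = 9             # tas, uas, vas, psl, pr, sic, rlut, rsdt, rsut
n_experiments = 4      # historical, 1pctCO2, piControl, 2xCO2 or 4xCO2
n_models = 40          # combined across FAR/SAR/TAR/CMIP3/CMIP5 (assumed)
n_years = 150          # typical experiment length (assumed)
nlat, nlon = 144, 192  # ~1.25 x 1.875 degree grid (assumed; older models are coarser)
bytes_per_value = 4    # float32

months = 12 * n_years
total_bytes = n_vars * n_experiments * n_models * months * nlat * nlon * bytes_per_value
print(f"~{total_bytes / 1e9:.0f} GB uncompressed")  # roughly a few hundred GB
```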

I am open to cutting / adding experiments, variables, or realizations as necessary.

Forced experiments might be worth considering, but you lose direct comparability because the scenarios themselves evolved from ad-hoc to A/B to SRES to RCP to SSP.

Details

As part of the CMIP6 Hackathon, our group [Github repository] did some very basic comparisons of model skill between the CMIP6 ensemble and models from the first three IPCC reports (vaguely pre-CMIP). Our hackathon notebook shows a basic example of the kind of analysis we can do with this data.

We were able to do this by pushing <10 GB of model output to a personal GCS bucket running on a free trial. To extend this analysis to other variables (tas, uas, vas, pr, psl, sic, rlut, rsdt, rsut) and the higher-resolution pre-CMIP6 ensembles from CMIP3 and CMIP5, we would need ~300 GB to 1 TB, depending on exactly how many experiments, realizations, and variables we include.

Given how small these numbers are compared to what Pangeo already has for CMIP6 and how useful this data would be to the climate science community, I hope the Pangeo leadership will consider accommodating this request in some form.

Data format and catalog

All of the data is currently sitting in my personal bucket gs://cmip6hack-multigen-zarr and is catalogued in the CSV file gs://cmip6hack-multigen-zarr/pre-cmip-zarr-consolidated-stores.csv, which is formatted identically to the CMIP6 catalog and is thus trivially easy to read in using intake-esm (I have a private repository where I am doing this and am happy to share with anyone offline).
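
Until an intake-esm collection description (JSON) is published for it, the CSV can also be read directly with pandas and individual stores opened with xarray. Here is a minimal sketch, assuming the CSV shares the CMIP6 catalog's zstore, variable_id, and experiment_id columns and that the bucket allows anonymous reads; those column names and the access mode are assumptions, not guarantees.

```python
import pandas as pd
import xarray as xr
import gcsfs

# Read the catalog CSV straight from GCS (pandas uses gcsfs under the hood).
cat = pd.read_csv(
    "gs://cmip6hack-multigen-zarr/pre-cmip-zarr-consolidated-stores.csv",
    storage_options={"token": "anon"},  # assumes public read access
)

# Pick out one store; 'zstore', 'variable_id', and 'experiment_id' mirror the
# CMIP6 catalog layout and are assumptions about this particular CSV.
path = cat.query("variable_id == 'tas' and experiment_id == 'historical'").zstore.iloc[0]

# Open the consolidated zarr store lazily with xarray.
fs = gcsfs.GCSFileSystem(token="anon")
ds = xr.open_zarr(fs.get_mapper(path), consolidated=True)
print(ds)
```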

A second (currently private) GitHub repository contains all of the pre-processing steps (a minimal sketch of the conversion step follows the list):

  • downloading the data from online archives (ipcc-data.org for pre-CMIP3 output and ESGF for CMIP3 and CMIP5)
  • reading the raw output in its native format (binary, GRIB, or NetCDF4) and re-processing it into zarr files
  • pushing to my private GCS bucket
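
To illustrate the last two steps, here is a minimal sketch of converting one raw file to a consolidated zarr store in a GCS bucket. The file name, chunking, and destination path are hypothetical placeholders; GRIB input would use xarray's cfgrib engine instead, and the actual pre-processing lives in the private repository described above.

```python
import xarray as xr
import gcsfs

# Open one raw model output file (use engine="cfgrib" for GRIB input).
ds = xr.open_dataset("tas_Amon_some-model_historical.nc")  # hypothetical file name

# Chunk sensibly for cloud access before writing zarr.
ds = ds.chunk({"time": 120})

# Write a consolidated zarr store into a GCS bucket (path is a placeholder).
fs = gcsfs.GCSFileSystem()  # uses your default GCP credentials
store = fs.get_mapper("gs://cmip6hack-multigen-zarr/tas/some-model/historical.zarr")
ds.to_zarr(store, mode="w", consolidated=True)
```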

I understand that the Pangeo GCS is somewhat in flux at the moment as the NSF award is running out and datasets are being removed or catalogued. Happy to revisit this in the future if now is not a good time!

Best,
Henri


Paging Pangeo folks @jhamman @rabernat @naomi-henderson @naomi and our CMIP6 Hackathon team @bradyrx @brian-rose @amv10070

@hdrake1, good timing. I was just showing your hackathon notebook to our group today, and many expressed an interest in accessing your pre-CMIP3 output. We felt a little guilty since you probably have to pay the data egress charges yourself, but we agree this would be very useful.

@rabernat, do you think Shane would allow us to add the other CMIPs to the gs://cmip6 bucket? If not, would we be able to host this on a Pangeo bucket? I would be willing to work with Henri to get this done; in fact it should be fairly quick and easy since he has already worked out the pre-processing and, as he says, it is nearly identical to the CMIP6 workflow (except much, much smaller). Since it is a limited collection of CMIP5 (and will be modified by converting to zarr), I don't think the licensing issue is a huge concern.

Forgive my curt response…proposal due tomorrow.

YES! Comparison with past CMIP is important, and it’s not a ton of data. Let’s put it in the CMIP6 bucket.

Thanks so much Henri for spearheading this and Naomi for making it happen!


If it helps at all with the politics of which bucket the data goes into, I am happy to handle the IPCC FAR, SAR, and TAR data (~30 GB) myself. I can probably find some funds to pay for it. In the short term (the next year), I still have $299 of my free trial :slight_smile:

@naomi, if I understand correctly, I have to pay both when I upload the data to the bucket and anytime anyone accesses it (is this what you mean by egress charges)? I’ve tried to make pricing estimates from the GCP website but I don’t really understand the terminology well enough to be confident in them…

Data egress (transfer out to non-Google sites) is $0.12 per GB; data ingress (transfer in) is free.
So if I download the 30 GB of IPCC FAR, SAR, and TAR data to my laptop, your account would be charged $3.60. There was no extra charge when you uploaded the data, and if I transfer data from your Google bucket to the cmip6 bucket, that is free too. So a one-time transfer is pretty trivial. But if many folks are accessing your data from Jupyter notebooks which are running in Google Cloud, it could add up pretty fast.
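
To make "adds up pretty fast" concrete, here is a quick illustrative calculation; the user count and per-user volume are made-up numbers, and it ignores the same-region exemption discussed further down the thread.

```python
# Hypothetical usage pattern -- numbers are illustrative, not measurements.
egress_rate = 0.12  # USD per GB to non-Google destinations
n_users = 50        # people re-running the analysis (assumed)
gb_per_user = 30    # each pulls the full FAR/SAR/TAR subset (assumed)

print(f"${egress_rate * n_users * gb_per_user:.2f} in egress charges")  # $180.00
```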

Thank you for the information and I agree it would be simplest to put all of the data in the Pangeo bucket.

My understanding of the table (quoted below) from the GCP compute price list (it also appears in the storage price list) is that if the Jupyter notebooks are running in Google Cloud (e.g. on Google Kubernetes Engine) in the same region, then there is no charge for data egress.

Egress to a different Google Cloud Platform service within the same region using an external IP address or an internal IP address, except for Cloud Memorystore for Redis, Cloud Filestore, and Cloud SQL: No charge

If this is true, it would be nice to somehow be able to limit access to only GCP services in the same region. Alternatively, one might be able to switch to a 'requester-pays' model so that anyone can still access it, but 1) it remains free to GCP services in the same region and 2) anyone outside of the region, or not on a GCP service, pays the egress fee themselves. Upon first inspection, I think intake-esm and other utilities would need to be updated to accommodate the requester-pays API.
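
For what it's worth, here is a minimal sketch of what requester-pays access might look like from Python. It assumes the bucket has requester-pays enabled and that the installed gcsfs version supports the requester_pays flag; the billing project and store path are placeholders.

```python
import xarray as xr
import gcsfs

# With requester-pays enabled on the bucket, the *reader's* GCP project is
# billed for egress; same-region access from GCP services stays free.
fs = gcsfs.GCSFileSystem(
    project="my-gcp-project",  # hypothetical project to bill for egress
    requester_pays=True,       # assumes a gcsfs version that supports this flag
)

# Placeholder zarr store path inside the requester-pays bucket.
store = fs.get_mapper("gs://cmip6hack-multigen-zarr/tas/some-model/historical.zarr")
ds = xr.open_zarr(store, consolidated=True)
```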

@naomi, happy to continue chatting offline (hdrake at mit.edu, or Henri Drake in the cmip6hackers Slack) and coordinate the best way to get this pre-CMIP6 output into the Pangeo CMIP6 bucket.