Making pre-CMIP6 data easily accessible via the Pangeo cloud
Request: put 300+ GB of pre-CMIP6 model output for key 2D (surface or TOA) variables at monthly resolution into the Pangeo cloud
From my back-of-the-envelope calculations, ~300-500 GB should be enough for:
monthly resolution for the
piControl, and either
- the IPCC FAR, IPCC SAR, IPCC TAR, CMIP3, and CMIP5
I am open to cutting / adding experiments, variables, or realizations as necessary.
Forced experiments might be worth considering but you lose direct comparability as scenarios evolved from ad-hoc to A/B to SRES to RCP to SSP.
As part of the CMIP6 Hackathon, our group [GitHub repository] did some very basic comparisons of model skill between the CMIP6 ensemble and models from the first three IPCC reports (vaguely pre-CMIP). Here is a basic example of the kind of analysis we can do with this data:
We were able to do this by pushing <10 GB of model output to a personal GCS bucket running on a free trial. To extend this analysis to other variables (tas, uas, vas, pr, psl, sic, rlut, rsdt, rsut) and to the higher-resolution pre-CMIP6 ensembles from CMIP3 and CMIP5, we would need ~300 GB to 1 TB, depending on exactly how many experiments, realizations, and variables we include.
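For anyone who wants to sanity-check the storage estimate, here is a minimal sketch of the arithmetic. The grid size, run length, and float32 storage are hypothetical illustration values, not the actual numbers behind my estimate, and compression is ignored.

```python
def estimate_size_gb(n_lat, n_lon, n_years, n_variables,
                     n_members=1, bytes_per_value=4):
    """Uncompressed size in GB (1e9 bytes) of monthly-mean 2D fields."""
    n_months = 12 * n_years
    n_values = n_lat * n_lon * n_months * n_variables * n_members
    return n_values * bytes_per_value / 1e9


# Hypothetical example: one model on a 2.5-degree grid (72 x 144 points),
# 500 years of piControl, the 9 variables listed above, stored as float32.
example = estimate_size_gb(n_lat=72, n_lon=144, n_years=500, n_variables=9)
print(f"{example:.2f} GB")  # prints 2.24 GB
```

Multiplying by the number of models, experiments, and realizations per generation gives a rough total; finer CMIP3 and CMIP5 grids dominate the sum.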
Given how small these numbers are compared to what Pangeo already has for CMIP6 and how useful this data would be to the climate science community, I hope the Pangeo leadership will consider accommodating this request in some form.
Data format and catalog
All of the data is currently sitting in my personal bucket
gs://cmip6hack-multigen-zarr and is catalogued in the CSV file
gs://cmip6hack-multigen-zarr/pre-cmip-zarr-consolidated-stores.csv, which is formatted identically to the CMIP6 catalog and is thus trivially easy to read in using intake-esm (I have a private repository where I am doing this and am happy to share it with anyone offline).
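As a rough illustration of how the catalog could be used without intake-esm, here is a minimal sketch that filters the CSV and opens one store with xarray. It assumes the column names of the Pangeo CMIP6 catalog (`experiment_id`, `variable_id`, `zstore`), that `fsspec`, `gcsfs`, `xarray`, and `zarr` are installed, and that the caller has read access to the bucket. The helper names are hypothetical, and this is not the code in my private repository.

```python
import csv
import io

CATALOG_URL = "gs://cmip6hack-multigen-zarr/pre-cmip-zarr-consolidated-stores.csv"


def fetch_catalog_text(url, **storage_options):
    """Download the catalog CSV as text (needs fsspec and gcsfs)."""
    import fsspec  # imported lazily so the filtering helpers work offline
    with fsspec.open(url, "rt", **storage_options) as f:
        return f.read()


def load_catalog(csv_text):
    """Parse catalog CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def select_stores(rows, **criteria):
    """Return the zstore paths of rows matching every column=value pair."""
    return [
        row["zstore"]
        for row in rows
        if all(row.get(col) == val for col, val in criteria.items())
    ]


def open_store(zstore, storage_options=None):
    """Open one consolidated Zarr store lazily as an xarray Dataset."""
    import xarray as xr
    return xr.open_zarr(zstore, consolidated=True,
                        storage_options=storage_options)


# Intended usage (requires network access and credentials):
# rows = load_catalog(fetch_catalog_text(CATALOG_URL))
# stores = select_stores(rows, experiment_id="piControl", variable_id="tas")
# ds = open_store(stores[0])
```

intake-esm adds searching and merging on top of the same `zstore` column, so it is still the more convenient route for anything beyond a quick look.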
A second (currently private) GitHub repository contains all of the pre-processing steps:
- downloading the data from online archives (ipcc-data.org for pre-CMIP3 output and ESGF for CMIP3 and CMIP5)
- reading the raw output in its native format (binary, GRIB, or NetCDF4) and re-processing it into Zarr stores
- pushing to my private GCS bucket
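The last two steps can be sketched roughly as follows for the NetCDF case only. This is an illustration under assumptions, not the code in the private repository: the store layout built by `zarr_store_url` and both function names are hypothetical, and the write step assumes `xarray`, `zarr`, and `gcsfs` are installed with write credentials for the bucket.

```python
def zarr_store_url(bucket, activity, source, experiment, variable):
    """Build a store URL; this directory layout is a hypothetical example."""
    return f"gs://{bucket}/{activity}/{source}/{experiment}/{variable}"


def netcdf_to_zarr(local_path, store_url, storage_options=None):
    """Read one downloaded NetCDF file and write it as a consolidated Zarr store."""
    import xarray as xr  # imported lazily; only needed for the conversion
    ds = xr.open_dataset(local_path)
    # consolidated=True writes the metadata into a single object,
    # which keeps opening the store from cloud storage fast.
    ds.to_zarr(store_url, mode="w", consolidated=True,
               storage_options=storage_options)
    return store_url


# Intended usage (requires credentials; the file name is a placeholder):
# url = zarr_store_url("cmip6hack-multigen-zarr", "CMIP3", "modelA",
#                      "piControl", "tas")
# netcdf_to_zarr("tas_modelA_piControl.nc", url)
```

The binary and GRIB inputs need their own readers before the same write step applies.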
I understand that the Pangeo GCS is somewhat in flux at the moment as the NSF award is running out and datasets are being removed or catalogued. Happy to revisit this in the future if now is not a good time!