Pangeo Showcase: "How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge - And why we should never do this again!"


Title: “How to transform thousands of CMIP6 datasets to Zarr with Pangeo Forge and why we should never do this again!”
Invited Speakers: Julius Busecke (ORCID: 0000-0001-8571-865X), Charles Stern (ORCID: 0000-0002-4078-0852)
When: Wednesday, Nov 29, 12PM EST
Where: Launch Meeting - Zoom
The Pangeo CMIP6 working group has maintained an analysis-ready, cloud-optimized (ARCO) Zarr copy of hundreds of thousands of CMIP6 datasets, but the process has so far been manual and very labor-intensive. While retracted datasets were removed, many newly available and requested datasets were never added to the ARCO Zarr stores… until now.

Using the newest version of Pangeo Forge, which is built on Apache Beam, we can ingest and transform thousands of datasets from the ESGF catalog into ARCO Zarr stores in response to user requests. To realize this workflow, we implemented several features as part of the Apache Beam pipelines, including dynamic chunking (determined at recipe runtime), testing, and cataloging.
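
To give a rough idea of what choosing chunks at recipe runtime can mean, here is a minimal sketch in plain Python. It is not the Pangeo Forge implementation: the function name `time_chunk_length`, the 100 MiB target, and the choice to chunk only along the time dimension are all assumptions made for illustration. The idea is that the chunk length is derived from the size of each incoming dataset rather than hard-coded in the recipe.

```python
import math


def time_chunk_length(dim_sizes, itemsize, target_bytes=100 * 1024**2, time_dim="time"):
    """Return a chunk length along `time_dim` whose chunk size stays at or below
    `target_bytes`, keeping all other dimensions unchunked.

    Hypothetical helper for illustration only; the 100 MiB default is an assumption.
    """
    # Bytes needed to store one time step across all non-time dimensions.
    other = math.prod(size for name, size in dim_sizes.items() if name != time_dim)
    bytes_per_step = other * itemsize
    # Largest whole number of time steps that fits the target, but at least one.
    steps = max(1, target_bytes // bytes_per_step)
    # Never ask for more steps than the dataset actually has.
    return min(dim_sizes[time_dim], steps)


# Example: 1980 monthly time steps on a 180 x 360 grid, stored as float32 (4 bytes).
sizes = {"time": 1980, "lat": 180, "lon": 360}
chunks = {"time": time_chunk_length(sizes, itemsize=4), "lat": 180, "lon": 360}
```

Because the calculation needs only the dimension sizes and the data type, it can run once per dataset inside a pipeline step, after the source files have been opened and before the Zarr store is written.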

We welcome new dataset requests to scale this operation further and increase the impact of the cloud-based CMIP6 data. Despite our success in ingesting these datasets, I will highlight the need for future CMIP generations to be delivered in a cloud-native way, so that efforts like this are no longer needed.

  • 20 minutes - Community Showcase
  • 40 minutes - Showcase discussion/Community check-ins