Pangeo CMIP6 Catalog

Hello everyone,

I am using intake-esm and the Pangeo CMIP6 catalog to access CMIP6 model output, which is working great so far. So, first off, thanks to everyone involved in creating these tools!

I have a question about the completeness of the catalog with respect to the data on the ESGF server. Specifically, I am interested in the output of the HadGEM3-GC31-MM model and I am looking for, among other things, surface downwelling longwave radiation data (output variable name rlds). On the ESGF server, I can find 4 datasets (for 4 ensemble members) with daily and monthly output, and one dataset with 3-hourly output. Unfortunately, I am not able to locate the daily and monthly datasets in the Pangeo catalog. The following query results in an empty data frame.

import intake

url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)

# Query: daily rlds from the historical runs of HadGEM3-GC31-MM
query = dict(
    activity_id="CMIP",
    experiment_id="historical",
    source_id="HadGEM3-GC31-MM",
    variable_id=["rlds"],
    table_id="day"
)

cat_daily = col.search(**query)
cat_daily.df
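
One way to check which frequencies the catalog does hold for this variable is to drop table_id from the search and group the matching rows by ensemble member. Below is a minimal sketch of that grouping step, assuming the catalog rows carry source_id, variable_id, member_id and table_id columns. The helper tables_by_member is hypothetical, and the example rows are made up for illustration; they are not real catalog contents.

```python
from collections import defaultdict


def tables_by_member(rows, source_id, variable_id):
    """Return {member_id: sorted list of table_id} for one model and variable.

    `rows` is any iterable of dict-like records with the keys
    source_id, variable_id, member_id and table_id (assumed column names).
    """
    found = defaultdict(set)
    for row in rows:
        if row["source_id"] == source_id and row["variable_id"] == variable_id:
            found[row["member_id"]].add(row["table_id"])
    return {member: sorted(tables) for member, tables in found.items()}


# Synthetic rows for illustration only (NOT real catalog contents).
example_rows = [
    {"source_id": "HadGEM3-GC31-MM", "variable_id": "rlds",
     "member_id": "r1i1p1f3", "table_id": "3hr"},
    {"source_id": "HadGEM3-GC31-MM", "variable_id": "rsds",
     "member_id": "r1i1p1f3", "table_id": "day"},
    {"source_id": "HadGEM3-GC31-MM", "variable_id": "rsds",
     "member_id": "r1i1p1f3", "table_id": "Amon"},
]

print(tables_by_member(example_rows, "HadGEM3-GC31-MM", "rlds"))
```

With the real catalog, the rows could come from something like col.search(source_id="HadGEM3-GC31-MM", variable_id="rlds").df.to_dict("records"). An empty result would mean the variable is not in the catalog at any frequency for this model.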

All other variables I am interested in (including surface downward shortwave radiation) are available with at least daily resolution. Therefore, I am wondering what the criteria are for datasets to be included in the Pangeo catalog.

The complete ESGF CMIP6 catalog is ~ 20 PB. We have about 1 PB of data in Zarr format.

The data were populated based on a user-request form that is no longer supported. The process of ingesting data into the cloud is manual and has relied on the heroic efforts of a scientist who has since retired.

We are trying to transition to a more sustainable system for continuing to expand the cloud data. It’s a big job because of the size and complexity of CMIP6.

You can get more details here: Pangeo / ESGF Cloud Data Working Group documentation

If you’re interested in joining the working group to help find a solution, you are welcome!

Thanks for the explanation, @rabernat! I figured that it had something to do with storage capacity. I guess in my case it’s easiest to just pull the missing dataset straight from ESGF.
