Google Storage gs:// URLs for Pangeo datasets on GCS

Hi Pangeo team,

I have been following the rechunker tutorial and am trying to rechunk data into my personal Google Cloud bucket. However, I would like to use the GFDL CM2.6 data here instead of the Copernicus Marine Environment data used in the example. The tutorial gives a URL for that dataset (‘gs://pangeo-cmems-duacs’), but I don’t know where this URL comes from, or how to find the corresponding GCS URL for any other dataset.

Where can I find the Google Storage URLs for other Pangeo datasets (in particular the GFDL CM2.6 ocean surface datasets)?

Thanks,

Andrew


Hi @andrewbrettin – thanks for this interesting question.

The current “official” Pangeo catalog is an Intake catalog and is managed here:


And the catalog for CM2.6 is here:

This is turned into a website here:

We intend the data to be used via intake, e.g.


from intake import open_catalog

# Open the CM2.6 catalog, then lazily load one entry as a dask-backed xarray Dataset
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean/GFDL_CM2.6.yaml")
ds = cat["GFDL_CM2_6_control_ocean"].to_dask()

However, your question reveals two problems with this approach:

  • If you don’t want to open the data with xarray / dask but would rather open it directly with zarr, or even just want to know the actual URL on cloud storage, intake doesn’t make that easy for you
  • The catalog website also does not make that information obvious

These are two concrete things we could try to improve going forward.
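
In the meantime, one possible workaround is to read the catalog YAML yourself and pull out the storage paths. The sketch below assumes the catalog file follows the usual Intake layout, with a top-level `sources` mapping whose entries keep the cloud path under `args.urlpath`. The helper names `extract_urlpaths` and `load_catalog_yaml` are made up for illustration, and the bucket path in the example dictionary is a placeholder, not the real CM2.6 location.

```python
from typing import Any, Dict


def extract_urlpaths(catalog: Dict[str, Any]) -> Dict[str, str]:
    """Map each source name in a parsed catalog to its `args.urlpath`, if it has one."""
    urlpaths = {}
    for name, entry in (catalog.get("sources") or {}).items():
        args = entry.get("args") or {}
        urlpath = args.get("urlpath")
        if isinstance(urlpath, str):
            urlpaths[name] = urlpath
    return urlpaths


def load_catalog_yaml(url: str) -> Dict[str, Any]:
    """Download and parse a catalog file (needs network access and PyYAML)."""
    import urllib.request
    import yaml  # third-party: PyYAML

    with urllib.request.urlopen(url) as response:
        return yaml.safe_load(response.read())


# Hypothetical parsed catalog; the bucket path is a placeholder, not the real one.
example_catalog = {
    "sources": {
        "GFDL_CM2_6_control_ocean": {
            "driver": "zarr",
            "args": {"urlpath": "gs://example-bucket/placeholder/control/ocean"},
        },
        # Entries without a urlpath (e.g. nested catalogs) are skipped.
        "entry_without_urlpath": {"driver": "zarr", "args": {}},
    }
}

print(extract_urlpaths(example_catalog))
```

Against the real catalog, this would be called as `extract_urlpaths(load_catalog_yaml(url))`, with `url` set to the raw GitHub URL passed to `open_catalog` above. The resulting gs:// path should then be usable as a source for rechunker or zarr, for example through an fsspec mapper.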

I hope this helps.
