Looking for the best way to complement, rather than compete with, this community, but commercially

rabernat · April 7, 2021, 3:17pm

This is an interesting, concrete example of the kind of convenience you are trying to provide through your platform. I think it’s worth exploring the best way to achieve this goal. Your assertion is that we should generate yet another copy of the CMIP6 data in an even more analysis ready format, because the existing Zarr cloud data is not convenient enough. This community has already worked quite hard to bring CMIP6 to the cloud in Zarr, precisely with the goal of providing a more user-friendly analysis experience! So your comments suggest that that effort has, so far, not been a total success; the data remain inaccessible and hard to work with.

There are a couple of different ways we could make data access more convenient while working within the open-source / open-data framework to improve how the data are stored and accessed on the cloud. With regard to your specific example, the data are difficult to slice in space (within +/- 15 deg) because of chunking. The CMIP6 Zarr data are generally chunked in time and contiguous in space. As you correctly state, this means you generally have to download the whole dataset if you want a timeseries at just a single point. But there are many specific ways we could improve this situation:

Recent technical advances in Zarr, specifically partial chunk reads, could overcome this limitation.
Using Caterva within Zarr could have a similar impact.
We could try TileDB instead of Zarr, which is supposed to handle this situation better.
We could use much smaller chunks to make slicing easier. This is feasible because of async support in Zarr, another recent enhancement that came out of this community.
(Related) We could generate a rechunked copy of some datasets to support different modes of analysis. Rechunking data has been a major topic of discussion in this forum.

All of these are feasible if there is engineering effort devoted to them.
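To make the chunking trade-off concrete, here is a rough back-of-the-envelope sketch in plain Python. Everything in it is an assumption for illustration, not a description of any actual CMIP6 store: the array shape (1980 monthly steps on a 1-degree grid), both chunk layouts, the float32 item size, and the helper names `chunks_touched` and `bytes_read` are all hypothetical. It also counts uncompressed bytes, so real transfer sizes would be smaller.

```python
from math import prod

# Hypothetical dataset: 1980 monthly time steps on a 180 x 360 grid, float32.
SHAPE = (1980, 180, 360)   # (time, lat, lon), assumed for illustration
ITEMSIZE = 4               # bytes per float32 value

# Two assumed chunk layouts for the same array.
TIME_CHUNKED = (198, 180, 360)   # chunked in time, contiguous in space
SPACE_CHUNKED = (1980, 10, 10)   # contiguous in time, chunked in space


def chunks_touched(chunks, selection):
    """Count the chunks that intersect a rectangular selection.

    `selection` holds one (start, stop) pair per dimension, stop exclusive.
    """
    n = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size
        last = (stop - 1) // size
        n *= last - first + 1
    return n


def bytes_read(chunks, selection):
    """Uncompressed bytes fetched when whole chunks must be read."""
    return chunks_touched(chunks, selection) * prod(chunks) * ITEMSIZE


# Full timeseries at a single grid point.
point_series = [(0, SHAPE[0]), (90, 91), (200, 201)]
# Full global map at a single time step.
single_map = [(0, 1), (0, SHAPE[1]), (0, SHAPE[2])]

for name, layout in [("time-chunked", TIME_CHUNKED), ("space-chunked", SPACE_CHUNKED)]:
    for label, sel in [("point timeseries", point_series), ("single-step map", single_map)]:
        print(f"{name:14s} {label:17s} "
              f"{chunks_touched(layout, sel):4d} chunks, "
              f"{bytes_read(layout, sel) / 1e6:8.1f} MB")
```

Under these assumptions, the point timeseries reads the entire array (about 513 MB) from the time-chunked layout but under 1 MB from the space-chunked one, and the situation reverses for a single-step map. That is why a rechunked copy, smaller chunks, or partial chunk reads each help a different access pattern.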

As for improving search and discoverability of the data, that is a major emphasis of the CMIP6 cloud / ESGF group, with many different options being considered (STAC, ElasticSearch, etc.). Creating a more user-friendly front-end would be a great service to the community.

Alternatively, one could imagine ingesting the data into a proprietary data platform, where this convenience is provided by a black box, for a fee. Perhaps it’s obvious at this point which route I favor for my own efforts. I blogged about it here:

It would be great to identify areas where contributing to the open-source foundations (Xarray, Zarr, STAC, etc.) of this CMIP6-in-the-cloud effort would benefit both your efforts and the broader science community that laid those foundations. I tried to enumerate some very specific ideas above. To circle back to the original topic, devoting some of your company’s resources to the open-source stack and open data archive would be a great way to “complement, rather than compete with, this community”.

As a concrete step forward, we could start exploring whether any of the ideas I enumerated above could help lower the friction for timeseries analysis of the “relatively small subset of experiments, models and variables” that you’re interested in.
