I wanted to just jot down this idea I have had for a while. Looking for feedback and pointers to existing solutions.
Within Pangeo, and especially as Pangeo Forge comes online, we are going to have big datasets sitting in many different locations: AWS S3, Google Cloud Storage, Azure, Open Storage Network, Wasabi, maybe even Filecoin / IPFS. If I am working on a cluster in Google Cloud, I may want to be pulling data from OSN or S3. There may be egress charges involved, or at a minimum, performance will be slower across cloud boundaries. Pangeo workflows often involve many passes through data in object storage with a Dask cluster. During these workflows, I should never have to pull the data across the cloud boundary more than once in a single session. Instead, the data should be cached from far-away object storage to storage proximate to my Hub. Moreover, if other users are working with the same [public] data at the same time, we should only have to cache one copy.
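For concreteness, here is a minimal sketch of the kind of workflow I mean: xarray plus Dask reading a Zarr store straight from remote object storage. The bucket, variable, and path names are made up; the point is that each pass over the data re-reads chunks from the remote store.

```python
import fsspec
import xarray as xr

# Hypothetical public Zarr store sitting in a far-away cloud (bucket name made up).
store = fsspec.get_mapper("s3://some-far-away-bucket/sea-surface-temp.zarr", anon=True)
ds = xr.open_zarr(store, consolidated=True)  # lazy: variables become Dask arrays

# Each compute() re-reads the chunks it needs from object storage, so without a
# nearby cache the same bytes can cross the cloud boundary repeatedly.
monthly_clim = ds["sst"].groupby("time.month").mean("time").compute()
anomalies = (ds["sst"].groupby("time.month") - monthly_clim).compute()
```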
I was starting to think about how we might implement a custom solution for this, with some sort of caching layer within fsspec. But then I realized that this is exactly what a content delivery network (CDN) does! The difference here is that, rather than trying to serve data to users outside the cloud, we are more interested in serving users inside a particular cloud, with a CDN endpoint close to their compute environment.
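For the fsspec route, the existing `simplecache` / `filecache` filesystems already get partway there. A rough sketch, with a made-up bucket and cache path:

```python
import fsspec

# Chain a local on-disk cache in front of the remote filesystem, so each object
# only crosses the cloud boundary once per cache lifetime. Bucket/paths are made up.
of = fsspec.open(
    "simplecache::s3://some-far-away-bucket/some/file.nc",
    s3={"anon": True},                                    # options for the remote layer
    simplecache={"cache_storage": "/tmp/fsspec_cache"},   # options for the cache layer
)
with of as f:
    header = f.read(1024)  # opening pulls the file into the local cache; later opens reuse it
```

What this doesn't give us is exactly what a CDN would: a shared cache close to the Hub rather than a per-user `/tmp`, and a single cached copy shared across users hitting the same public data.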
I have very little experience with CDNs, so I really don’t know if this is feasible. Some questions I have:
- Can a commercial CDN service like Cloudflare meet this need?
- Are there open-source CDNs that we can deploy ourselves?