Idea: CDN for Pangeo Cloud Data

I wanted to just jot down this idea I have had for a while. Looking for feedback and pointers to existing solutions.

Within Pangeo, and especially as Pangeo Forge comes online, we are going to have different big datasets sitting in many different locations: AWS S3, Google Cloud Storage, Azure, Open Storage Network (OSN), Wasabi, maybe even Filecoin / IPFS. If I am working on a cluster in Google Cloud, I may want to pull data from OSN or S3. There may be egress charges involved, or at minimum, performance will be slower across cloud boundaries. Pangeo workflows often involve many passes through data in object storage with a Dask cluster. During these workflows, I should never have to pull the data across the cloud boundary more than once in a single session. Instead, the data should be cached from far-away object storage to storage proximate to my Hub. Moreover, if other users are working with the same [public] data at the same time, we should only have to cache one copy.

I was starting to think about how we might implement a custom solution for this, with some sort of caching layer within fsspec. But then I realized that this is exactly what a content delivery network (CDN) does! The difference here is that, rather than trying to serve data to users outside the cloud, we are more interested in serving users inside a particular cloud, with a CDN endpoint close to their compute environment.
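
To make the caching behavior concrete, here is a rough sketch of a read-through cache in plain Python. It is not fsspec's API or any CDN's implementation; `ReadThroughCache`, `fake_remote`, and the `s3://example-bucket/...` URL are all hypothetical names made up for illustration. The cache directory stands in for storage close to the Hub, and the counter shows that repeated reads cross the "cloud boundary" only once.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Fetch each remote object at most once, then serve it from local storage."""

    def __init__(self, fetch_remote: Callable[[str], bytes], cache_dir: str):
        # fetch_remote is a hypothetical callable that pulls bytes from far-away storage
        self.fetch_remote = fetch_remote
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.remote_fetches = 0  # how many times we crossed the cloud boundary

    def _local_path(self, url: str) -> Path:
        # Hash the URL so any object key maps to a safe local file name
        return self.cache_dir / hashlib.sha256(url.encode()).hexdigest()

    def read(self, url: str) -> bytes:
        path = self._local_path(url)
        if not path.exists():
            data = self.fetch_remote(url)
            self.remote_fetches += 1
            # Write to a temporary name first, then rename, so readers never see a partial file
            tmp = path.with_suffix(".tmp")
            tmp.write_bytes(data)
            tmp.replace(path)
        return path.read_bytes()


def fake_remote(url: str) -> bytes:
    # Stand-in for an expensive cross-cloud GET (assumption: the real one would hit S3, OSN, etc.)
    return b"chunk for " + url.encode()


cache = ReadThroughCache(fake_remote, tempfile.mkdtemp())
url = "s3://example-bucket/dataset.zarr/0.0"
results = [cache.read(url) for _ in range(3)]  # three passes over the same chunk
```

After the three reads, `cache.remote_fetches` is 1. If several users pointed at the same cache directory, they would share that single copy. The sketch ignores eviction, invalidation when the upstream object changes, and locking between concurrent writers, which are the parts a real CDN or caching layer would have to handle.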

I have very little experience with CDNs, so I really don’t know if this is feasible. Some questions I have:

  • Can a commercial CDN service like Cloudflare meet this need?
  • Are there open-source CDNs that we can deploy ourselves?

Please add Oracle Cloud to your list - we are actively developing an open data platform and looking for data sets to host. Geoscience is a focus for us and we are very interested in collaborating.

In my experience, most CDNs are optimized for caching small web objects. There are some open source platforms like Kurento and Jitsi that are designed for video streaming, but they generally don’t play well with CDNs, and I’m not sure how they would handle a large data set. Red5 Pro may be something to look at for a multi-cloud or cloud-agnostic approach, but it’s not free or open source and I’m not sure how their pricing works.

It may be simpler to just seed Pangeo data on AWS, GCP, Azure, and Oracle using each platform’s open data stores, and work with each vendor to make sure the data is being updated appropriately.

