I wanted to just jot down this idea I have had for a while. Looking for feedback and pointers to existing solutions.
Within Pangeo, and especially as Pangeo Forge comes online, we are going to have big datasets sitting in many different locations: AWS S3, Google Cloud Storage, Azure, Open Storage Network, Wasabi, maybe even Filecoin / IPFS. If I am working on a cluster in Google Cloud, I may want to be pulling data from OSN or S3. There may be egress charges involved, or at a minimum, performance will be slower across cloud boundaries. Pangeo workflows often involve many passes through data in object storage with a Dask cluster. During these workflows, I should never have to pull the data across the cloud boundary more than once in a single session. Instead, the data should be cached from far-away object storage to storage proximate to my Hub. Moreover, if other users are working with the same [public] data at the same time, we should only have to cache one copy.
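For concreteness, here is a minimal sketch of the kind of workflow I mean: xarray plus Dask reading a Zarr store straight from remote object storage. The bucket, variable, and path names are made up; the point is that each pass over the data re-reads chunks from the remote store.

```python
import fsspec
import xarray as xr

# Hypothetical public Zarr store sitting in a far-away cloud (bucket name made up).
store = fsspec.get_mapper("s3://some-far-away-bucket/sea-surface-temp.zarr", anon=True)
ds = xr.open_zarr(store, consolidated=True)  # lazy: variables become Dask arrays

# Each compute() re-reads the chunks it needs from object storage, so without a
# nearby cache the same bytes can cross the cloud boundary repeatedly.
monthly_clim = ds["sst"].groupby("time.month").mean("time").compute()
anomalies = (ds["sst"].groupby("time.month") - monthly_clim).compute()
```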
I was starting to think about how we might implement a custom solution for this, with some sort of caching layer within fsspec. But then I realized that this is exactly what a content delivery network (CDN) does! The difference here is that, rather than trying to serve data to users outside the cloud, we are more interested in serving users inside a particular cloud, with a CDN endpoint close to their compute environment.
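For the fsspec route, the existing `simplecache` / `filecache` filesystems already get partway there. A rough sketch, with a made-up bucket and cache path:

```python
import fsspec

# Chain a local on-disk cache in front of the remote filesystem, so each object
# only crosses the cloud boundary once per cache lifetime. Bucket/paths are made up.
of = fsspec.open(
    "simplecache::s3://some-far-away-bucket/some/file.nc",
    s3={"anon": True},                                    # options for the remote layer
    simplecache={"cache_storage": "/tmp/fsspec_cache"},   # options for the cache layer
)
with of as f:
    header = f.read(1024)  # opening pulls the file into the local cache; later opens reuse it
```

What this doesn't give us is exactly what a CDN would: a shared cache close to the Hub rather than a per-user `/tmp`, and a single cached copy shared across users hitting the same public data.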
I have very little experience with CDNs, so I really don’t know if this is feasible. Some questions I have:
- Can a commercial CDN service like Cloudflare meet this need?
- Are there open-source CDNs that we can deploy ourselves?