This comes up a lot, and I feel like I never have a satisfying answer: “How should I make a moderately-sized dataset (1–100 GB) publicly available?”
I thought a Discourse post could be good for the record, so I’m stealing some good ideas and resources below from @rabernat, @joshmoore, and others: https://twitter.com/clifgray/status/1391828799105478663
- AWS S3 or Google Cloud Storage. Hosting 2 GB costs about $1/year. (Cons: tricky to configure your own cloud account and bucket permissions appropriately; if the bucket is not ‘requester pays’, you pay indeterminate egress and API request fees. See the sketch after this list.)
- A great comparison of current options with their caveats, pros, and cons (Zenodo, Figshare, Dryad, OSF): The best free Research Data Repository - Dmytro Kryvokhyzha
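For the cloud-bucket option above, here is a minimal sketch of reading a public Zarr store from S3 with fsspec/s3fs and xarray (the bucket and path are hypothetical placeholders, and s3fs needs to be installed):
import fsspec
import xarray as xr
# anon=True reads public objects without AWS credentials (s3fs keyword);
# the bucket/key below is a hypothetical placeholder
mapper = fsspec.get_mapper('s3://some-public-bucket/air_temperature.zarr', anon=True)
ds = xr.open_zarr(mapper, consolidated=True)
# for a 'requester pays' bucket, pass requester_pays=True plus valid credentials instead of anon=True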
I’m hoping that in this thread we can capture basic code examples of opening datasets stored in these various repositories.
For example, here is a quick example I put together for Zarr data served via a GitHub repo with a Zenodo DOI (the major caveats here are that the dataset needs to be <1 GB and chunks <100 MB).
import xarray as xr
import fsspec  # used under the hood for HTTP access to the store

# Zarr store served as static files from a GitHub repo
uri = 'https://scottyhq.github.io/zarrdata/air_temperature.zarr'

# lazy open; consolidated metadata avoids many small HTTP requests
ds = xr.open_dataset(uri, engine="zarr", consolidated=True)
ds.air.isel(time=1).plot(x="lon")
Repository: GitHub - scottyhq/zarrdata: quick test of hosting a zarr dataset
This is a great topic.
I think we could develop a Zarr / Zenodo integration (zarrnodo?) that allows you to store a single Zarr group in a Zenodo store.
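One way something like this could already be approximated today: upload the Zarr store to Zenodo as a single zip archive and read it lazily over HTTP with fsspec’s zip filesystem. A rough, untested sketch (the record URL and filename below are hypothetical placeholders):
import fsspec
import xarray as xr

# hypothetical Zenodo record URL pointing at a Zarr store zipped into one file
url = 'https://zenodo.org/record/1234567/files/air_temperature.zarr.zip'

# fsspec's zip filesystem reads the archive over HTTP using range requests
fs = fsspec.filesystem('zip', fo=url)
ds = xr.open_zarr(fs.get_mapper(''), consolidated=True)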
I’ll leave the naming to you! But Martin was fairly confident: zarr-developers/community - Gitter
It would also be great if we could share Zarr datasets easily via Google Drive / Dropbox!
We started gdrivefs for this, but it has not really been tested and I’m not sure if it works.
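If it does work, using it from xarray would presumably look something like this (a sketch only: the token mode, the root_file_id argument, and the folder ID are assumptions, so check the gdrivefs README for the actual authentication options):
import fsspec
import xarray as xr

# argument names/values here are assumptions -- consult the gdrivefs docs;
# 'FOLDER_ID' is a hypothetical placeholder for the shared Drive folder's ID
fs = fsspec.filesystem('gdrive', token='browser', root_file_id='FOLDER_ID')
ds = xr.open_zarr(fs.get_mapper('air_temperature.zarr'), consolidated=True)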
Thanks for bringing this over from Twitter to the Pangeo discourse @scottyhq! I’ve come up against this a few times now where I’ve got workflows either for teaching or actual analysis and the primary distributor is so slow and poorly structured that it’s prohibitive to share. The ability to pull ~10 GB subsets as zarr from Zenodo or Google Drive as @rabernat recommended would be super useful. I did just test gdrivefs and it worked for me after a little tinkering, though it was relatively slow to pull down from GDrive: 7 min for 150 MB. I suppose that could’ve been partially due to Binder’s network access as well as a GDrive bottleneck. I can test some more later!
That fits with my experience, which is I think why I stopped working on it. I don’t know why it’s so slow, but since this is not really an “approved” use of Google Drive, it’s unlikely we’ll get any support from Google debugging it.
Someone should really try the same thing with Dropbox / Box / OneDrive / etc. to see if any of them works well.
I’ve been using mostly OneDrive for Business / SharePoint and it appears comparatively fast: more than 1 GB/minute.
Drive is slow and flaky in my experience, certainly when talking GB scale.