This comes up a lot, and I feel like I never have a satisfying answer: “How should I make a moderately-sized dataset (1–100 GB) publicly available?”
I thought a Discourse post could be good for the record, so I’m stealing some good ideas and resources below from @rabernat, @joshmoore, and others: https://twitter.com/clifgray/status/1391828799105478663
- AWS S3 or Google Cloud Storage. Hosting 2 GB costs about $1/year. (Cons: tricky to configure your own cloud account and bucket permissions appropriately; if the bucket is not ‘requester pays’, you pay indeterminate egress and API request fees. See the sketch after this list.)
- A great comparison of current options with their caveats, pros, and cons (Zenodo, Figshare, Dryad, OSF): The best free Research Data Repository - Dmytro Kryvokhyzha
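For the cloud-bucket option above, here is a minimal sketch of reading a public Zarr store from S3 with fsspec/s3fs and xarray (the bucket and path are hypothetical placeholders, and s3fs needs to be installed):
import fsspec
import xarray as xr
# anon=True reads public objects without AWS credentials (s3fs keyword);
# the bucket/key below is a hypothetical placeholder
mapper = fsspec.get_mapper('s3://some-public-bucket/air_temperature.zarr', anon=True)
ds = xr.open_zarr(mapper, consolidated=True)
# for a 'requester pays' bucket, pass requester_pays=True plus valid credentials instead of anon=True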
I’m hoping that in this thread we can capture basic code examples of opening datasets stored in these various repositories.
For example, here is a quick example I put together for Zarr data served via a GitHub repo with a Zenodo DOI (the major caveats here are that the dataset needs to be <1 GB and chunks <100 MB).
import xarray as xr
import fsspec  # used under the hood for HTTP access to the store

# Zarr store served as static files from a GitHub repo
uri = 'https://scottyhq.github.io/zarrdata/air_temperature.zarr'

# lazy open; consolidated metadata avoids many small HTTP requests
ds = xr.open_dataset(uri, engine="zarr", consolidated=True)
ds.air.isel(time=1).plot(x="lon")
Repository: GitHub - scottyhq/zarrdata: quick test of hosting a zarr dataset
This is a great topic.
I think we could develop a Zarr / Zenodo integration (zarrnodo?) that allows you to store a single Zarr group in a Zenodo store.
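One way something like this could already be approximated today: upload the Zarr store to Zenodo as a single zip archive and read it lazily over HTTP with fsspec’s zip filesystem. A rough, untested sketch (the record URL and filename below are hypothetical placeholders):
import fsspec
import xarray as xr

# hypothetical Zenodo record URL pointing at a Zarr store zipped into one file
url = 'https://zenodo.org/record/1234567/files/air_temperature.zarr.zip'

# fsspec's zip filesystem reads the archive over HTTP using range requests
fs = fsspec.filesystem('zip', fo=url)
ds = xr.open_zarr(fs.get_mapper(''), consolidated=True)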
I’ll leave the naming to you! But Martin was fairly confident: zarr-developers/community - Gitter
It would also be great if we could share Zarr datasets easily via Google Drive / Dropbox!
We started gdrivefs for this, but it has not really been tested and I’m not sure if it works.
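If it does work, using it from xarray would presumably look something like this (a sketch only: the token mode, the root_file_id argument, and the folder ID are assumptions, so check the gdrivefs README for the actual authentication options):
import fsspec
import xarray as xr

# argument names/values here are assumptions -- consult the gdrivefs docs;
# 'FOLDER_ID' is a hypothetical placeholder for the shared Drive folder's ID
fs = fsspec.filesystem('gdrive', token='browser', root_file_id='FOLDER_ID')
ds = xr.open_zarr(fs.get_mapper('air_temperature.zarr'), consolidated=True)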
Thanks for bringing this over from Twitter to the Pangeo discourse @scottyhq! I’ve come up against this a few times now where I’ve got workflows either for teaching or actual analysis and the primary distributor is so slow and poorly structured that it’s prohibitive to share. The ability to pull ~10 GB subsets as zarr from Zenodo or Google Drive as @rabernat recommended would be super useful. I did just test gdrivefs and it worked for me after a little tinkering, though it was relatively slow to pull down from GDrive: 7 min for 150 MB. I suppose that could’ve been partially due to Binder’s network access as well as a GDrive bottleneck. I can test some more later!
That fits with my experience, which is I think why I stopped working on it. I don’t know why it’s so slow, but since this is not really an “approved” use of Google Drive, it’s unlikely we’ll get any support from Google debugging it.
Someone should really try the same thing with Dropbox / Box / OneDrive / etc. to see if any of them works well.
I’ve been using mostly OneDrive for Business / SharePoint and it appears comparatively fast: more than 1 GB/minute.
Drive is slow and flaky in my experience, certainly when talking GB scale.