This comes up a lot, and I feel like I never have a satisfying answer: “How should I make a moderately-sized dataset (1–100 GB) publicly available?”
I thought a Discourse post could be good for the record, so I'm stealing some good ideas and resources below from @rabernat, @joshmoore, and others: https://twitter.com/clifgray/status/1391828799105478663
AWS S3 or Google Cloud Storage. Hosting 2 GB costs about $1/year. (Cons: it's tricky to configure your own cloud account and bucket permissions appropriately, and unless the bucket is 'requester pays', you pay indeterminate egress and API request fees.) See the sketch below for what reading from such a bucket looks like.
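To make the 'requester pays' point concrete, here's a minimal sketch of opening a Zarr store from such an S3 bucket with fsspec/s3fs. The bucket name and path are placeholders, not a real dataset:

```python
import fsspec
import xarray as xr

# Hypothetical bucket/path for illustration only.
# requester_pays=True means the *reader's* AWS credentials are billed
# for egress and API requests, rather than the data host.
fs = fsspec.filesystem("s3", requester_pays=True)
store = fs.get_mapper("my-bucket/my-dataset.zarr")
ds = xr.open_dataset(store, engine="zarr", consolidated=True)
```

For a fully public bucket you'd instead pass `anon=True` (s3fs) or `token="anon"` (gcsfs), in which case the host pays the egress.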
A great comparison of current options (Zenodo, Figshare, Dryad, OSF) with their pros, cons, and caveats: “The best free Research Data Repository” by Dmytro Kryvokhyzha.
I’m hoping that in this thread we can capture basic code examples of opening datasets stored in these various repositories.
For example, here is a quick example I put together for Zarr data served via a GitHub repo with a Zenodo DOI (the major caveats here are that the dataset needs to be <1 GB and individual chunks <100 MB):
```python
import xarray as xr
import fsspec  # not used directly, but required by xarray for HTTP access

# Zarr store served as static files over HTTPS via GitHub Pages
uri = 'https://scottyhq.github.io/zarrdata/air_temperature.zarr'

# consolidated=True reads all metadata in one request
# instead of one request per array
ds = xr.open_dataset(uri, engine="zarr", consolidated=True)

# Plot a single time slice
ds.air.isel(time=1).plot(x="lon")
```
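In the same spirit, here's a hedged sketch for data deposited directly on Zenodo as a single NetCDF file; the record ID and filename below are placeholders, not a real deposit. Zenodo serves plain files over HTTPS, so the simplest pattern is download-then-open (pooch handles caching and checksum verification):

```python
import pooch
import xarray as xr

# Placeholder record ID and filename -- substitute a real Zenodo deposit.
fname = pooch.retrieve(
    url="https://zenodo.org/record/1234567/files/air_temperature.nc",
    known_hash=None,  # pass the file's checksum in real use to verify integrity
)
ds = xr.open_dataset(fname)
```

Note this downloads the whole file up front, unlike the chunked/lazy access you get from the Zarr examples above.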