When Pangeo got its first funding from the NSF Earthcube project (starting in October 2017), we received $100K of credits on Google Cloud Platform via the NSF BIGDATA program. We knew almost nothing about cloud, and needless to say, we have learned a lot.
When we were first starting out, we took a liberal approach to costs. Our goal was to experiment and learn. We gave dozens of people write access to our cloud storage, and we accumulated hundreds of different datasets. Almost all of these went into a single Google Cloud Storage bucket: `pangeo-data`.
That NSF project, and the cloud credits that came with it, are coming to an end. We need to take stock of the cloud data we have accumulated and make a plan for transitioning to a more sustainable future. This will involve moving and/or deleting much of the data we currently own. In particular, we plan to delete the `pangeo-data` bucket and transition to a policy of "one bucket per data provider."
An important step is to catalog the data in our intake data catalog. We would like all our data to eventually end up in the catalog, rather than just living in a random, undiscoverable location in cloud storage. Our current catalog is here:
https://pangeo-data.github.io/pangeo-datastore/
@jhamman has compiled a list of datasets that are in our buckets but not in our catalog. We have put this information into a Google spreadsheet:
Our goal over the next week is to categorize all these datasets into one of three categories:
- Remove
- Move to new bucket (not `pangeo-data`) and add to catalog
- Keep in existing bucket and add to catalog
Below is a summary of the unaccounted-for datasets. If you recognize any of these, please open the Google sheet and claim it or help decide its fate.
All unclaimed datasets will be deleted on Friday, Nov. 22.
When deciding whether a dataset should be kept or deleted, we should consider several factors:
- How big / expensive is it?
- Is it being actively used?
- Can the data easily be regenerated from another public archive?
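To make the "how big / expensive" question concrete, here is a minimal sketch that totals object sizes per bucket and top-level directory and converts them to a rough monthly storage cost. Everything named here is an assumption for illustration: the `summarize_by_directory` helper, the example listing, and especially the price per GB-month, which is a placeholder and not a quoted Google Cloud rate. In practice the `(path, size)` pairs would come from whatever storage client you use to list the bucket.

```python
from collections import defaultdict

# Assumption: placeholder price in USD per GB per month. Check current pricing
# for your storage class and region before relying on any number from this.
ASSUMED_USD_PER_GB_MONTH = 0.026


def summarize_by_directory(objects, usd_per_gb_month=ASSUMED_USD_PER_GB_MONTH):
    """Total the size of (path, size_in_bytes) pairs per (bucket, directory).

    Paths are expected to look like "bucket/directory/rest/of/key".
    Returns a dict keyed by (bucket, directory) with size and estimated cost.
    """
    totals = defaultdict(int)
    for path, size in objects:
        parts = path.split("/")
        bucket = parts[0]
        directory = parts[1] if len(parts) > 1 else ""
        totals[(bucket, directory)] += size

    summary = {}
    for key, nbytes in totals.items():
        gigabytes = nbytes / 1e9
        summary[key] = {
            "gigabytes": gigabytes,
            "usd_per_month": gigabytes * usd_per_gb_month,
        }
    return summary


# Hypothetical listing with made-up sizes, for illustration only.
example_listing = [
    ("pangeo-data/test/chunk-0", 2_000_000_000),
    ("pangeo-data/test/chunk-1", 3_000_000_000),
    ("pangeo-ecco/llc/chunk-0", 1_000_000_000),
]

if __name__ == "__main__":
    for (bucket, directory), info in sorted(summarize_by_directory(example_listing).items()):
        print(f"{bucket}/{directory}: {info['gigabytes']:.1f} GB, "
              f"~${info['usd_per_month']:.2f}/month (assumed price)")
```

Storage cost is only part of the picture: a small dataset that is never read may still be worth deleting, and a large one that many people use may be worth keeping, so treat the number as one input alongside the other two questions.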
Summary of Unclaimed Datasets
Bucket | Directory | # datasets |
---|---|---|
pangeo-data | CMIP5-ts | |
pangeo-data | CMIP6 | 1 |
pangeo-data | CMIP6-test | 3 |
pangeo-data | ECCO.zarr | 1 |
pangeo-data | ECCO_chank.zarr | 1 |
pangeo-data | ECCO_layers.zarr | 1 |
pangeo-data | GEOS_V2p1 | 18 |
pangeo-data | GEOS_V2p1_L | 18 |
pangeo-data | NATL60-CJM165-SSH-1h-1m2deg2deg | 1 |
pangeo-data | NATL60-CJM165-SSH-1h-2D | 1 |
pangeo-data | NATL60-CJM165-SSU-1h-1m2deg2deg | 1 |
pangeo-data | NATL60-CJM165-SSV-1h-1m2deg2deg | 1 |
pangeo-data | SLaPS | 216 |
pangeo-data | amazonas | 1 |
pangeo-data | avhrr-patmos-x-cloudprops-noaa-asc-fc_TESTING4 | 1 |
pangeo-data | balwada | 4 |
pangeo-data | cesm | 4 |
pangeo-data | channel | 1 |
pangeo-data | cm2.6 | 3 |
pangeo-data | dataset-duacs-rep-global-merged-allsat-phy-l4-v3 | 1 |
pangeo-data | eNATL60-BLB002-SSU-1h | 1 |
pangeo-data | eNATL60-BLB002-SSV-1h | 1 |
pangeo-data | eNATL60-BLBT02-SSU-1h | 1 |
pangeo-data | eNATL60-BLBT02-SSV-1h | 1 |
pangeo-data | eNATL60-BLBT02X-ssh | 2 |
pangeo-data | eNATL60-I | 3 |
pangeo-data | eORCA025-I | 1 |
pangeo-data | eORCA1-I | 1 |
pangeo-data | ecco | 11 |
pangeo-data | esgf_test | 1 |
pangeo-data | gpm_imerg | 4 |
pangeo-data | gross | 1372 |
pangeo-data | kai-llc4320-vertical-fluxes | 7 |
pangeo-data | llc4320 | 1 |
pangeo-data | model_vars_5day_av_zarr | 1 |
pangeo-data | netcdf_test_data | 1 |
pangeo-data | polar | 4 |
pangeo-data | pyqg | 10 |
pangeo-data | rsignell | 17 |
pangeo-data | storage-benchmarks | 3 |
pangeo-data | test | 1 |
pangeo-data | tracer_vars_5day_av_zarr | 1 |
pangeo-data | tracmip | 7067 |
pangeo-data | xESMF_test | 3 |
pangeo-data | zarr-eNATL60 | 6 |
pangeo-data | zarr_NATL60-CJM165_SSU_1h_y2013m07-09 | 1 |
pangeo-ecco | llc | 3 |
pangeo-ocean-ml | LLC4320 | 124 |
pangeo-parcels | med_sea_connectivity_v2019.09.11.2 | 2 |