Cleaning out the pangeo-data Google Cloud Storage bucket

When Pangeo got its first funding from the NSF EarthCube project (starting in October 2017), we received $100K of credits on Google Cloud Platform via the NSF BIGDATA program. We knew almost nothing about the cloud, and needless to say, we have learned a lot.

When we were first starting out, we took a liberal approach to costs. Our goal was to experiment and learn. We gave dozens of people write access to our cloud storage, and we accumulated hundreds of different datasets. Almost all of these went into a single Google Cloud Storage bucket: pangeo-data.

That NSF project, and the cloud credits that came with it, are coming to an end. We need to take stock of the cloud data we have accumulated and make a plan for transitioning to a more sustainable future. This will involve moving and/or deleting much of the data we currently own. In particular, we plan to delete the pangeo-data bucket and transition to a policy of “one bucket per data provider.”

An important step is to catalog the data in our intake data catalog. We would like all our data to eventually end up in the catalog, rather than living in random, undiscoverable locations in cloud storage. Our current catalog is here:
https://pangeo-data.github.io/pangeo-datastore/
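
For reference, the catalog can also be opened programmatically with intake. A minimal sketch, assuming intake is installed (the URL and the ocean sub-catalog name follow the intake-catalogs layout of the pangeo-datastore repo):

```python
import intake

# Open the master intake catalog from the pangeo-datastore repo
cat = intake.open_catalog(
    "https://raw.githubusercontent.com/pangeo-data/"
    "pangeo-datastore/master/intake-catalogs/master.yaml"
)
print(list(cat))        # top-level sub-catalogs (e.g. ocean, atmosphere)
print(list(cat.ocean))  # entries in the ocean sub-catalog
```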

@jhamman has compiled a list of datasets that are in our bucket but not in our catalog. We have put this information into a Google spreadsheet.

Our goal over the next week is to categorize all these datasets into one of three categories:

  • Remove
  • Move to new bucket (not pangeo-data) and add to catalog
  • Keep in existing bucket and add to catalog

Below is a summary of the unaccounted-for datasets. If you recognize any of these, please open up the Google sheet and claim it or help decide its fate.

All unclaimed datasets will be deleted on Friday, Nov. 22.

When deciding whether a dataset should be kept or deleted, we should consider several factors:

  • How big/expensive is it?
  • Is it being actively used?
  • Can the data easily be regenerated from another public archive?

Summary of Unclaimed Datasets

bucket            name                                              # datasets
pangeo-data       CMIP5-ts
pangeo-data       CMIP6                                             1
pangeo-data       CMIP6-test                                        3
pangeo-data       ECCO.zarr                                         1
pangeo-data       ECCO_chank.zarr                                   1
pangeo-data       ECCO_layers.zarr                                  1
pangeo-data       GEOS_V2p1                                         18
pangeo-data       GEOS_V2p1_L                                       18
pangeo-data       NATL60-CJM165-SSH-1h-1m2deg2deg                   1
pangeo-data       NATL60-CJM165-SSH-1h-2D                           1
pangeo-data       NATL60-CJM165-SSU-1h-1m2deg2deg                   1
pangeo-data       NATL60-CJM165-SSV-1h-1m2deg2deg                   1
pangeo-data       SLaPS                                             216
pangeo-data       amazonas                                          1
pangeo-data       avhrr-patmos-x-cloudprops-noaa-asc-fc_TESTING4    1
pangeo-data       balwada                                           4
pangeo-data       cesm                                              4
pangeo-data       channel                                           1
pangeo-data       cm2.6                                             3
pangeo-data       dataset-duacs-rep-global-merged-allsat-phy-l4-v3  1
pangeo-data       eNATL60-BLB002-SSU-1h                             1
pangeo-data       eNATL60-BLB002-SSV-1h                             1
pangeo-data       eNATL60-BLBT02-SSU-1h                             1
pangeo-data       eNATL60-BLBT02-SSV-1h                             1
pangeo-data       eNATL60-BLBT02X-ssh                               2
pangeo-data       eNATL60-I                                         3
pangeo-data       eORCA025-I                                        1
pangeo-data       eORCA1-I                                          1
pangeo-data       ecco                                              11
pangeo-data       esgf_test                                         1
pangeo-data       gpm_imerg                                         4
pangeo-data       gross                                             1372
pangeo-data       kai-llc4320-vertical-fluxes                       7
pangeo-data       llc4320                                           1
pangeo-data       model_vars_5day_av_zarr                           1
pangeo-data       netcdf_test_data                                  1
pangeo-data       polar                                             4
pangeo-data       pyqg                                              10
pangeo-data       rsignell                                          17
pangeo-data       storage-benchmarks                                3
pangeo-data       test                                              1
pangeo-data       tracer_vars_5day_av_zarr                          1
pangeo-data       tracmip                                           7067
pangeo-data       xESMF_test                                        3
pangeo-data       zarr-eNATL60                                      6
pangeo-data       zarr_NATL60-CJM165_SSU_1h_y2013m07-09             1
pangeo-ecco       llc                                               3
pangeo-ocean-ml   LLC4320                                           124
pangeo-parcels    med_sea_connectivity_v2019.09.11.2                2

Pinging a bunch of people I want to make sure see this: @scottyhq, @rsignell, @naomi-henderson, @jbusecke, @jlesommer, @dhruvbalwada, @davidbrochart.

Perhaps we could also generate an email list of people with storage-admin rights?

The ‘CMIP6’ data is not what we worked from at the hackathon, I assume? Unless that is the case, I have nothing to claim at this point. Thanks for the ping though.

If possible, I’d like to keep the approx. 300 GB (?) pangeo-parcels bucket around. The data are 40 days (~1000 hourly snapshots) of 15,000,000 Lagrangian particles in the Med Sea, and I’m working on ways to generate fast visualisations from these trajectories.

Sounds good @willirath. Can you add that data to the pangeo-datastore catalog?

Hi, I am currently documenting and reorganizing the data I put on the cloud on behalf of the MEOM team in Grenoble, in order to fill in the pangeo-datastore catalog (sorry it took so long). I was wondering whether a dedicated bucket, pangeo-nemo, should be created? Thanks

Yes, it would be very convenient to have a dedicated bucket for each group/project. I will create one for you.

I have created the pangeo-meom bucket for you @Aurelie_ALBERT. Could you please move all your data there? Thanks.

Thanks @rabernat, I’m moving the data today!

Sorry @rabernat, there seems to be an issue with the pangeo-meom bucket. I get the error “BadRequestException: 400 Bucket is requester pays bucket but no user project provided.” each time I try to upload something, or even just run gsutil ls -a gs://pangeo-meom. It must be linked to the bucket’s requester-pays setting, but I cannot see it.

I think you need to specify the user project now: gsutil -u pangeo-181919 ...

https://cloud.google.com/storage/docs/using-requester-pays#using
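
For anyone hitting the same thing from Python, gcsfs can pass the billing project as well. A minimal sketch, assuming a recent gcsfs:

```python
import gcsfs

# Bill requester-pays access to the pangeo-181919 project
# mentioned above
fs = gcsfs.GCSFileSystem(project="pangeo-181919", requester_pays=True)
print(fs.ls("pangeo-meom"))
```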

Ok, now I get “AccessDeniedException: 403 auraoupa@gmail.com does not have serviceusage.services.use access to project 464800473488.”

I’m going to work on uploading the GPM data again. So the gpm_imerg data on pangeo-data can be deleted. Where should I upload now?

@charlesbluca - could you assign a bucket to @davidbrochart?

Hi @rabernat and @jhamman! I am interested in getting a working example available online for the common tasks my group and interested parties want to do with our ROMS output. I checked out your data catalog and I don’t see model output in the ocean subsection. My hope is that we could store some output with Pangeo so that we could then use it in a Pangeo binder notebook for demonstration purposes. Somewhere between 10 and 100 GB would be nice, and we could provide the output as zarr files if that is preferred. Maybe this is bad timing since you are ratcheting the data storage down, but I thought I’d ask.

Alternatively, is there another good approach to having a working example available online? I tried to use hvplot with model output served over the internet and it didn’t work, so it seems that even for demo purposes, the data and the analysis need to be co-located.

Thanks for your post @kthyng! And welcome to the forum!

As you can probably tell, the cloud data catalog is currently very much a work in progress. The organization of the catalog has evolved organically, and we’re happy to see it continue to do so. There are many ocean model datasets, but no specific section for models. If you have a suggestion for how to reorganize things, please feel free to propose it over in the GitHub repo.

In the long term, we would like to see a federation of data providers, rather than having Pangeo own all the data. (But still have this data be accessible through a single catalog.) Part of that process means getting people used to managing their own cloud storage.

In the case of your ROMS data, the cost is very small: 100 GB of storage costs roughly $25 per year.

We could easily absorb this cost and create a new bucket for your data. But beyond the financial cost, this creates a maintenance cost. I’d like to propose an alternative: could you host the data under your own Google Cloud account? When you sign up for Google Cloud, you’ll get $300 worth of free credits, enough to maintain your dataset for 12 years. (To avoid egress costs, put the bucket in “requester pays” mode.) This would give you total ownership of your own data and also teach you something about cloud storage.
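
To make that concrete, here is a minimal sketch of what the upload might look like from Python, assuming xarray and gcsfs are installed; the project ID, bucket name, and file paths below are placeholders, not real resources:

```python
import xarray as xr
import gcsfs

# Open the local ROMS output (placeholder filename)
ds = xr.open_dataset("roms_output.nc")

# Write it as a zarr store to your own bucket
# ("my-gcp-project" and "my-roms-data" are placeholders)
fs = gcsfs.GCSFileSystem(project="my-gcp-project")
store = fs.get_mapper("my-roms-data/roms_example.zarr")
ds.to_zarr(store, consolidated=True)
```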

Once you have the data uploaded, you could then make a PR to add it to the Pangeo catalog.

How does this sound? :smile:

Yes, that sounds fair enough @rabernat. Just signed up for a bucket the other day in fact.

How do you see the other side of things working — the Pangeo binder side? Is it still possible to contribute a notebook to be able to run through Pangeo binder?

Absolutely! Anyone can do this now already, with no special permissions from us. Just go to https://binder.pangeo.io/.

Let us know if you get stuck on anything.

Awesome, thanks! I’ll be working on it.