Cleaning out the pangeo-data google cloud storage bucket

rabernat · November 12, 2019, 3:52am

When Pangeo got its first funding from the NSF Earthcube project (starting in October 2017), we received $100K of credits on Google Cloud Platform via the NSF BIGDATA program. We knew almost nothing about cloud, and needless to say, we have learned a lot.

When we were first starting out, we took a liberal approach to costs. Our goal was to experiment and learn. We gave dozens of people write access to our cloud storage, and we accumulated hundreds of different datasets. Almost all of these went into a single Google Cloud Storage bucket: pangeo-data.

That NSF project, and the cloud credits that came with it, are coming to an end. We need to take stock of the cloud data we have accumulated and make a plan for transitioning to a more sustainable future. This will involve moving and/or deleting much of the data we currently own. In particular, we plan to delete the pangeo-data bucket and transition to a policy of “one bucket per data provider.”

An important step is to catalog the data in our intake data catalog. We would like all our data to eventually end up in the catalog, rather than just living in a random, undiscoverable location in cloud storage. Our current catalog is here:
https://pangeo-data.github.io/pangeo-datastore/

@jhamman has compiled a list of datasets that are in our bucket but not in our catalog. We have put this information into a google spreadsheet:

Our goal over the next week is to categorize all these datasets into one of three categories:

Remove
Move to new bucket (not pangeo-data) and add to catalog
Keep in existing bucket and add to catalog

Below is a summary of the unaccounted-for datasets. If you recognize any of these, please open the Google sheet and claim it or help decide its fate.

All unclaimed datasets will be deleted on Friday, Nov. 22.

When deciding whether a dataset should be kept or deleted, we should consider several factors:

How big / expensive is it?
Is it being actively used?
Can the data easily be regenerated from another public archive?

Summary of Unclaimed Datasets

Bucket	Dataset	# datasets
pangeo-data	CMIP5-ts
pangeo-data	CMIP6	1
pangeo-data	CMIP6-test	3
pangeo-data	ECCO.zarr	1
pangeo-data	ECCO_chank.zarr	1
pangeo-data	ECCO_layers.zarr	1
pangeo-data	GEOS_V2p1	18
pangeo-data	GEOS_V2p1_L	18
pangeo-data	NATL60-CJM165-SSH-1h-1m2deg2deg	1
pangeo-data	NATL60-CJM165-SSH-1h-2D	1
pangeo-data	NATL60-CJM165-SSU-1h-1m2deg2deg	1
pangeo-data	NATL60-CJM165-SSV-1h-1m2deg2deg	1
pangeo-data	SLaPS	216
pangeo-data	amazonas	1
pangeo-data	avhrr-patmos-x-cloudprops-noaa-asc-fc_TESTING4	1
pangeo-data	balwada	4
pangeo-data	cesm	4
pangeo-data	channel	1
pangeo-data	cm2.6	3
pangeo-data	dataset-duacs-rep-global-merged-allsat-phy-l4-v3	1
pangeo-data	eNATL60-BLB002-SSU-1h	1
pangeo-data	eNATL60-BLB002-SSV-1h	1
pangeo-data	eNATL60-BLBT02-SSU-1h	1
pangeo-data	eNATL60-BLBT02-SSV-1h	1
pangeo-data	eNATL60-BLBT02X-ssh	2
pangeo-data	eNATL60-I	3
pangeo-data	eORCA025-I	1
pangeo-data	eORCA1-I	1
pangeo-data	ecco	11
pangeo-data	esgf_test	1
pangeo-data	gpm_imerg	4
pangeo-data	gross	1372
pangeo-data	kai-llc4320-vertical-fluxes	7
pangeo-data	llc4320	1
pangeo-data	model_vars_5day_av_zarr	1
pangeo-data	netcdf_test_data	1
pangeo-data	polar	4
pangeo-data	pyqg	10
pangeo-data	rsignell	17
pangeo-data	storage-benchmarks	3
pangeo-data	test	1
pangeo-data	tracer_vars_5day_av_zarr	1
pangeo-data	tracmip	7067
pangeo-data	xESMF_test	3
pangeo-data	zarr-eNATL60	6
pangeo-data	zarr_NATL60-CJM165_SSU_1h_y2013m07-09	1
pangeo-ecco	llc	3
pangeo-ocean-ml	LLC4320	124
pangeo-parcels	med_sea_connectivity_v2019.09.11.2	2

jhamman · November 12, 2019, 3:26pm

Pinging a bunch of people I want to make sure see this: @scottyhq, @rsignell, @naomi-henderson, @jbusecke, @jlesommer, @dhruvbalwada, @davidbrochart.

rabernat · November 12, 2019, 3:39pm

Perhaps we could also generate an email list of people with storage-admin rights?

jbusecke · November 12, 2019, 5:23pm

The “CMIP6” data is not the one we worked from at the hackathon, I assume? Unless that is the case, I have nothing to claim at this point. Thanks for the ping though.

willirath · November 20, 2019, 7:42am

If possible, I’d like to keep the approx. 300 GB (?) pangeo-parcels bucket around. The data are 40 days (~1000 hourly snapshots) of 15,000,000 Lagrangian particles in the Med Sea, and I’m working on ways to generate fast visualisations from these trajectories.

jhamman · November 20, 2019, 8:19am

Sounds good @willirath. Can you add that data to the pangeo-datastore catalog?

Aurelie_ALBERT · November 28, 2019, 4:18pm

Hi, I am currently documenting and reorganizing the data that I (on behalf of the MEOM team in Grenoble) put on the cloud, so that it can be added to the pangeo-datastore catalog (sorry it took so long). I was wondering whether a dedicated pangeo-nemo bucket should be created? Thanks

rabernat · November 29, 2019, 2:10pm

Yes, it would be very convenient to have a dedicated bucket for each group / project. I will create one for you.

rabernat · December 2, 2019, 6:07pm

I have created the pangeo-meom bucket for you @Aurelie_ALBERT. Could you please move all your data there? Thanks.

Aurelie_ALBERT · December 3, 2019, 8:29am

Thanks @rabernat, I’m moving the data today!

Aurelie_ALBERT · December 3, 2019, 9:00am

Sorry @rabernat, there seems to be an issue with the pangeo-meom bucket. I get the error “BadRequestException: 400 Bucket is requester pays bucket but no user project provided.” each time I try to upload something, or even just run “gsutil ls -a gs://pangeo-meom”. It must be linked to the bucket’s requester-pays setting, but I cannot see it.

rabernat · December 3, 2019, 2:59pm

I think you need to specify the user project now: gsutil -u pangeo-181919 ...

https://cloud.google.com/storage/docs/using-requester-pays#using
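
For anyone scripting this, here is a minimal Python sketch of where the `-u` option goes: it is a top-level gsutil option, so it comes before the subcommand. The helper name `gsutil_cmd` is hypothetical; the project ID and bucket are the ones mentioned in this thread.

```python
import shlex

def gsutil_cmd(user_project, *args):
    """Build a gsutil command line that bills requests to `user_project`."""
    # -u is a top-level option, so it must precede the subcommand (ls, cp, ...)
    return ["gsutil", "-u", user_project, *args]

cmd = gsutil_cmd("pangeo-181919", "ls", "-a", "gs://pangeo-meom")
print(shlex.join(cmd))

# To actually run it (needs gsutil installed and authenticated):
# import subprocess; subprocess.run(cmd, check=True)
```

Note that the account running the command also needs permission to bill requests to that project, otherwise the request is still rejected.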

Aurelie_ALBERT · December 4, 2019, 8:32am

Ok, now I get “AccessDeniedException: 403 auraoupa@gmail.com does not have serviceusage.services.use access to project 464800473488.”

davidbrochart · December 11, 2019, 4:32pm

I’m going to work on uploading the GPM data again. So the gpm_imerg data on pangeo-data can be deleted. Where should I upload now?

rabernat · December 12, 2019, 2:35am

@charlesbluca - could you assign a bucket to @davidbrochart?

kthyng · December 17, 2019, 2:43pm

Hi @rabernat and @jhamman! I am interested in getting a working example case available online for common tasks my group and interested parties want to do with our ROMS output. I checked out your data catalog and I don’t see model output in the ocean subsection. My hope is that we could store some output with Pangeo so that we could then use it in a Pangeo binder notebook for demonstration purposes. Somewhere between 10 and 100GB would be nice, and we could provide the output in zarr files if that is preferred. Maybe this is bad timing since you are ratcheting the data storage down but thought I’d see.

Alternatively, is there another good approach to have a working example case available online? I tried to use an hvplot with model output over the internet and it didn’t work, so it seems like even for demo purposes, the data and the analysis need to be co-located.

rabernat · December 17, 2019, 9:43pm

Thanks for your post @kthyng! And welcome to the forum!

As you can probably tell, the cloud data catalog is currently very much a work in progress. The organization of the catalog has evolved organically, and we’re happy to see it continue to do so. There are many ocean model datasets, but not a specific section for models. If you have a suggestion for how to re-organize things, please feel free to propose it over in the github repo:

In the long term, we would like to see a federation of data providers, rather than having Pangeo own all the data. (But still have this data be accessible through a single catalog.) Part of that process means getting people used to managing their own cloud storage.

In the case of your ROMS data, it costs very little. 100 GB of storage costs roughly $25 per year.

We could easily absorb this cost and create a new bucket for your data. But more than a financial cost, this creates a maintenance cost. I’d like to propose an alternative: could you host the data under your own google cloud account? When you sign up for Google cloud, you’ll get $300 worth of free credits, enough to maintain your dataset for 12 years. (To avoid egress costs, put the bucket in “requester pays” mode.) This would give you total ownership over your own data and also teach you something about cloud storage.
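
As a rough check on those numbers, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted in this post (about $25 per year for 100 GB, and $300 of credits); the per-GB rate is derived from them and is not an official price, and the function name is made up for illustration.

```python
# Figures quoted in this post (approximate)
COST_PER_YEAR_USD = 25.0   # yearly storage cost for 100 GB
DATASET_GB = 100.0
FREE_CREDITS_USD = 300.0

def years_covered(credits_usd, size_gb, usd_per_gb_year):
    """Years of storage a credit balance pays for (storage charges only)."""
    return credits_usd / (size_gb * usd_per_gb_year)

# Implied rate: 0.25 USD per GB per year
usd_per_gb_year = COST_PER_YEAR_USD / DATASET_GB

print(years_covered(FREE_CREDITS_USD, 100, usd_per_gb_year))  # 12.0
print(years_covered(FREE_CREDITS_USD, 10, usd_per_gb_year))   # 120.0
```

This counts storage only. It ignores egress and per-operation charges, and any expiry date on promotional credits, so treat it as an upper bound.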

Once you have the data uploaded, you could then make a PR to add it to the Pangeo catalog.

How does this sound?

kthyng · December 17, 2019, 10:07pm

Yes, that sounds fair enough @rabernat. Just signed up for a bucket the other day in fact.

How do you see the other side of things working — the Pangeo binder side? Is it still possible to contribute a notebook to be able to run through Pangeo binder?

rabernat · December 18, 2019, 2:17pm

Absolutely! Anyone can do this now already, with no special permissions from us. Just go to https://binder.pangeo.io/.

Let us know if you get stuck on anything.

kthyng · December 19, 2019, 7:55pm

Awesome, thanks! I’ll be working on it.
