Delete access to Google Cloud Storage object

Hi! I’m trying to run the rechunker Cloud example:

import os
import gcsfs
import zarr
from rechunker import rechunk

url = 'gs://pangeo-cmems-duacs'
gcs = gcsfs.GCSFileSystem(requester_pays=True)
source_store = gcs.get_mapper(url)

group = zarr.open_consolidated(source_store, mode='r')
source_array = group['sla']

max_mem = '1GB'
target_chunks = (8901, 72, 72)

scratch_path = os.environ['PANGEO_SCRATCH']

store_tmp = gcs.get_mapper(f'{scratch_path}/jdldeauna/rechunker_demo/temp_data_8.zarr')
store_target = gcs.get_mapper(f'{scratch_path}/jdldeauna/rechunker_demo/target_data_8.zarr')

r = rechunk(source_array, target_chunks, max_mem,
            store_target, temp_store=store_tmp)

However, executing r produces the following error:

result = r.execute()

OSError: Forbidden: https://storage.googleapis.com/upload/storage/v1/b/pangeo-integration-te-3eea-prod-scratch-bucket/o
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.

Is it recommended to set up a Google Service Account in order to work with rechunker? I would appreciate any suggestions, thank you so much!

I just confirmed this. Here is a minimal reproducer:

import os
import fsspec
import gcsfs

with fsspec.open(os.environ['PANGEO_SCRATCH'] + '/test', mode='w') as fp:
    fp.write('foobar')

fs = gcsfs.GCSFileSystem()
fs.ls(os.environ['PANGEO_SCRATCH'])

fs.rm(os.environ['PANGEO_SCRATCH'] + '/test')

This gives:

prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have 
storage.objects.delete access to the Google Cloud Storage object.

I will have someone from 2i2c look into changing the permissions.

Quick note: this might have been intentional. Recall that we don’t have user “namespaces” in the scratch bucket, so granting delete access will let everyone delete everyone else’s files there. That might be an acceptable tradeoff for a group of trusted users (you can already read everyone else’s scratch files).

Thanks @rabernat for looking into this! @TomAugspurger would it be possible to just have write access to the scratch bucket? I understand that files are already deleted every 7 days, so I don’t really need delete access. I’m not sure though why executing a rechunk requires delete access according to the error.

I think that specific error comes from rechunker trying to write to a clean directory. In the meantime, you might try setting store_target to a non-existent directory, like store_target = gcs.get_mapper(f'{scratch_path}/jdldeauna/rechunker_demo/target_data_9.zarr').

Unfortunately, even with changing the directory name for temp / target, the same error appears once the rechunk is executed :frowning:

result = r.execute()

OSError: Forbidden: https://storage.googleapis.com/upload/storage/v1/b/pangeo-integration-te-3eea-prod-scratch-bucket/o
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.

Hey all - it sounds like there was a bit of uncertainty about whether this was “intended behavior” or not. As others mentioned, right now people can read/write to scratch, but they can’t delete. Can we get a confirmation that the Pangeo community wants to give everybody the ability to delete as well?

Hi! I’m really sorry for the confusion. The error seems to be related to delete access, but it shows up even when I’m only trying to write to my scratch bucket. For example, when I run part of Ryan’s reproducer:

import os
import fsspec
import gcsfs

with fsspec.open(os.environ['PANGEO_SCRATCH'] + '/test', mode='w') as fp:
    fp.write('foobar')

A similar error appears:

OSError: Forbidden: https://storage.googleapis.com/upload/storage/v1/b/pangeo-integration-te-3eea-prod-scratch-bucket/o?uploadType=resumable&upload_id=ADPycdt_n7O4P4l-dNXBz4srYjyQv-LDHsLvnGnAV1LMvFNMzzo7wtkhJI7JOGRbHnvCPVsY3dMKPWSAdF0yAuPCzX8
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.

I personally would love to get delete access to the pangeo scratch bucket, for precisely that kind of processing @jdldeauna is doing here.
I think it is a very common pattern to rechunk a large dataset, and then derive/save some much smaller output, which does not have to live on the scratch bucket.

@jdldeauna I think what you are trying to do is overwrite something, which AFAIK actually deletes first and then writes again, and thus needs delete rights. I have encountered that in the past.
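
For example, a minimal sketch of what triggers it (reusing the PANGEO_SCRATCH reproducer from above; the key name here is just made up):

import os
import fsspec

scratch = os.environ['PANGEO_SCRATCH']

# A key that doesn't exist yet only needs storage.objects.create
with fsspec.open(scratch + '/some-brand-new-key', mode='w') as fp:
    fp.write('ok')

# Writing to the *same* key again is an overwrite, which in GCS also requires
# storage.objects.delete, so it fails with the Forbidden error above when
# delete permission is missing.
with fsspec.open(scratch + '/some-brand-new-key', mode='w') as fp:
    fp.write('overwritten')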

I assume there is no way to set up permissions so that a user can only delete their own data?

Oh I see: if I change with fsspec.open(os.environ['PANGEO_SCRATCH'] + '/test', mode='w') as fp: to '/test2', I’m able to write the file. Sorry, I misunderstood that. Similarly, going back to the rechunker example, I was only changing the filename (e.g., temp_data_8.zarr to temp_data_9.zarr) when I should also have changed the folder name (e.g., rechunker_demo) to avoid overwriting previous data. As for the delete permissions, I’m fine with whatever the community decides @choldgraf, but it might be nice to have per-user access, as suggested by @jbusecke. Thank you!

Hey all,

I believe object storage works a little bit like a git commit: read/write permissions let you overwrite essentially by creating a new file with the changes. This is why delete is a separate permission; it can also mean deleting all versions of a file.

Unfortunately, this is true. However, I will enable the delete permission now, since this is an experienced community.

Thanks so much Sarah for your help! :pray:

Just noting that finding a good general solution to this problem (providing private “scratch” storage to cloud Jupyter users) is an important and challenging DevOps problem that we have been discussing in Pangeo for many years. The central challenge is that, within the Kubernetes cluster where the hub runs, there is no mapping between hub identity (e.g. the username you log into the hub with, usually from GitHub) and a unique cloud-provider identity. If there were such a mapping, we could just create a bucket for each hub user. But as it is, all hub users look identical to the cloud provider, so we have no choice but to provide uniform global access to the scratch bucket for all users.

If any DevOps engineers are reading this and would like to work towards a better solution, please jump right in!

I believe this is now working (the screenshot is from a test on staging, but I have propagated that change to prod as well).

We’re in the design stages for something similar on Azure (call it a “user-data” service). It’s likely that the concepts will generalize to other clouds. Our requirements are:

  1. Users can view / modify only their own data.
  2. The system is able to enforce some kind of quota on bytes stored per user.
  3. When bytes are actually being written to / read from blob storage, we don’t want anything in between the user and the Blob Storage service.

First, you’ll need some sort of identity system. pangeo-cloud could piggyback on JupyterHub or use Auth0. All requests to the user-data service must be authenticated.

For uploading data, users will make requests to the user-data service, requesting permission to write a specific number of bytes to a specific key. The service will verify that this is OK (the user hasn’t exceeded their quota, for example) and will issue a SAS token that can only write to that specific key. The user can upload the data using their normal means (fsspec/adlfs, azure.storage.blob, etc.).
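
As a rough sketch of that upload flow (purely hypothetical: the service URL, routes, and JSON fields below are made up, and the auth token is just assumed to come from JupyterHub):

import os
import requests
from azure.storage.blob import BlobClient

# Hypothetical user-data service endpoint; here authentication piggybacks on
# the JupyterHub API token, but it could equally be Auth0.
USER_DATA_SERVICE = 'https://user-data.example.org'
token = os.environ['JUPYTERHUB_API_TOKEN']

# 1. Ask the service for permission to write n_bytes to a specific key.
#    It checks the quota and hands back a SAS URL scoped to that one blob.
resp = requests.post(
    f'{USER_DATA_SERVICE}/upload',
    headers={'Authorization': f'Bearer {token}'},
    json={'key': 'jdldeauna/output.zarr/0.0', 'n_bytes': 1024},
)
sas_url = resp.json()['sas_url']

# 2. Upload directly to Blob Storage with the scoped SAS URL, so nothing sits
#    between the user and the storage service for the actual bytes.
blob = BlobClient.from_blob_url(sas_url)
with open('chunk.bin', 'rb') as f:
    blob.upload_blob(f, overwrite=True)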

After it’s written, the user-data service will verify that the write is OK (e.g. that it isn’t larger than was requested) using Azure Event Grid. If it’s too large, we’ll delete it and somehow notify the user.

Reading specific keys is pretty similar: users request permission to read a key and get a SAS token. Listing “directories” is more challenging, because Azure Blob Storage doesn’t have a built-in concept of SAS tokens that are limited to prefixes. We’re still figuring that out.
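
A similarly hypothetical sketch of the read path:

import os
import requests
from azure.storage.blob import BlobClient

USER_DATA_SERVICE = 'https://user-data.example.org'  # hypothetical, as above
token = os.environ['JUPYTERHUB_API_TOKEN']

# Ask the service for a read-only SAS URL for a single key, then download
# straight from Blob Storage with it.
resp = requests.post(
    f'{USER_DATA_SERVICE}/download',
    headers={'Authorization': f'Bearer {token}'},
    json={'key': 'jdldeauna/output.zarr/0.0'},
)
data = BlobClient.from_blob_url(resp.json()['sas_url']).download_blob().readall()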

So it’s yet another service to run, and it’s pretty complicated compared to what pangeo-cloud has today (and it’s purely theoretical right now :smile:), but we’ll post details if and when it becomes a reality.

Wait, Tom, you’re a DevOps engineer? I thought you were an oceanographer now! :laughing:

But seriously, this sounds very cool.
