Delete access to Google Cloud Storage object

Hi! I’m trying to run the rechunker Cloud example:

import os
import gcsfs
import zarr
from rechunker import rechunk

url = 'gs://pangeo-cmems-duacs'
gcs = gcsfs.GCSFileSystem(requester_pays=True)
source_store = gcs.get_mapper(url)

group = zarr.open_consolidated(source_store, mode='r')
source_array = group['sla']

max_mem = '1GB'
target_chunks = (8901, 72, 72)

scratch_path = os.environ['PANGEO_SCRATCH']

store_tmp = gcs.get_mapper(f'{scratch_path}/jdldeauna/rechunker_demo/temp_data_8.zarr')
store_target = gcs.get_mapper(f'{scratch_path}/jdldeauna/rechunker_demo/target_data_8.zarr')

r = rechunk(source_array, target_chunks, max_mem,
            store_target, temp_store=store_tmp)

However, executing r produces the following error:

result = r.execute()

OSError: Forbidden: https://storage.googleapis.com/upload/storage/v1/b/pangeo-integration-te-3eea-prod-scratch-bucket/o
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.

Is it recommended to set up a Google Service Account in order to work with rechunker? I would appreciate any suggestions, thank you so much!

I just confirmed this. Here is a minimal reproducer:

import os
import fsspec
import gcsfs

with fsspec.open(os.environ['PANGEO_SCRATCH'] + '/test', mode='w') as fp:
    fp.write('foobar')

fs = gcsfs.GCSFileSystem()
fs.ls(os.environ['PANGEO_SCRATCH'])

fs.rm(os.environ['PANGEO_SCRATCH'] + '/test')

This gives:

prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have 
storage.objects.delete access to the Google Cloud Storage object.

I will have someone from 2i2c look into changing the permissions.

Quick note: this might have been intentional. Recall that we don’t have user “namespaces” in the scratch bucket, so granting delete access will let everyone delete everyone else’s files there. That might be an acceptable tradeoff for a group of trusted users (you can already read everyone else’s scratch files).

Thanks @rabernat for looking into this! @TomAugspurger would it be possible to just have write access to the scratch bucket? I understand that files are already deleted every 7 days, so I don’t really need delete access. I’m not sure though why executing a rechunk requires delete access according to the error.

I think that specific error comes from rechunker trying to write to a clean directory. In the meantime, you might try setting store_target to a non-existent directory, like store_target = gcs.get_mapper(f'{scratch_path}/jdldeauna/rechunker_demo/target_data_9.zarr').

Unfortunately, even with changing the directory name for temp / target, the same error appears once the rechunk is executed :frowning:

result = r.execute()

OSError: Forbidden: https://storage.googleapis.com/upload/storage/v1/b/pangeo-integration-te-3eea-prod-scratch-bucket/o
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.

Hey all - it sounds like there was a bit of uncertainty about whether this was “intended behavior” or not. As others mentioned, right now people can read/write to scratch, but they can’t delete. Can we get a confirmation that the Pangeo community wants to give everybody the ability to delete as well?

Hi! I’m really sorry for the confusion. The error seems to be related to delete access, but it shows up even when I’m only trying to write to my scratch bucket. For example, when I run part of Ryan’s reproducer:

import os
import fsspec
import gcsfs

with fsspec.open(os.environ['PANGEO_SCRATCH'] + '/test', mode='w') as fp:
    fp.write('foobar')

A similar error appears:

OSError: Forbidden: https://storage.googleapis.com/upload/storage/v1/b/pangeo-integration-te-3eea-prod-scratch-bucket/o?uploadType=resumable&upload_id=ADPycdt_n7O4P4l-dNXBz4srYjyQv-LDHsLvnGnAV1LMvFNMzzo7wtkhJI7JOGRbHnvCPVsY3dMKPWSAdF0yAuPCzX8
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.

I personally would love to get delete access to the pangeo scratch bucket, for precisely that kind of processing @jdldeauna is doing here.
I think it is a very common pattern to rechunk a large dataset, and then derive/save some much smaller output, which does not have to live on the scratch bucket.

@jdldeauna I think what you are trying to do is overwrite something, which AFAIK actually deletes first and then writes again, and thus needs delete rights. I have encountered that in the past.
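
For example, a minimal sketch of what triggers it (reusing the PANGEO_SCRATCH reproducer from above; the key name here is just made up):

import os
import fsspec

scratch = os.environ['PANGEO_SCRATCH']

# A key that doesn't exist yet only needs storage.objects.create
with fsspec.open(scratch + '/some-brand-new-key', mode='w') as fp:
    fp.write('ok')

# Writing to the *same* key again is an overwrite, which in GCS also requires
# storage.objects.delete, so it fails with the Forbidden error above when
# delete permission is missing.
with fsspec.open(scratch + '/some-brand-new-key', mode='w') as fp:
    fp.write('overwritten')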

I assume there is no way to set up permissions so that a user can only delete their own data?

Oh I see: if I change with fsspec.open(os.environ['PANGEO_SCRATCH'] + '/test', mode='w') as fp: to '/test2', I’m able to write the file. Sorry, I misunderstood that. Similarly, going back to the rechunker example, I was only changing the filename (e.g., temp_data_8.zarr to temp_data_9.zarr) when I should also have changed the folder name (e.g., rechunker_demo) to avoid overwriting previous data. As for the delete permissions, I’m fine with whatever the community decides @choldgraf, but it might be nice to have per-user access, as suggested by @jbusecke. Thank you!

Hey all,

I believe object storage works a little bit like a git commit: read/write permissions let you overwrite essentially by creating a new file with the changes. This is why delete is a separate permission; it can also mean deleting all versions of a file.

Unfortunately, this is true. However, I will enable the delete permission now, since this is an experienced community.

Thanks so much Sarah for your help! :pray:

Just noting that finding a good general solution to this problem (providing private “scratch” storage to cloud Jupyter users) is an important and challenging DevOps problem that we have been discussing in Pangeo for many years. The central challenge is that, within the Kubernetes cluster where the hub runs, there is no mapping between hub identity (e.g. the username you log into the hub with, usually from GitHub) and a unique cloud-provider identity. If there were such a mapping, we could just create a bucket for each hub user. But as it is, all hub users look identical to the cloud provider, so we have no choice but to provide uniform global access to the scratch bucket for all users.

If any DevOps engineers are reading this and would like to work towards a better solution, please jump right in!

I believe this is now working (the screenshot is from a test on staging, but I have propagated that change to prod as well).

We’re in the design stages for something similar on Azure (call it a “user-data” service). It’s likely that the concepts will generalize to other clouds. Our requirements are:

  1. Users can view / modify only their own data.
  2. The system is able to enforce some kind of quota on bytes stored per user.
  3. When bytes are actually being written to / read from blob storage, we don’t want anything in between the user and the Blob Storage service.

First, you’ll need some sort of identity system. pangeo-cloud could piggyback on JupyterHub or use Auth0. All requests to the user-data service must be authenticated.

For uploading data, users will make requests to the user-data service, requesting permission to write a specific number of bytes to a specific key. The service will verify that this is OK (the user hasn’t exceeded their quota, for example) and will issue a SAS token that can only write to that specific key. The user can upload the data using their normal means (fsspec/adlfs, azure.storage.blob, etc.).
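
As a rough sketch of that upload flow (purely hypothetical: the service URL, routes, and JSON fields below are made up, and the auth token is just assumed to come from JupyterHub):

import os
import requests
from azure.storage.blob import BlobClient

# Hypothetical user-data service endpoint; here authentication piggybacks on
# the JupyterHub API token, but it could equally be Auth0.
USER_DATA_SERVICE = 'https://user-data.example.org'
token = os.environ['JUPYTERHUB_API_TOKEN']

# 1. Ask the service for permission to write n_bytes to a specific key.
#    It checks the quota and hands back a SAS URL scoped to that one blob.
resp = requests.post(
    f'{USER_DATA_SERVICE}/upload',
    headers={'Authorization': f'Bearer {token}'},
    json={'key': 'jdldeauna/output.zarr/0.0', 'n_bytes': 1024},
)
sas_url = resp.json()['sas_url']

# 2. Upload directly to Blob Storage with the scoped SAS URL, so nothing sits
#    between the user and the storage service for the actual bytes.
blob = BlobClient.from_blob_url(sas_url)
with open('chunk.bin', 'rb') as f:
    blob.upload_blob(f, overwrite=True)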

After it’s written, the user-data service will verify that the write is OK (e.g. that it isn’t larger than was requested) using Azure Event Grid. If it’s too large, we’ll delete it and somehow notify the user.

Reading specific keys is pretty similar: users request permission to read a key and get a SAS token. Listing “directories” is more challenging, because Azure Blob Storage doesn’t have a built-in concept of SAS tokens that are limited to prefixes. We’re still figuring that out.
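
A similarly hypothetical sketch of the read path:

import os
import requests
from azure.storage.blob import BlobClient

USER_DATA_SERVICE = 'https://user-data.example.org'  # hypothetical, as above
token = os.environ['JUPYTERHUB_API_TOKEN']

# Ask the service for a read-only SAS URL for a single key, then download
# straight from Blob Storage with it.
resp = requests.post(
    f'{USER_DATA_SERVICE}/download',
    headers={'Authorization': f'Bearer {token}'},
    json={'key': 'jdldeauna/output.zarr/0.0'},
)
data = BlobClient.from_blob_url(resp.json()['sas_url']).download_blob().readall()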

So it’s yet another service to run, and it’s pretty complicated compared to what pangeo-cloud has today (and it’s purely theoretical right now :smile:), but we’ll post details if and when it becomes a reality.

Wait, Tom, you’re a DevOps engineer? I thought you were an oceanographer now! :laughing:

But seriously, this sounds very cool.
