How to allow dask workers to read from a requester pays s3 bucket using rasterio?

I can read a COG from a requester pays s3 bucket with xarray/rasterio by doing:

os.environ["AWS_REQUEST_PAYER"] = "requester" 
cog = 's3://dev-et-data/compressed/NDVI_filled/2001/2001001.250_m_NDVI.tif'
da = xr.open_rasterio(cog, chunks={'band':1, 'x':4096, 'y':4096})

but if I create a dask client to speed things up, the workers get access denied errors. Has anyone used dask to read from a requester pays bucket with rasterio?

Looking at http://xarray.pydata.org/en/stable/generated/xarray.open_rasterio.html, I don’t see a way to pass arguments to open_rasterio that eventually go through to whatever is interacting with AWS (s3fs?). So I think you’re stuck trying to set environment variables on the workers.
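For what it's worth, AWS_REQUEST_PAYER is ultimately a GDAL configuration option (rasterio reads S3 through GDAL's /vsis3/ rather than s3fs), so on the client you can also scope it with rasterio.Env instead of os.environ. That doesn't help the workers, though, since the chunked reads happen in their processes. A minimal local sketch:

import rasterio

cog = 's3://dev-et-data/compressed/NDVI_filled/2001/2001001.250_m_NDVI.tif'

# Scopes the GDAL config option to this block, in the local process only;
# dask workers in a remote cluster never see it.
with rasterio.Env(AWS_REQUEST_PAYER="requester"):
    with rasterio.open(cog) as src:
        print(src.profile)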

How’s the cluster being created? Setting the environment variable on the workers as they’re created is the most robust way to do this.

Short of that, you can run a function on the workers that sets the variable, after you have a cluster scaled up.

client.run(lambda: os.environ.update(AWS_REQUEST_PAYER="requester"))

But if a worker crashes and is replaced, the new worker won’t have that variable set, and so will fail to read data.
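One way to make the setting survive worker replacement is a worker plugin, whose setup hook runs on every worker that joins the cluster, including ones started after a crash. A minimal sketch, assuming a distributed version that provides WorkerPlugin and Client.register_worker_plugin:

import os
from distributed.diagnostics.plugin import WorkerPlugin

class RequesterPays(WorkerPlugin):
    """Set AWS_REQUEST_PAYER on every worker, including replacements started later."""
    def setup(self, worker):
        os.environ["AWS_REQUEST_PAYER"] = "requester"

# `client` is the dask.distributed Client connected to the cluster, as above
client.register_worker_plugin(RequesterPays())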

I’m running on dask-kubernetes (and hopefully soon on dask-gateway). I tried to figure out how to set environment variables via the env: {} key in ~/.config/dask/kubernetes.yaml:

# kubernetes:
#   name: "dask-{user}-{uuid}"
#   namespace: null
#   count:
#     start: 0
#     max: null
#   host: "0.0.0.0"
#   port: 0
#   env: {}

but could not figure it out.

It turns out that I was setting the worker environment correctly in my ~/.config/dask/kubernetes.yaml file:

kubernetes:
  env: 
    AWS_REQUEST_PAYER: requester

but my KubeCluster was not picking that up because I was creating the cluster in the Dask JupyterLab extension. If I create my cluster inline in the notebook using cluster = KubeCluster(), it works fine.

I spent some time trying to set environment variables when KubeCluster is created in the extension, but failed.

So for now, my workaround will be to just create my KubeCluster from the notebook and pass in the env as a parameter (instead of modifying the kubernetes.yaml file):

cluster = KubeCluster(n_workers=2, env={'AWS_REQUEST_PAYER': 'requester'})

So here’s a complete notebook example that reads COGs from a requester pays bucket in parallel using xarray and dask:

import os
import xarray as xr
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# The client process also needs the variable, since opening the dataset reads metadata locally
os.environ["AWS_REQUEST_PAYER"] = "requester"
# Pass the same variable to the workers, which do the chunked reads
cluster = KubeCluster(n_workers=2, env={'AWS_REQUEST_PAYER': 'requester'})
client = Client(cluster)
cog = 's3://dev-et-data/compressed/NDVI_filled/2001/2001001.250_m_NDVI.tif'
da = xr.open_rasterio(cog, chunks={'band':1, 'x':4096, 'y':4096})
da.load()
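
To double-check that the env parameter actually reached the workers, an optional probe with client.run (using the client and os import from the example above) should report 'requester' for each worker:

# Returns a dict mapping each worker address to its value of the variable
client.run(lambda: os.environ.get("AWS_REQUEST_PAYER"))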

Thanks @TomAugspurger and @jsignell for the help!

A PR to intake-xarray to pass arguments through would be welcomed.