RioXarray & Dask in a cloud env

I’m working with some MODIS data. The scenes have been downloaded to a local folder so that I can read them through rioxarray. Once they are read, the second step is to resample, thanks to rioxarray, the 500 m product to 250 m. On our HPC I have no issue achieving the reprojection, because the folder holding the files is visible from all the nodes. If I try to do the same on the Planetary Computer, I get an error about the files not being readable. (I’ve tested the same thing on the Pangeo AWS deployment, but the situation there is a bit more complicated: it seems to me that rioxarray isn’t up to date, which provokes other errors.)

The code looks like:

ds_250 = rxr.open_rasterio(filename_250, chunks='auto')
ds_500 = rxr.open_rasterio(filename_500, chunks='auto')

ds_500_match = ds_500.rio.reproject_match(ds_250)

and the error looks like:

/tmp/modis/MOD09A1_22_07_2021321_MOD09A1.A2021321.h22v07.006.2021331124727.hdf:MOD_Grid_500m_Surface_Reflectance:sur_refl_b01: No such file or directory

If I don’t specify any Dask cluster, the code runs smoothly; on the contrary, when workers are involved, it isn’t possible to get the reprojection done.
I’m inclined to think the problem is related to the visibility of the folder from the workers. (I would assume the workers are on different pods and can’t see the original location.)
Am I far from the truth? How can I solve this?

Can you share how you’re making the cluster? Is it a GatewayCluster() or a LocalCluster()?

If you’re using a GatewayCluster, then yes, your guess is correct. The workers won’t have access to the notebook server’s filesystem, so they can’t read those files. This is touched on a bit at Pangeo Cloud — Pangeo documentation.

So right now your options are either to keep doing things as you are with a LocalCluster (the Python environment on the Planetary Computer has 4 cores, but you might get I/O-bound quickly), or to shift to using Azure Blob Storage and pass around a credential so that all your workers can read / write that blob filesystem; see the sketch below.
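To make the credential part concrete, here’s a minimal sketch. The account name, the placeholder SAS token, and the scale-to-4 call are all assumptions, not Planetary Computer specifics; AZURE_STORAGE_ACCOUNT / AZURE_STORAGE_SAS_TOKEN are the configuration options GDAL reads for /vsiaz/ paths:

import os
from dask_gateway import GatewayCluster

cluster = GatewayCluster()
cluster.scale(4)
client = cluster.get_client()

sas_token = "..."  # placeholder: a SAS token scoped to your container

def set_azure_credentials(token=sas_token):
    # Runs on each worker. GDAL picks these up when resolving /vsiaz/ paths;
    # adlfs accepts the same values via its account_name= / sas_token= arguments.
    os.environ["AZURE_STORAGE_ACCOUNT"] = "myaccount"  # placeholder account name
    os.environ["AZURE_STORAGE_SAS_TOKEN"] = token

client.run(set_azure_credentials)  # executes once on every current worker

Note that client.run only reaches workers that are already up; for an autoscaling cluster you’d want to inject the credential through the cluster’s environment options instead, if your deployment exposes those.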

Finally, we (Microsoft / AI for Earth) do have MODIS available through Blob Storage, but it’s in some flux right now. AIforEarthDataSets/modis.md at main · microsoft/AIforEarthDataSets · GitHub has the details, but right now it’s accessible from East US. We should have it available through the STAC API some time in the next 3-6 months. You might be able to read from there if those are the right products (it might be a bit slow if you’re using the Planetary Computer Hub, which is in West Europe).

It’s through a GatewayCluster, and what I’m trying to do is avoid the local cluster so that I can scale up the process.

About Azure Blob Storage: I would assume that, for rioxarray to be able to read from a blob, I would need fsspec. Am I right?

Up to now I was just grabbing ideas from this example, and Microsoft Azure Blob Storage is definitely where my data originates.

You would probably want to use adlfs or azure.storage.blob to list the files in directories. But once you have a URL, you can just pass that to rioxarray; it uses GDAL’s HTTP handling to fetch the data.
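For instance, something like this sketch (account and container names are copied from the MODIS example above; as it turns out further down the thread, the direct-URL route works for formats GDAL can stream, like COGs, but not for these HDF4 files):

import adlfs
import rioxarray as rxr

# List blobs in a "directory" with adlfs; pass sas_token=... or credential=...
# to the constructor if the container requires authentication.
fs = adlfs.AzureBlobFileSystem(account_name="modissa")
paths = fs.ls("modis-006/MOD09Q1/22/07/2021289")

# A plain HTTPS URL goes straight to rioxarray; GDAL fetches it via /vsicurl.
url = f"https://modissa.blob.core.windows.net/{paths[0]}"
da = rxr.open_rasterio(url, chunks="auto")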

So if you’re already getting the data from Blob Storage, I’d recommend streaming it straight into a DataArray rather than writing it to a local filesystem (Skip the download! Stream NASA data directly into Python objects | by Scott Henderson | pangeo | Medium explains this nicely).
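A minimal sketch of that streaming pattern, assuming a GDAL-readable raster (the URL here is a hypothetical COG; for these HDF4 files the caveat below applies):

import fsspec
import rioxarray as rxr
from rasterio.io import MemoryFile

url = "https://example.blob.core.windows.net/container/scene.tif"  # hypothetical

with fsspec.open(url, "rb") as f:
    with MemoryFile(f.read()) as memfile:  # rasterio's in-memory file
        with memfile.open() as src:
            da = rxr.open_rasterio(src).load()  # load while the MemoryFile is open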


I already have the URL, but it seems that something goes wrong at the GDAL level. Below is what I’ve been testing, and the error:

rxr.open_rasterio('https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf', chunks='auto')

returns an error like this:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    198             try:
--> 199                 file = self._cache[self._key]
    200             except KeyError:

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<function open at 0x7fe52d837700>, ('https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf',), 'r', (('sharing', False),)]

During handling of the above exception, another exception occurred:

CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_shim.pyx in rasterio._shim.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: '/vsicurl/https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
/tmp/ipykernel_539/3845122745.py in <module>
----> 1 rxr.open_rasterio('https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf', chunks='auto')

/srv/conda/envs/notebook/lib/python3.8/site-packages/rioxarray/_io.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock, masked, mask_and_scale, variable, group, default_name, decode_times, decode_timedelta, **open_kwargs)
    831         else:
    832             manager = URIManager(rasterio.open, filename, mode="r", kwargs=open_kwargs)
--> 833         riods = manager.acquire()
    834         captured_warnings = rio_warnings.copy()
    835 

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/file_manager.py in acquire(self, needs_lock)
    179             An open file object, as returned by ``opener(*args, **kwargs)``.
    180         """
--> 181         file, _ = self._acquire_with_cache_info(needs_lock)
    182         return file
    183 

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    203                     kwargs = kwargs.copy()
    204                     kwargs["mode"] = self._mode
--> 205                 file = self._opener(*self._args, **kwargs)
    206                 if self._mode == "w":
    207                     # ensure file doesn't get overriden when opened again

/srv/conda/envs/notebook/lib/python3.8/site-packages/rasterio/env.py in wrapper(*args, **kwds)
    435 
    436         with env_ctor(session=session):
--> 437             return f(*args, **kwds)
    438 
    439     return wrapper

/srv/conda/envs/notebook/lib/python3.8/site-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
    218         # None.
    219         if mode == 'r':
--> 220             s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
    221         elif mode == "r+":
    222             s = get_writer_for_path(path, driver=driver)(

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: '/vsicurl/https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf' not recognized as a supported file format.

By the way, HDF is a supported format.
Documentation on the rasterio.env module is really scarce and mostly focused on AWS, but I’ve tested that approach as well, with the same result:

import rasterio
import rioxarray as rxr

with rasterio.env.Env(AZURE_STORAGE_ACCOUNT='modissa', AZURE_STORAGE_SAS_TOKEN='sv=2019-12-12&si=modis-ro&sr=c&sig=jauBgV7FNXyhbjO60XSMr6AnFFd5sl%2BwXsNazkOOFjs%3D'):
    ds_250 = rxr.open_rasterio('https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf', chunks='auto')

In the end, I tested at the lowest level, directly through GDAL, following the GDAL Virtual File Systems documentation (I’m not at all sure about the account name, but I didn’t have any other idea), but it fails as well, with a timeout:

export AZURE_STORAGE_ACCOUNT='modissa'
export AZURE_STORAGE_SAS_TOKEN='sv=2019-12-12&si=modis-ro&sr=c&sig=jauBgV7FNXyhbjO60XSMr6AnFFd5sl%2BwXsNazkOOFjs%3D'

gdalinfo /vsiaz/https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf

Your first example wouldn’t work:

rxr.open_rasterio('https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf', chunks='auto')

since it doesn’t include the SAS token as a query parameter. I would have hoped that

rxr.open_rasterio('https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf?sv=2019-12-12&si=modis-ro&sr=c&sig=jauBgV7FNXyhbjO60XSMr6AnFFd5sl%2BwXsNazkOOFjs%3D', chunks='auto')

would work, but it fails as well.

Looking at GDAL Virtual File Systems (compressed, network hosted, etc...): /vsimem, /vsizip, /vsitar, /vsicurl, ... — GDAL documentation, it seems like GDAL doesn’t support opening HDF4 files through a virtual filesystem:

Notable exceptions are the netCDF, HDF4 and HDF5 drivers.

So unfortunately it seems like downloading the file to disk might be unavoidable (I’m not sure whether one of xarray’s open_dataset engines would be able to read this file over the network).
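A workaround sketch along those lines would be to stream the bytes to local disk first and then open the local copy (reusing the signed URL from above):

import fsspec
import rioxarray as rxr

signed_url = "https://modissa.blob.core.windows.net/modis-006/MOD09Q1/22/07/2021289/MOD09Q1.A2021289.h22v07.006.2021299111058.hdf?sv=2019-12-12&si=modis-ro&sr=c&sig=jauBgV7FNXyhbjO60XSMr6AnFFd5sl%2BwXsNazkOOFjs%3D"

# Copy the blob to local disk, since GDAL can't stream HDF4 over /vsicurl.
with fsspec.open(signed_url, "rb") as remote, open("scene.hdf", "wb") as local:
    local.write(remote.read())

ds = rxr.open_rasterio("scene.hdf", chunks="auto")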

The real problem isn’t downloading the files locally but rather how to keep them lazy.
Since the workers can’t access local files, scalability is gone: all the tiles must be processed on the notebook pod.
Most probably I have to rethink the entire approach.
Thanks for all your inputs.

Yes, as far as the approach goes: each worker does part of the job in memory and writes that part back to cloud storage. Then you do something at the end to combine the pieces, etc.
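A sketch of that pattern, where everything — process_scene, the subdataset string, scene_list, the storage options — is hypothetical: each dask.delayed task downloads its scene to its own worker’s disk, computes there, and uploads its piece back to Blob Storage.

import os
import tempfile

import fsspec
import rioxarray as rxr
from dask import delayed

@delayed
def process_scene(signed_url, out_blob, storage_options):
    # Runs entirely on one worker: local download -> compute -> upload.
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "scene.hdf")
        with fsspec.open(signed_url, "rb") as src, open(local, "wb") as dst:
            dst.write(src.read())  # HDF4 can't be streamed, so copy it locally
        # Open one subdataset; the grid:band pattern follows the error message
        # earlier in the thread (hypothetical names for MOD09Q1).
        sub = f'HDF4_EOS:EOS_GRID:"{local}":MOD_Grid_250m_Surface_Reflectance:sur_refl_b01'
        da = rxr.open_rasterio(sub)
        # ... reproject / resample in worker memory here ...
        out = os.path.join(tmp, "out.tif")
        da.rio.to_raster(out)
        # Write this worker's piece back to Blob Storage via adlfs.
        fsspec.filesystem("az", **storage_options).put(out, out_blob)
    return out_blob

scene_list = []  # fill with (signed_url, "container/path/out.tif") pairs
opts = {"account_name": "myaccount", "sas_token": "..."}  # placeholders
tasks = [process_scene(url, dest, opts) for url, dest in scene_list]
# dask.compute(*tasks), then combine the uploaded pieces as a final step.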