Passing GDAL config to rioxarray functions?

I’m unclear how to go about passing GDAL configuration to rioxarray functions. For instance, I’d like to use xarray.open_mfdataset to open up a list of URLs lazily by leveraging the GDAL virtual filesystem.

If I first import rioxarray and set engine="rasterio" in my open_mfdataset call, this mostly works as expected. However, sometimes I need to set specific GDAL config options, e.g. to authenticate for NASA Earthdata access, and I don’t see how to do this.

I’ve tried using with rasterio.Env and I’ve tried setting environment variables, but no luck (code below). I can confirm that gdalinfo has no trouble opening these files over the virtual filesystem when I set the same GDAL environment variables, so the credentials themselves are not at fault. Here’s what I’m trying:

import earthaccess
import rioxarray
import rasterio
import xarray as xr
import os
from pathlib import Path

cookies = os.path.expanduser("~/.urs_cookies")
Path(cookies).touch()

results = earthaccess.search_data(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    temporal=("2019-01-01", "2019-12-31"),
)
# files = earthaccess.open(results)

data_links = [granule.data_links(access="external") for granule in results]

# use rasterio.Env to set the GDAL config? Doesn't work
with rasterio.Env(GDAL_HTTP_COOKIEFILE=cookies,
                  GDAL_HTTP_COOKIEJAR=cookies,
                  GDAL_HTTP_NETRC="YES"):
    ds = xr.open_mfdataset(data_links, engine="rasterio")

p.s. Yes I know the earthaccess package provides a mechanism that wraps fsspec so that xr.open_mfdataset can open these files (without GDAL if I understand correctly). That’s great, but I’m looking for a generic solution to pass GDAL configuration options to packages that are already using GDAL.
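
For reference, the earthaccess route I mean looks roughly like this (a sketch continuing from the search above; I believe earthaccess.login() is needed before open()):

earthaccess.login()                # EDL credentials, needed before open()
files = earthaccess.open(results)  # fsspec file-like objects for each granule
ds = xr.open_mfdataset(files)      # reads go through fsspec, not GDAL, as far as I can tell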

@cboettig you’re probably getting a misleading error message here. Rioxarray does have access to the environment variables, but open_mfdataset expects a list of URLs and you’re passing a list of lists. I see this error message:

TypeError: invalid path or file: ['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc']

Your data_links looks like this:

[['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'],
 ['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'],
 ['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190103090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc']]

It works if you use a flat list of URLs, but access will likely not be very efficient. You might be better off pre-downloading entire files to a scratch space if your workflow ultimately reads all the data:

links = [x[0] for x in data_links]

with rasterio.Env(GDAL_HTTP_COOKIEFILE=cookies,
                  GDAL_HTTP_COOKIEJAR=cookies,
                  GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR'):
    ds = xr.open_mfdataset(links[:3], engine="rasterio")
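
And if you do go the download route, earthaccess can fetch the granules to local storage for you; a rough sketch, assuming the results list from your search and an arbitrary scratch directory:

# fetch granules to a local scratch directory (path is just illustrative)
local_files = earthaccess.download(results, "/tmp/mur_scratch")

# local netCDF files, so no GDAL/HTTP configuration is needed
ds_local = xr.open_mfdataset(local_files)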

Thanks @scottyhq, you’re right of course, that was my mistake. It does work when I provide a flat list of URLs, though I’m still a bit unclear whether those GDAL env vars will be propagated correctly when xarray uses dask for a distributed computation.
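
In case it matters, here’s the kind of thing I imagine might be needed for the distributed case (just a sketch, untested; it assumes a dask.distributed Client and uses Client.run to set the variables on each worker):

import os
from dask.distributed import Client

client = Client()  # illustrative local cluster

# the same GDAL settings as above, pushed to every worker process,
# since variables set only in the client process may not reach workers
gdal_env = {
    "GDAL_HTTP_COOKIEFILE": cookies,
    "GDAL_HTTP_COOKIEJAR": cookies,
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
}

def set_gdal_env(env=gdal_env):
    os.environ.update(env)

client.run(set_gdal_env)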

Sorry, I should have provided a more specific explanation of my use case: I’m interested in subsetting only a small spatial area from a single variable of these files. I understand that if I were using all the data it would be better to download first. But given that these netCDF serializations have global coverage and many different variables, I’m probably extracting less than 1% of the data in the files over a potentially multi-year span (I think every 4 years of data is roughly 1 terabyte if you download the whole thing), so I’d really like a range-request-based approach.
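
Roughly what I’m after is something like the following (a sketch only; I’m assuming the SST variable is called analysed_sst, that the lat/lon grid is EPSG:4326, and the bounding box is arbitrary):

links = [granule.data_links(access="external")[0] for granule in results]

with rasterio.Env(GDAL_HTTP_COOKIEFILE=cookies,
                  GDAL_HTTP_COOKIEJAR=cookies,
                  GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR"):
    ds = xr.open_mfdataset(links, engine="rasterio")

    # one variable, small bounding box; ideally only these byte ranges
    # get requested over HTTP
    sst = ds["analysed_sst"].rio.write_crs("EPSG:4326")
    subset = sst.rio.clip_box(minx=-123.0, miny=37.0, maxx=-122.0, maxy=38.0)

    # compute while the Env is still active so reads happen with these settings
    subset = subset.compute()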

Here’s a draft notebook with a more minimal example

I’m coming from R, where we don’t have an equivalent to fsspec to provide a virtual filesystem, but using the GDAL virtual filesystem to subset from this data takes about 3 minutes (nasa-topst-env-justice/drafts/sst.R at main · espm-157/nasa-topst-env-justice · GitHub), about the same as the earthdata fsspec example. I’d love to see how to do that with the xarray + rasterio + GDAL approach; it seems like it should be possible. odc.stac does a nice job of this in Python, but the examples all seem to be COG-based.
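
For comparison, the odc.stac pattern I have in mind is roughly this (a sketch with an illustrative public STAC endpoint and COG collection, not the MUR netCDF files above):

import odc.stac
import pystac_client

# illustrative STAC endpoint/collection, just to show the bbox-driven loading
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-123.0, 37.0, -122.0, 38.0],
    datetime="2019-06-01/2019-06-30",
).item_collection()

# lazily load only the requested band over the bounding box
cube = odc.stac.load(items, bands=["red"], bbox=[-123.0, 37.0, -122.0, 38.0], chunks={})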