Reading HDF-EOS (HDF4) files in parallel: GDAL/rasterio/etc

I’m trying to read an entire day’s worth of MODIS granules in parallel. I can read a single tile in one of a few ways, but reading in parallel is still a problem. Reading serial takes about 15 minutes on my machine which is kinda a non-starter for daily data over the past 20 years. I’ve included an example below, but it requires downloading the MODIS data. It is available at

https://e4ftl01.cr.usgs.gov/MOTA/MCD43A4.061/2001.01.01/

This example is for the MODIS NBAR collection, but really applies to any collection. AFAIK, the parallel read fails because GDAL wants an output grid, which I can provide, but I haven’t been able to find sufficient documentation of the GDAL Python bindings to tell me how to do this.

from os import path
from glob import glob
from time import time

import numpy as np
import xarray as xr
import rioxarray as rxr

TESTDIR = '../data/MCD43A4.061/2001/001'
flist = glob(path.join(TESTDIR, '*.hdf'))

#--- PARALLEL? ---
t0 = time()
# Option 1:
dsin = xr.open_mfdataset(flist, engine='rasterio').squeeze(drop=True)

# Option 2:
# Doesn't work, gives
# ERROR 10: Pointer 'hObject' is NULL in 'GDALGetMetadata'.
#dsin = xr.open_mfdataset(flist, engine='rasterio', parallel=True).squeeze(drop=True)

# Read Red, NIR and QCs
redin = dsin['Nadir_Reflectance_Band1'].values.T
nirin = dsin['Nadir_Reflectance_Band2'].values.T

dsin.close()
print('parallel read: {:5.3} minutes elapsed'.format((time() - t0)/60.))

#--- SERIAL ---
t0 = time()
for ff in flist:
    dsin = rxr.open_rasterio(ff).squeeze(drop=True)

    # Read Red, NIR and QCs
    redin = dsin['Nadir_Reflectance_Band1'].values.T
    nirin = dsin['Nadir_Reflectance_Band2'].values.T
dsin.close()
print('serial read:   {:5.3} minutes elapsed'.format((time() - t0)/60.))

is the HDF4 driver available via your rasterio?

with GDAL itself you can use cli

gdalinfo --formats

(“HDF4” would be in the list)

I expect rasterio has a similar function to query that

I’d be using the warper api and probably (some VRT virtualization to get around the format limitations) to read/map these files, but I don’t know if the xarray interface can do that

Yes. I probably was unclear that the code above /works/ for me. It’s the commented out line that doesn’t, presumably because dsin needs to be initialized with the output grid that is the entire Earth rather than the individual granule file. I know how to define this grid, etc., but it’s not clear to me how to do that with the rasterio/xarray combo above that will hopefully give me the ability to parallel read.

I’m happy to call through the warper, but the functionality I’m trying to do is read ~200 of these tiles in parallel. If the warper can do this, happy to go that route.