Reading HDF-EOS (HDF4) files in parallel: GDAL/rasterio/etc

I’m trying to read an entire day’s worth of MODIS granules in parallel. I can read a single tile in one of a few ways, but reading in parallel is still a problem. Reading serially takes about 15 minutes on my machine, which is a non-starter for daily data over the past 20 years. I’ve included an example below, but it requires downloading the MODIS data. It is available at

https://e4ftl01.cr.usgs.gov/MOTA/MCD43A4.061/2001.01.01/

This example is for the MODIS NBAR collection, but really applies to any collection. AFAIK, the parallel read fails because GDAL wants an output grid, which I can provide, but I haven’t been able to find sufficient documentation of the GDAL Python bindings to tell me how to do this.

from os import path
from glob import glob
from time import time

import numpy as np
import xarray as xr
import rioxarray as rxr

TESTDIR = '../data/MCD43A4.061/2001/001'
flist = glob(path.join(TESTDIR, '*.hdf'))

#--- PARALLEL? ---
t0 = time()
# Option 1:
dsin = xr.open_mfdataset(flist, engine='rasterio').squeeze(drop=True)

# Option 2:
# Doesn't work, gives
# ERROR 10: Pointer 'hObject' is NULL in 'GDALGetMetadata'.
#dsin = xr.open_mfdataset(flist, engine='rasterio', parallel=True).squeeze(drop=True)

# Read Red and NIR bands
redin = dsin['Nadir_Reflectance_Band1'].values.T
nirin = dsin['Nadir_Reflectance_Band2'].values.T

dsin.close()
print('parallel read: {:5.3} minutes elapsed'.format((time() - t0)/60.))

#--- SERIAL ---
t0 = time()
for ff in flist:
    dsin = rxr.open_rasterio(ff).squeeze(drop=True)

    # Read Red and NIR bands
    redin = dsin['Nadir_Reflectance_Band1'].values.T
    nirin = dsin['Nadir_Reflectance_Band2'].values.T
    dsin.close()
print('serial read:   {:5.3} minutes elapsed'.format((time() - t0)/60.))

Is the HDF4 driver available via your rasterio?

With GDAL itself you can use the CLI:

gdalinfo --formats

(“HDF4” would be in the list.)

I expect rasterio has a similar function to query that.

I’d be using the warper API, and probably some VRT virtualization to get around the format limitations, to read/map these files, but I don’t know if the xarray interface can do that.

Yes. I probably was unclear: the code above /works/ for me. It’s the commented-out line that doesn’t, presumably because dsin needs to be initialized with an output grid covering the entire Earth rather than the individual granule file. I know how to define this grid, etc., but it’s not clear to me how to do that with the rasterio/xarray combo above in a way that will hopefully give me the ability to read in parallel.

I’m happy to call through the warper, but what I’m trying to do is read ~200 of these tiles in parallel. If the warper can do that, I’m happy to go that route.
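For the parallel-read pattern itself, independent of the GDAL specifics, a generic skeleton might look like the following. `read_tile` here is a placeholder for whatever per-file reader you settle on (e.g. rxr.open_rasterio); note that each worker must open its own dataset handle, since GDAL dataset handles aren’t shareable across threads:

```python
from concurrent.futures import ThreadPoolExecutor


def read_tile(path):
    # Placeholder reader: in practice this would open the granule
    # (e.g. rxr.open_rasterio(path)) and return the arrays you need.
    return {"path": path, "data": None}


# Illustrative tile names; these would come from glob over the data dir
paths = ["tile_h00v08.hdf", "tile_h01v08.hdf"]

with ThreadPoolExecutor(max_workers=8) as pool:
    tiles = list(pool.map(read_tile, paths))

print(len(tiles))  # -> 2, one result per granule
```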

Hi @briardew - wondering if you figured out how to move past the error.
I just bumped into something similar. Thanks!

Oh I see, open_mfdataset isn’t geared to grid-resolving methods. The opendatacube and stackstac and related tools are though. I’m not going to try to represent them because my python-fu is poor, but I expect you could set this up to virtualize the warp of each first and then load those. EDIT: you don’t need warp to merge because these are regular grids already.

I’m still more comfortable closer to the GDAL API itself… so I’m looking at the files now. I didn’t expect this, but they are already grid-resolved, on a global Sinusoidal grid. I think you can simplify this a lot; I was expecting to see un-referenced arrays with geolocation coordinate arrays on each variable.

See how with just one variable (subdataset in GDAL terms) we have a regular grid in the Sinusoidal global MODIS projection:

gdalinfo HDF4_EOS:EOS_GRID:"MCD43A4.A2001001.h00v08.061.2020079134124.hdf":MOD_Grid_BRDF:Nadir_Reflectance_Band1
Driver: HDF4Image/HDF4 Dataset
Files: MCD43A4.A2001001.h00v08.061.2020079134124.hdf
Size is 2400, 2400
Coordinate System is:
PROJCRS["unnamed",
    BASEGEOGCRS["Unknown datum based upon the custom spheroid",
        DATUM["Not specified (based on custom spheroid)",
            ELLIPSOID["Custom spheroid",6371007.181,0,
                LENGTHUNIT["metre",1,
                    ID["EPSG",9001]]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433,
                ID["EPSG",9122]]]],
    CONVERSION["Sinusoidal",
        METHOD["Sinusoidal"],
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["False easting",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["Meter",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["Meter",1]]]

Data axis to CRS axis mapping: 1,2
Origin = (-20015109.353999998420477,1111950.519667000044137)
Pixel Size = (463.312716527916507,-463.312716527916677)
Metadata:
  add_offset=0
  add_offset_err=0
  ALBEDOFILEID=06121997
  ASSOCIATEDINSTRUMENTSHORTNAME.1=MODIS
  ASSOCIATEDINSTRUMENTSHORTNAME.2=MODIS
  ASSOCIATEDPLATFORMSHORTNAME.1=Terra
  ASSOCIATEDPLATFORMSHORTNAME.2=Aqua
  ASSOCIATEDSENSORSHORTNAME.1=MODIS
  ASSOCIATEDSENSORSHORTNAME.2=MODIS
  AUTOMATICQUALITYFLAG.1=Passed
  AUTOMATICQUALITYFLAGEXPLANATION.1=Passed was set as a default value. More algorithm will be developed
  AVERAGENUMBEROBS=0
  BRDFCODEID=AMBRALS_V4.0R1
  BRDFDATABASEVERSION=v1.0500m
  calibrated_nt=5
  CHARACTERISTICBINANGULARSIZE=15.0
  CHARACTERISTICBINSIZE=463.312716527778
  COVERAGECALCULATIONMETHOD=volume
  DATACOLUMNS=2400
  DATAROWS=2400
  DAYNIGHTFLAG=Day
  DESCRREVISION=6.1
  EASTBOUNDINGCOORDINATE=-169.991666651401
  EXCLUSIONGRINGFLAG.1=N
  GEOANYABNORMAL=False
  GEOESTMAXRMSERROR=50.0
  GLOBALGRIDCOLUMNS=86400
  GLOBALGRIDROWS=43200
  GRINGPOINTLATITUDE.1=-0.0068357003079556, 9.99897831672069, 9.9909309627606, 5.67994760615426e-06
  GRINGPOINTLONGITUDE.1=-179.999951582871, 179.928473473918, -169.920147289013, -169.99173290556
  GRINGPOINTSEQUENCENO.1=1, 2, 3, 4
  HDFEOSVersion=HDFEOS_V2.19
  HORIZONTALTILENUMBER=0
  identifier_product_doi=10.5067/MODIS/MCD43A4.061
  identifier_product_doi=10.5067/MODIS/MCD43A4.061
  identifier_product_doi_authority=http://dx.doi.org
  identifier_product_doi_authority=http://dx.doi.org
  INPUTPOINTER=MOD09GA.A2000359.h00v08.061.2020057131732.hdf, MOD09GA.A2000360.h00v08.061.2020057134140.hdf, MOD09GA.A2000361.h00v08.061.2020057140528.hdf, MOD09GA.A2000362.h00v08.061.2020057143634.hdf, MOD09GA.A2000363.h00v08.061.2020057145922.hdf, MOD09GA.A2000364.h00v08.061.2020057161816.hdf, MOD09GA.A2000365.h00v08.061.2020058160732.hdf, MOD09GA.A2000366.h00v08.061.2020058173939.hdf, MOD09GA.A2001001.h00v08.061.2020058184548.hdf, MOD09GA.A2001002.h00v08.061.2020058194846.hdf, MOD09GA.A2001003.h00v08.061.2020058205805.hdf, MOD09GA.A2001004.h00v08.061.2020058221454.hdf, MOD09GA.A2001005.h00v08.061.2020058231539.hdf, MOD09GA.A2001006.h00v08.061.2020059000035.hdf, MOD09GA.A2001007.h00v08.061.2020059002551.hdf, MOD09GA.A2001008.h00v08.061.2020059004730.hdf, MCD43DB.A2000001.h00v08.061.2020078195040.hdf
  LOCALGRANULEID=MCD43A4.A2001001.h00v08.061.2020079134124.hdf
  LOCALVERSIONID=6.1.34
  LONGNAME=MODIS/Terra+Aqua BRDF/Albedo Nadir BRDF-Adjusted Ref Daily L3 Global - 500m
  long_name=Nadir_Reflectance_Band1
  MAXIMUMOBSERVATIONS=0
  NADIRDATARESOLUTION=500m
  NORTHBOUNDINGCOORDINATE=9.99999999910196
  NUMBEROFGRANULES=1
  PARAMETERNAME.1=NOT SET
  PERCENTLANDINTILE=0
  PERCENTNEWBRDFS=0
  PERCENTPROCESSEDINTILE=0
  PERCENTSHAPEFIXEDBRDFS=0
  PERCENTSUBSTITUTEBRDFS=0
  PGEVERSION=6.1.9
  PROCESSINGCENTER=MODAPS
  PROCESSINGENVIRONMENT=Linux minion7433 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  PRODUCTIONDATETIME=2020-03-19T13:41:43.000Z
  QAPERCENTGOODQUALITY=0
  QAPERCENTNOTPRODUCEDCLOUD=0
  QAPERCENTNOTPRODUCEDOTHER=0
  QAPERCENTOTHERQUALITY=0
  RANGEBEGINNINGDATE=2000-12-24
  RANGEBEGINNINGTIME=00:00:00.000000
  RANGEENDINGDATE=2001-01-08
  RANGEENDINGTIME=23:59:59.999999
  REPROCESSINGACTUAL=reprocessed
  REPROCESSINGPLANNED=further update is anticipated
  scale_factor=0.0001
  scale_factor_err=0
  SCIENCEQUALITYFLAG.1=Not Investigated
  SCIENCEQUALITYFLAGEXPLANATION.1=See http://landweb.nascom/nasa.gov/cgi-bin/QA_WWW/qaFlagPage.cgi?sat=aqua the product Science Quality status.
  SETUPFILEID=06121997
  SHORTNAME=MCD43A4
  SOUTHBOUNDINGCOORDINATE=0.0
  SPSOPARAMETERS=2015
  TileID=51000008
  units=reflectance, no units
  valid_range=0, 32766
  VERSIONID=61
  VERTICALTILENUMBER=8
  WESTBOUNDINGCOORDINATE=-179.999999983835
  _FillValue=32767
Corner Coordinates:
Upper Left  (-20015109.354, 1111950.520) (177d13'23.56"E, 10d 0' 0.00"N)
Lower Left  (-20015109.354,       0.000) (180d 0' 0.00"W,  0d 0' 0.01"N)
Upper Right (-18903158.834, 1111950.520) (172d37'21.09"W, 10d 0' 0.00"N)
Lower Right (-18903158.834,       0.000) (170d 0' 0.00"W,  0d 0' 0.01"N)
Center      (-19459134.094,  555975.260) (175d40' 6.50"W,  5d 0' 0.00"N)
Band 1 Block=2400x100 Type=Int16, ColorInterp=Gray
  Description = Nadir_Reflectance_Band1
  NoData Value=32767
  Unit Type: reflectance, no units
  Offset: 0,   Scale:0.0001

I think you can use gdalbuildvrt (or its equivalent in the API or rasterio) to simply merge these. I’ll explore.

Note that building the VRT can’t be done by variable name (unless using the fully expanded subdataset DSN); it has to be ‘-sd ’ with the 1-based index of the variable, here 8 for ‘Nadir_Reflectance_Band1’.
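If you’d rather not hard-code that index, gdal.Open(f).GetSubDatasets() returns (name, description) pairs in the same order, so a small helper can derive the 1-based ‘-sd’ value. Sketched here against a synthetic list, since running it end to end needs a real file:

```python
def sd_index(subdatasets, var):
    """Return the 1-based -sd index of the subdataset named var.

    subdatasets: list of (name, description) pairs, as returned by
    gdal.Open(path).GetSubDatasets().
    """
    for i, (name, _desc) in enumerate(subdatasets, start=1):
        if name.endswith(":" + var):
            return i
    raise KeyError(var)


# Synthetic example -- real names come from GetSubDatasets():
fake = [
    ('HDF4_EOS:EOS_GRID:"f.hdf":MOD_Grid_BRDF:BRDF_Albedo_Band_Mandatory_Quality_Band1', ''),
    ('HDF4_EOS:EOS_GRID:"f.hdf":MOD_Grid_BRDF:Nadir_Reflectance_Band1', ''),
]
print(sd_index(fake, "Nadir_Reflectance_Band1"))  # -> 2
```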

gdalbuildvrt -sd 8 Nadir_Reflectance_Band1.vrt e4ftl01.cr.usgs.gov/MOTA/MCD43A4.061/2001.01.01/*.hdf

Then open_dataset can load:

xr.open_dataset("Nadir_Reflectance_Band1.vrt")

<xarray.Dataset> Size: 13GB
Dimensions:      (band: 1, x: 86400, y: 38400)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 691kB -2.001e+07 -2.001e+07 ... 2.001e+07 2.001e+07
  * y            (y) float64 307kB 7.783e+06 7.783e+06 ... -1.001e+07 -1.001e+07
    spatial_ref  int64 8B ...
Data variables:
    band_data    (band, y, x) float32 13GB ...

I would more likely warp that VRT to the target grid, then open it (I don’t know if there are ways to template adding the other variables, other than loading each DataArray and putting them together).
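Putting per-variable loads together is just an xr.merge over named DataArrays. A tiny synthetic sketch of the pattern, with the arrays standing in for what each warped VRT would yield:

```python
import numpy as np
import xarray as xr


def fake_band(name):
    # Stand-in for xr.open_dataset("<name>.vrt", engine="rasterio").band_data
    return xr.DataArray(np.zeros((2, 3)), dims=("y", "x"), name=name)


# Merge one named DataArray per variable into a single Dataset
ds = xr.merge([fake_band("Band1"), fake_band("Band2")])
print(sorted(ds.data_vars))  # -> ['Band1', 'Band2']
```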

Also, I’m not across other methods of mosaicing for xarray, so I won’t speculate - I appreciate this example, it’s relevant to avenues I’m exploring (and I’ve set up a feature request to be able to name ‘-sd’ for those simple cases). Another option in future GDAL will be to index these with the GTI driver.

It’s unfortunate that the data is spread over so many files, and really wasteful to have the entire sinusoidal grid materialized (so don’t do that) - but I guess it reflects that the entire set can’t be stored sparsely in HDF4(?), and that the project adopted sinusoidal from the ragged-array L3bin scheme from SeaWiFS, AFAIK.

One option to warp to materialized form is (just guessing at a sensible resolution; using sparse blocks and compression gives a 1.9 GB tif):

gdalwarp Nadir_Reflectance_Band1.vrt Nadir_Reflectance_Band1.tif -multi -wo NUM_THREADS=ALL_CPUS -t_srs EPSG:4326 -tr 0.004 0.004 -te -180 -90 180 90  -co COMPRESS=LZW -co SPARSE_OK=YES -of COG

## Creating output file that is 90000P x 45000L.

But set a VRT target rather than a tif, and then it will be a text description of the warp job that only materializes as needed after load.

I might be oversharing, but happy to be learning slowly - I created the warped VRT, then lazy-loaded it and ran a dask query, and it all seems fine. I’m not sure if this can be done purely with rasterio but I’d like to find out.

import glob
import tempfile

import xarray as xr
from osgeo import gdal

gdal.UseExceptions()

## I put the 299 HDF4 files into a local dir "pangeo_hdf":
hdf = glob.glob("pangeo_hdf/*.hdf")

band1_src = [f"HDF4_EOS:EOS_GRID:\"{file}\":MOD_Grid_BRDF:Nadir_Reflectance_Band1" for file in hdf]
band2_src = [f"HDF4_EOS:EOS_GRID:\"{file}\":MOD_Grid_BRDF:Nadir_Reflectance_Band2" for file in hdf]

## we can capture the VRT text in memory here
band1 = gdal.BuildVRT("", band1_src)
band2 = gdal.BuildVRT("", band2_src)

opts = gdal.WarpOptions(outputBounds = [-180, -90, 180, 90], width = 86400, height = 43200, dstSRS = "EPSG:4326", format = "VRT")

wband1vrt = tempfile.NamedTemporaryFile(suffix = ".vrt", prefix = "band1_")
wband2vrt = tempfile.NamedTemporaryFile(suffix = ".vrt", prefix = "band2_")

## but we can't chain that VRT into this WarpedVRT, so we actually create files here
w1 = gdal.Warp(wband1vrt.name, band1.GetMetadata("xml:VRT")[0], options = opts)
w2 = gdal.Warp(wband2vrt.name, band2.GetMetadata("xml:VRT")[0], options = opts)

## destroy so that file gets flushed into existence
w1 = None
w2 = None

## literal virtual files
ds = xr.merge([xr.open_dataset(wband1vrt.name,  chunks = {}, engine = "rasterio").band_data.rename("Band1"), 
               xr.open_dataset(wband2vrt.name,  chunks = {}, engine = "rasterio").band_data.rename("Band2")])


## takes about 4 minutes on 48 cores
print(ds.Band1.mean().values)