How to organize data from polar-orbiting satellites and irregular grids into a datacube format?

I have been exploring the use of VirtualiZarr and Icechunk to manage large satellite datasets efficiently using reference-based storage. I successfully created a virtual dataset for geostationary satellite imagery (INSAT) using VirtualiZarr and stored its references using Icechunk. This allowed multi-file geostationary HDF5 data to be read through Xarray without converting the source files into Zarr.
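
For context, this is roughly what that geostationary workflow looked like. It is a sketch only: exact function and accessor names vary across VirtualiZarr/Icechunk releases, and the file names and store path below are placeholders, not my actual setup.

import xarray as xr
import icechunk
from virtualizarr import open_virtual_dataset

# Hypothetical INSAT granules; only chunk references are created, no data is copied.
files = ["insat_0000.h5", "insat_0100.h5"]
virtual = [open_virtual_dataset(f) for f in files]
vds = xr.concat(virtual, dim="time", coords="minimal", compat="override")

# Store the references in an Icechunk repository instead of kerchunk JSON.
storage = icechunk.local_filesystem_storage("./insat_refs")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store)   # writes references, not pixels
session.commit("add INSAT virtual references")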

Applying the same workflow to polar-orbiting GPM DPR Ku Precipitation Profile (2A, Version 7) revealed several additional challenges:

  • The dataset does not use a fixed grid.
  • Each scan line has different latitude/longitude values, so the spatial grid varies for every scan.
  • The HDF5 structure is deeply nested, for example under /FS/VERENV, /FS/ScanTime, etc.
  • When the dataset is opened directly using Xarray, the result appears empty because the data variables are not at the root level:
import xarray as xr

ds = xr.open_dataset(
    "/home/path/2A-ENV.GPM.Ku.HDF5"
)
print(ds)

Output:

<xarray.Dataset>
Dimensions:  ()
Data variables: *empty*

To view the data, a specific HDF5 group has to be manually provided, for example:

xr.open_dataset(path, group="FS/CSF")
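
The whole hierarchy can also be inspected at once instead of guessing group names. A sketch, assuming a recent Xarray with open_datatree (older stacks can use the separate datatree package):

import xarray as xr

path = "/home/path/2A-ENV.GPM.Ku.HDF5"

# Requires xarray >= 2024.10; older environments: `from datatree import open_datatree`
dt = xr.open_datatree(path)
print(dt)   # lists every group: FS/SLV, FS/ScanTime, FS/navigation, ...

# Individual groups can still be pulled out as ordinary Datasets.
slv = xr.open_dataset(path, group="FS/SLV")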

Structural Characteristics of the Dataset

The GPM file contains nested groups whose arrays have irregular dimensions, which can be seen by walking the HDF5 hierarchy.

Code I used:
import h5py

path = "/path/2A.GPM.Ku.V9-20240130.20251123-S214125-E231437.066614.V07C.HDF5"

with h5py.File(path, "r") as f:

    # Walk the full HDF5 hierarchy and print every group and dataset.
    def explore(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"  [DATASET] {name} -> shape={obj.shape}, dtype={obj.dtype}")
        elif isinstance(obj, h5py.Group):
            print(f"[GROUP] {name}")

    f.visititems(explore)

    # Per-scan time components (one value per scan line).
    year = f['FS']['ScanTime']['Year'][:]
    month = f['FS']['ScanTime']['Month'][:]
    day = f['FS']['ScanTime']['DayOfMonth'][:]
    hour = f['FS']['ScanTime']['Hour'][:]
    minute = f['FS']['ScanTime']['Minute'][:]
    second = f['FS']['ScanTime']['Second'][:]
    millisecond = f['FS']['ScanTime']['MilliSecond'][:]
    print("Year:", year[:5])
    print("Month:", month[:5])
    print("Day:", day[:5])
    print("Hour:", hour[:5])
    print("Minute:", minute[:5])
    print("Second:", second[:5])
    print("Millisecond:", millisecond[:5])

    # 2-D geolocation arrays: one lat/lon pair per (scan, ray) footprint.
    gpm_lat = f['FS']['Latitude'][:]
    gpm_lon = f['FS']['Longitude'][:]
    print("Latitude sample:", gpm_lat[0, :5])
    print("Longitude sample:", gpm_lon[0, :5])

    # A 2-D retrieved variable on the same (scan, ray) footprint.
    precipRateNearSurface = f['FS']['SLV']['precipRateNearSurface'][:]
    print("precipRateNearSurface sample:", precipRateNearSurface[0, :5])

  
Output:
[DATASET] AlgorithmRuntimeInfo -> shape=(1,), dtype=|S1173
[GROUP] FS
[GROUP] FS/CSF
  [DATASET] FS/CSF/binBBBottom -> shape=(7991, 49), dtype=int16
  [DATASET] FS/CSF/binBBPeak -> shape=(7991, 49), dtype=int16
  [DATASET] FS/CSF/binBBTop -> shape=(7991, 49), dtype=int16
  [DATASET] FS/CSF/binHeavyIcePrecipBottom -> shape=(7991, 49), dtype=int16
  [DATASET] FS/CSF/binHeavyIcePrecipTop -> shape=(7991, 49), dtype=int16
  [DATASET] FS/CSF/flagAnvil -> shape=(7991, 49), dtype=int8
  [DATASET] FS/CSF/flagBB -> shape=(7991, 49), dtype=int32
  [DATASET] FS/CSF/flagHeavyIcePrecip -> shape=(7991, 49), dtype=int8
  [DATASET] FS/CSF/flagShallowRain -> shape=(7991, 49), dtype=int32
  [DATASET] FS/CSF/heightBB -> shape=(7991, 49), dtype=float32
  [DATASET] FS/CSF/nHeavyIcePrecip -> shape=(7991, 49), dtype=uint8
  [DATASET] FS/CSF/qualityBB -> shape=(7991, 49), dtype=int32
  [DATASET] FS/CSF/qualityTypePrecip -> shape=(7991, 49), dtype=int32
  [DATASET] FS/CSF/typePrecip -> shape=(7991, 49), dtype=int32
  [DATASET] FS/CSF/widthBB -> shape=(7991, 49), dtype=float32
[GROUP] FS/DSD
  [DATASET] FS/DSD/binNode -> shape=(7991, 49, 5), dtype=int16
  [DATASET] FS/DSD/paramRDm -> shape=(7991, 49, 5), dtype=float32
  [DATASET] FS/DSD/phase -> shape=(7991, 49, 176), dtype=uint8
[GROUP] FS/Experimental
  [DATASET] FS/Experimental/precipRateESurface2 -> shape=(7991, 49), dtype=float32
  [DATASET] FS/Experimental/precipRateESurface2Status -> shape=(7991, 49), dtype=uint8
  [DATASET] FS/Experimental/seaIceConcentration -> shape=(7991, 49), dtype=float32
  [DATASET] FS/Experimental/sigmaZeroProfile -> shape=(7991, 49, 7), dtype=float32
[GROUP] FS/FLG
  [DATASET] FS/FLG/flagEcho -> shape=(7991, 49, 176), dtype=int8
  [DATASET] FS/FLG/flagScanPattern -> shape=(7991,), dtype=int16
  [DATASET] FS/FLG/flagSensor -> shape=(7991,), dtype=int8
  [DATASET] FS/FLG/qualityData -> shape=(7991, 49), dtype=int32
  [DATASET] FS/FLG/qualityFlag -> shape=(7991, 49), dtype=int8
  [DATASET] FS/Latitude -> shape=(7991, 49), dtype=float32
  [DATASET] FS/Longitude -> shape=(7991, 49), dtype=float32
[GROUP] FS/PRE
  [DATASET] FS/PRE/adjustFactor -> shape=(7991, 49), dtype=float32
  [DATASET] FS/PRE/binClutterFreeBottom -> shape=(7991, 49), dtype=int16
  [DATASET] FS/PRE/binMirrorImageL2 -> shape=(7991, 49), dtype=int16
  [DATASET] FS/PRE/binRealSurface -> shape=(7991, 49), dtype=int16
  [DATASET] FS/PRE/binStormTop -> shape=(7991, 49), dtype=int16
  [DATASET] FS/PRE/echoCountRealSurface -> shape=(7991, 49), dtype=uint8
  [DATASET] FS/PRE/elevation -> shape=(7991, 49), dtype=float32
  [DATASET] FS/PRE/ellipsoidBinOffset -> shape=(7991, 49), dtype=float32
  [DATASET] FS/PRE/flagPrecip -> shape=(7991, 49), dtype=int32
  [DATASET] FS/PRE/flagSigmaZeroSaturation -> shape=(7991, 49), dtype=uint8
  [DATASET] FS/PRE/height -> shape=(7991, 49, 176), dtype=float32
  [DATASET] FS/PRE/heightStormTop -> shape=(7991, 49), dtype=float32
  [DATASET] FS/PRE/landSurfaceType -> shape=(7991, 49), dtype=int32
  [DATASET] FS/PRE/localZenithAngle -> shape=(7991, 49), dtype=float32
  [DATASET] FS/PRE/sigmaZeroMeasured -> shape=(7991, 49), dtype=float32
  [DATASET] FS/PRE/snRatioAtRealSurface -> shape=(7991, 49), dtype=float32
  [DATASET] FS/PRE/snowIceCover -> shape=(7991, 49), dtype=int8
  [DATASET] FS/PRE/zFactorMeasured -> shape=(7991, 49, 176), dtype=float32
[GROUP] FS/SLV
  [DATASET] FS/SLV/binEchoBottom -> shape=(7991, 49), dtype=int16
  [DATASET] FS/SLV/epsilon -> shape=(7991, 49, 176), dtype=float32
  [DATASET] FS/SLV/flagSLV -> shape=(7991, 49, 176), dtype=int8
  [DATASET] FS/SLV/paramDSD -> shape=(7991, 49, 176, 2), dtype=float32
  [DATASET] FS/SLV/paramNUBF -> shape=(7991, 49, 3), dtype=float32
  [DATASET] FS/SLV/phaseNearSurface -> shape=(7991, 49), dtype=uint8
  [DATASET] FS/SLV/piaFinal -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SLV/piaOffset -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SLV/precipRate -> shape=(7991, 49, 176), dtype=float32
  [DATASET] FS/SLV/precipRateAve24 -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SLV/precipRateESurface -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SLV/precipRateNearSurface -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SLV/precipWater -> shape=(7991, 49, 176), dtype=float32
  [DATASET] FS/SLV/precipWaterIntegrated -> shape=(7991, 49, 2), dtype=float32
  [DATASET] FS/SLV/qualitySLV -> shape=(7991, 49), dtype=int32
  [DATASET] FS/SLV/sigmaZeroCorrected -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SLV/zFactorFinal -> shape=(7991, 49, 176), dtype=float32
  [DATASET] FS/SLV/zFactorFinalESurface -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SLV/zFactorFinalNearSurface -> shape=(7991, 49), dtype=float32
[GROUP] FS/SRT
  [DATASET] FS/SRT/PIAalt -> shape=(7991, 49, 6), dtype=float32
  [DATASET] FS/SRT/PIAhb -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SRT/PIAhybrid -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SRT/PIAweight -> shape=(7991, 49, 6), dtype=float32
  [DATASET] FS/SRT/PIAweightHY -> shape=(7991, 49, 2), dtype=float32
  [DATASET] FS/SRT/RFactorAlt -> shape=(7991, 49, 6), dtype=float32
  [DATASET] FS/SRT/pathAtten -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SRT/refScanID -> shape=(7991, 49, 2, 2), dtype=int16
  [DATASET] FS/SRT/reliabFactor -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SRT/reliabFactorHY -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SRT/reliabFlag -> shape=(7991, 49), dtype=int16
  [DATASET] FS/SRT/reliabFlagHY -> shape=(7991, 49), dtype=int16
  [DATASET] FS/SRT/stddevEff -> shape=(7991, 49, 3), dtype=float32
  [DATASET] FS/SRT/stddevHY -> shape=(7991, 49), dtype=float32
  [DATASET] FS/SRT/zeta -> shape=(7991, 49), dtype=float32
[GROUP] FS/ScanTime
  [DATASET] FS/ScanTime/DayOfMonth -> shape=(7991,), dtype=int8
  [DATASET] FS/ScanTime/DayOfYear -> shape=(7991,), dtype=int16
  [DATASET] FS/ScanTime/Hour -> shape=(7991,), dtype=int8
  [DATASET] FS/ScanTime/MilliSecond -> shape=(7991,), dtype=int16
  [DATASET] FS/ScanTime/Minute -> shape=(7991,), dtype=int8
  [DATASET] FS/ScanTime/Month -> shape=(7991,), dtype=int8
  [DATASET] FS/ScanTime/Second -> shape=(7991,), dtype=int8
  [DATASET] FS/ScanTime/SecondOfDay -> shape=(7991,), dtype=float64
  [DATASET] FS/ScanTime/Year -> shape=(7991,), dtype=int16
[GROUP] FS/VER
  [DATASET] FS/VER/airTemperature -> shape=(7991, 49, 176), dtype=float32
  [DATASET] FS/VER/attenuationNP -> shape=(7991, 49, 176), dtype=float32
  [DATASET] FS/VER/binZeroDeg -> shape=(7991, 49), dtype=int16
  [DATASET] FS/VER/binZeroDegSecondary -> shape=(7991, 49), dtype=int16
  [DATASET] FS/VER/flagInversion -> shape=(7991, 49), dtype=int16
  [DATASET] FS/VER/heightZeroDeg -> shape=(7991, 49), dtype=float32
  [DATASET] FS/VER/piaNP -> shape=(7991, 49, 4), dtype=float32
  [DATASET] FS/VER/piaNPrainFree -> shape=(7991, 49, 4), dtype=float32
  [DATASET] FS/VER/sigmaZeroNPCorrected -> shape=(7991, 49), dtype=float32
[GROUP] FS/navigation
  [DATASET] FS/navigation/dprAlt -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/greenHourAng -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scAlt -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scAttPitchGeoc -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scAttPitchGeod -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scAttRollGeoc -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scAttRollGeod -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scAttYawGeoc -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scAttYawGeod -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scHeadingGround -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scHeadingOrbital -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scLat -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scLon -> shape=(7991,), dtype=float32
  [DATASET] FS/navigation/scPos -> shape=(7991, 3), dtype=float32
  [DATASET] FS/navigation/scVel -> shape=(7991, 3), dtype=float32
  [DATASET] FS/navigation/timeMidScan -> shape=(7991,), dtype=float64
  [DATASET] FS/navigation/timeMidScanOffset -> shape=(7991,), dtype=float64
[GROUP] FS/scanStatus
  [DATASET] FS/scanStatus/FractionalGranuleNumber -> shape=(7991,), dtype=float64
  [DATASET] FS/scanStatus/SCorientation -> shape=(7991,), dtype=int16
  [DATASET] FS/scanStatus/acsModeMidScan -> shape=(7991,), dtype=int8
  [DATASET] FS/scanStatus/dataQuality -> shape=(7991,), dtype=int8
  [DATASET] FS/scanStatus/dataWarning -> shape=(7991,), dtype=int8
  [DATASET] FS/scanStatus/geoError -> shape=(7991,), dtype=int16
  [DATASET] FS/scanStatus/geoWarning -> shape=(7991,), dtype=int16
  [DATASET] FS/scanStatus/limitErrorFlag -> shape=(7991,), dtype=int8
  [DATASET] FS/scanStatus/missing -> shape=(7991,), dtype=int8
  [DATASET] FS/scanStatus/modeStatus -> shape=(7991,), dtype=int8
  [DATASET] FS/scanStatus/operationalMode -> shape=(7991,), dtype=int8
  [DATASET] FS/scanStatus/pointingStatus -> shape=(7991,), dtype=int16
  [DATASET] FS/scanStatus/targetSelectionMidScan -> shape=(7991,), dtype=int8
  [DATASET] FS/sunLocalTime -> shape=(7991, 49), dtype=float32


Year: [2025 2025 2025 2025 2025]
Month: [11 11 11 11 11]
Day: [23 23 23 23 23]
Hour: [21 21 21 21 21]
Minute: [41 41 41 41 41]
Second: [26 26 27 28 28]
Millisecond: [ 22 722 422 122 822]


Latitude sample: [-63.857487 -63.912132 -63.96672  -64.020584 -64.07394 ]
Longitude sample: [157.35304 157.35732 157.36151 157.3656  157.36958]


precipRateNearSurface sample: [0. 0. 0. 0. 0.]


There is no static two-dimensional grid. The array shape is based on scan lines:
number of scans × number of rays × vertical levels.

Unlike geostationary data, it is not sufficient to simply wrap all files with VirtualiZarr, because the metadata must first be interpreted to produce a coherent data model.
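
To make that "interpret the metadata" step concrete, here is one way a single granule could be turned into a self-describing Xarray object, with a time coordinate built from FS/ScanTime and the 2-D geolocation arrays attached as coordinates rather than dimensions. The dimension names scan/ray/bin are my own choice, not something stored in the file:

import h5py
import numpy as np
import xarray as xr

path = "/path/2A.GPM.Ku.V9-20240130.20251123-S214125-E231437.066614.V07C.HDF5"

with h5py.File(path, "r") as f:
    st = f["FS"]["ScanTime"]
    # One datetime64 value per scan line, built from the split time fields.
    time = np.array([
        np.datetime64(f"{y:04d}-{mo:02d}-{d:02d}T{h:02d}:{mi:02d}:{s:02d}")
        + np.timedelta64(int(ms), "ms")
        for y, mo, d, h, mi, s, ms in zip(
            st["Year"][:], st["Month"][:], st["DayOfMonth"][:],
            st["Hour"][:], st["Minute"][:], st["Second"][:], st["MilliSecond"][:])
    ])

    ds = xr.Dataset(
        data_vars={
            "precipRateNearSurface": (("scan", "ray"), f["FS"]["SLV"]["precipRateNearSurface"][:]),
            "zFactorFinal": (("scan", "ray", "bin"), f["FS"]["SLV"]["zFactorFinal"][:]),
        },
        coords={
            "time": ("scan", time),
            "lat": (("scan", "ray"), f["FS"]["Latitude"][:]),
            "lon": (("scan", "ray"), f["FS"]["Longitude"][:]),
        },
    )

print(ds)   # dims: scan, ray, bin; lat/lon/time attached as coordinates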

How can we work with this type of data to build datacubes without actually converting the original files?


This is a really interesting post and a hard problem. Fundamentally, your data don’t form a data cube. This type of irregular swath data is not a great fit for the Xarray data model.

We’ve been working on some ideas for how to handle this and other level 2 satellite data in Zarr. The key question is what you actually want to do with the data. Could you say a bit more about the analysis you’re hoping to perform?

I was trying to run spatio-temporal subsetting queries on this kind of data: from all timestamps and swaths, get the data for the region defined by lat_min, lat_max, lon_min, lon_max and between times t1 and t2 (this can be done with NumPy, but loading the files with Xarray gives a much better overview). The query should return all the satellite overpasses in that region during the specified time window. I also wanted to create reference files for the data so I don’t have to duplicate anything; the referenced data could then be used in any way required without copying the originals. That would also make it easier to combine with other satellites, to find intersecting regions and check whether they match spatio-temporally.
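
To make the query concrete, this is roughly what I mean on a single granule, with plain NumPy masks over the 2-D latitude/longitude arrays (the bounds and time window below are just example values):

import h5py
import numpy as np

path = "/path/2A.GPM.Ku.V9-20240130.20251123-S214125-E231437.066614.V07C.HDF5"

# Example query: a lat/lon box and a time window.
lat_min, lat_max = -10.0, 10.0
lon_min, lon_max = 70.0, 90.0
t1 = np.datetime64("2025-11-23T21:45")
t2 = np.datetime64("2025-11-23T22:15")

with h5py.File(path, "r") as f:
    lat = f["FS"]["Latitude"][:]                        # (nscan, nray)
    lon = f["FS"]["Longitude"][:]
    rain = f["FS"]["SLV"]["precipRateNearSurface"][:]
    st = f["FS"]["ScanTime"]
    scan_time = np.array([
        np.datetime64(f"{y:04d}-{mo:02d}-{d:02d}T{h:02d}:{mi:02d}:{s:02d}")
        for y, mo, d, h, mi, s in zip(
            st["Year"][:], st["Month"][:], st["DayOfMonth"][:],
            st["Hour"][:], st["Minute"][:], st["Second"][:])
    ])

# Scans inside the time window, broadcast against footprints inside the box.
in_time = (scan_time >= t1) & (scan_time <= t2)         # (nscan,)
in_box = ((lat >= lat_min) & (lat <= lat_max) &
          (lon >= lon_min) & (lon <= lon_max))           # (nscan, nray)
mask = in_time[:, None] & in_box
print("matching footprints:", int(mask.sum()))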

We’ve been using parquet for polar orbiters and other point-cloud-like data. It’s quite nice both for reading and writing. Typically I partition files across days/sensors using hive partitioning, and within each parquet file you can further optimize query performance by writing row groups covering smaller time ranges. The parquet file stores the min and max value of each column within each row group, so readers can skip loading unnecessary data. Library support is excellent, with tools like duckdb and polars for the high-level stuff and pyarrow for squeezing out maximum performance and ETL.
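
A rough sketch of that layout, reusing the arrays from the NumPy snippet above (the column names and partitioning scheme are only illustrative, not a fixed convention):

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb

# One row per footprint; scan_time, lat, lon, rain come from the NumPy snippet above.
table = pa.table({
    "time": np.repeat(scan_time, lat.shape[1]),
    "lat": lat.ravel(),
    "lon": lon.ravel(),
    "precipRateNearSurface": rain.ravel(),
})

# Hive-style directory keys (date/sensor); modest row groups keep the per-group
# min/max statistics tight so readers can skip irrelevant data.
out_dir = "gpm_parquet/date=2025-11-23/sensor=GPM_DPR_Ku"
os.makedirs(out_dir, exist_ok=True)
pq.write_table(table, f"{out_dir}/part-0.parquet", row_group_size=50_000)

# Spatio-temporal query directly on the files with duckdb.
hits = duckdb.sql("""
    SELECT time, lat, lon, precipRateNearSurface
    FROM read_parquet('gpm_parquet/**/*.parquet', hive_partitioning = true)
    WHERE lat BETWEEN -10 AND 10 AND lon BETWEEN 70 AND 90
      AND time BETWEEN TIMESTAMP '2025-11-23 21:45' AND TIMESTAMP '2025-11-23 22:15'
""").df()
print(hits.head())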

You may be interested in this dataset: Brightband launches NNJA-AI dataset

I actually much prefer doing ETL to parquet than to zarr. With typical zarr, one needs to first initialize the dataset before writing data to it, which requires knowing a lot of information up front, like the size of the arrays and a uniform chunk size; then we need to map our coordinate values to integer indices to assign them into the zarr store.

This initialization step also makes scripting the pipeline more complex. With parquet, on the other hand, we can just launch our workers in parallel and make sure they aren’t writing to the same files. Then, if we like, we can build an index later if we need more efficient querying.
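
For contrast, this is roughly what the up-front zarr initialization looks like through xarray; the store path, total size, and chunk sizes below are arbitrary, and the lazy template needs dask installed:

import numpy as np
import xarray as xr

n_scan_total, n_ray = 500_000, 49           # final array size must be known up front

# Write the store metadata only; no data yet.
template = xr.Dataset(
    {"precipRateNearSurface": (("scan", "ray"),
                               np.zeros((n_scan_total, n_ray), dtype="float32"))}
).chunk({"scan": 8192})
template.to_zarr("gpm_cube.zarr", mode="w", compute=False)

# Each worker must then map its scans to integer offsets, ideally chunk-aligned.
start, stop = 0, 8192
part = template.isel(scan=slice(start, stop))     # stand-in for one granule's data
part.to_zarr("gpm_cube.zarr", region={"scan": slice(start, stop)})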

With odc, the core is the GDAL C++ warper API; it’s only used like a very long lever via rasterio, but the crux in odc is to define the output chunks relative to the input chunks, a slight tweak over what gdalwarp already does.

If you can arrange your inputs into STAC items that rasterio can understand and that reconcile through the warper, then all good, odc can probably do all the work; otherwise you’ll need to recreate things a bit more closely against the API itself.

I will check that out, are there any tutorials available for that? I tried this in Zarr yesterday: the polar satellite grid is not consistent, but the swath width is, so I used the dimensions time × swath_width × (any 3rd dim) and saved lat/lon as variables per timestamp, which gives a continuous structure. The 3rd dim is not of consistent size, so I just padded it to the highest number available and noted in the attrs how to use it.

Tutorials, I don’t think so. odc is aiming for multidimensional input in future (not just output), but I already know it will struggle to get access to everything GDAL can do through rasterio.

HDF5 is a bit undercooked in GDAL, but you can often augment sources fairly simply. Especially since you are talking about data that’s not on a regular grid, it’s probably mostly about making sure the dataset GDAL opens recognizes the coordinates as “geolocation arrays”. I’m interested to explore if you’ll indulge me with example data. I’m light on odc experience, but I know how it works under the hood, have good relationships with its creators, and there’s heaps more potential there :innocent:
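
For reference, the kind of augmentation I have in mind looks roughly like this with the GDAL Python bindings: open the HDF5 subdataset, attach the lat/lon arrays through the GEOLOCATION metadata domain, and let the warper resample the swath onto a regular grid. I haven’t run this against your particular product, so treat it as a sketch:

from osgeo import gdal, osr

path = "/path/2A.GPM.Ku.V9-20240130.20251123-S214125-E231437.066614.V07C.HDF5"
src = gdal.Open(f'HDF5:"{path}"://FS/SLV/precipRateNearSurface')

srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)

# Point GDAL at the 2-D lat/lon arrays so the warper can use them as geolocation arrays.
src.SetMetadata({
    "X_DATASET": f'HDF5:"{path}"://FS/Longitude',
    "X_BAND": "1",
    "Y_DATASET": f'HDF5:"{path}"://FS/Latitude',
    "Y_BAND": "1",
    "PIXEL_OFFSET": "0",
    "LINE_OFFSET": "0",
    "PIXEL_STEP": "1",
    "LINE_STEP": "1",
    "SRS": srs.ExportToWkt(),
}, "GEOLOCATION")

# Resample the swath onto a regular lat/lon grid using those geolocation arrays.
gdal.Warp("gpm_regridded.tif", src, geoloc=True, dstSRS="EPSG:4326",
          xRes=0.05, yRes=0.05, dstNodata=-9999)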

I am working with GPM DPR Ku-band data; here is what I tried to do. When you open this data using Xarray with open_dataset, you get an empty dataset because the data has groups inside it. You can see that structure when you open it with a DataTree.

Here is what I tried: I stacked multiple GPM files into time × ray × (various third-dimension profiles), with lat/lon stored as variables instead of dims and coords.

So instead of reading multiple HDF files, we can directly use this single file, and getting a data subset across time and lat/lon becomes very easy.
Problems: as you can see, the third dimension is not of consistent shape, so we can either assign a new dim for each variable or just pad it with NaNs to the maximum size (here it’s 176), roughly as sketched below.
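
Roughly what I mean by the padding, as a sketch over a couple of variables; the 176 is the largest vertical dimension in the listing above, and the glob pattern is just an example:

import glob
import h5py
import numpy as np
import xarray as xr

MAX_LEVELS = 176   # largest third dimension seen in the file listing

def load_granule(path):
    """Open one granule, pad shorter profile variables to MAX_LEVELS with NaN."""
    with h5py.File(path, "r") as f:
        z = f["FS"]["SLV"]["zFactorFinal"][:]          # (scan, ray, 176)
        pia = f["FS"]["VER"]["piaNP"][:]               # (scan, ray, 4) -> padded
        pia = np.pad(pia, ((0, 0), (0, 0), (0, MAX_LEVELS - pia.shape[2])),
                     constant_values=np.nan)
        return xr.Dataset(
            {
                "zFactorFinal": (("scan", "ray", "level"), z),
                "piaNP": (("scan", "ray", "level"), pia),
            },
            coords={
                "lat": (("scan", "ray"), f["FS"]["Latitude"][:]),
                "lon": (("scan", "ray"), f["FS"]["Longitude"][:]),
            },
            attrs={"piaNP_valid_levels": 4},   # record how much of the padded axis is real
        )

files = sorted(glob.glob("/path/2A.GPM.Ku.*.HDF5"))
cube = xr.concat([load_granule(p) for p in files], dim="scan")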
I don’t know yet if I am going in the right direction, so correct me if I am wrong.