I have been exploring the use of VirtualiZarr and Icechunk to manage large satellite datasets efficiently using reference-based storage. I successfully created a virtual dataset for geostationary satellite imagery (INSAT) using VirtualiZarr and stored its references using Icechunk. This allowed multi-file geostationary HDF5 data to be read through Xarray without converting the source files into Zarr.
Applying the same workflow to polar-orbiting GPM DPR Ku Precipitation Profile (2A, Version 7) revealed several additional challenges:
- The dataset does not use a fixed grid.
- Each scan line has different latitude/longitude values, so the spatial grid varies for every scan.
- The HDF5 structure is deeply nested, for example under
/FS/VERENV,/FS/ScanTime, etc. - When the dataset is opened directly using Xarray, the result appears empty because the data variables are not at the root level:
import xarray as xr
ds = xr.open_dataset(
"/home/path/2A-ENV.GPM.Ku.HDF5"
)
print(ds)
Output:
<xarray.Dataset>
Dimensions: ()
Data variables: *empty*
To view the data, a specific HDF5 group has to be manually provided, for example:
xr.open_dataset(path, group="FS/CSF")
Structural Characteristics of the Dataset
The GPM file contains irregular dimensions such as:
CODE I USED:
import h5py
path = "/path/2A.GPM.Ku.V9-20240130.20251123-S214125-E231437.066614.V07C.HDF5"
with h5py.File(path, "r") as f:
def explore(name, obj):
if isinstance(obj, h5py.Dataset):
print(f" [DATASET] {name} -> shape={obj.shape}, dtype={obj.dtype}")
elif isinstance(obj, h5py.Group):
print(f"[GROUP] {name}")
f.visititems(explore)
year = f['FS']['ScanTime']['Year'][:]
month = f['FS']['ScanTime']['Month'][:]
day = f['FS']['ScanTime']['DayOfMonth'][:]
hour = f['FS']['ScanTime']['Hour'][:]
minute = f['FS']['ScanTime']['Minute'][:]
second = f['FS']['ScanTime']['Second'][:]
millisecond = f['FS']['ScanTime']['MilliSecond'][:]
print("Year:", year[:5])
print("Month:", month[:5])
print("Day:", day[:5])
print("Hour:", hour[:5])
print("Minute:", minute[:5])
print("Second:", second[:5])
print("Millisecond:", millisecond[:5])
gpm_lat = f['FS']['Latitude'][:]
gpm_lon = f['FS']['Longitude'][:]
print("Latitude sample:", gpm_lat[0, :5])
print("Longitude sample:", gpm_lon[0, :5])
precipRateNearSurface = f['FS']['SLV']['precipRateNearSurface'][:]
print("precipRateNearSurface sample:", precipRateNearSurface[0, :5])
OUTPUT:
[DATASET] AlgorithmRuntimeInfo -> shape=(1,), dtype=|S1173
[GROUP] FS
[GROUP] FS/CSF
[DATASET] FS/CSF/binBBBottom -> shape=(7991, 49), dtype=int16
[DATASET] FS/CSF/binBBPeak -> shape=(7991, 49), dtype=int16
[DATASET] FS/CSF/binBBTop -> shape=(7991, 49), dtype=int16
[DATASET] FS/CSF/binHeavyIcePrecipBottom -> shape=(7991, 49), dtype=int16
[DATASET] FS/CSF/binHeavyIcePrecipTop -> shape=(7991, 49), dtype=int16
[DATASET] FS/CSF/flagAnvil -> shape=(7991, 49), dtype=int8
[DATASET] FS/CSF/flagBB -> shape=(7991, 49), dtype=int32
[DATASET] FS/CSF/flagHeavyIcePrecip -> shape=(7991, 49), dtype=int8
[DATASET] FS/CSF/flagShallowRain -> shape=(7991, 49), dtype=int32
[DATASET] FS/CSF/heightBB -> shape=(7991, 49), dtype=float32
[DATASET] FS/CSF/nHeavyIcePrecip -> shape=(7991, 49), dtype=uint8
[DATASET] FS/CSF/qualityBB -> shape=(7991, 49), dtype=int32
[DATASET] FS/CSF/qualityTypePrecip -> shape=(7991, 49), dtype=int32
[DATASET] FS/CSF/typePrecip -> shape=(7991, 49), dtype=int32
[DATASET] FS/CSF/widthBB -> shape=(7991, 49), dtype=float32
[GROUP] FS/DSD
[DATASET] FS/DSD/binNode -> shape=(7991, 49, 5), dtype=int16
[DATASET] FS/DSD/paramRDm -> shape=(7991, 49, 5), dtype=float32
[DATASET] FS/DSD/phase -> shape=(7991, 49, 176), dtype=uint8
[GROUP] FS/Experimental
[DATASET] FS/Experimental/precipRateESurface2 -> shape=(7991, 49), dtype=float32
[DATASET] FS/Experimental/precipRateESurface2Status -> shape=(7991, 49), dtype=uint8
[DATASET] FS/Experimental/seaIceConcentration -> shape=(7991, 49), dtype=float32
[DATASET] FS/Experimental/sigmaZeroProfile -> shape=(7991, 49, 7), dtype=float32
[GROUP] FS/FLG
[DATASET] FS/FLG/flagEcho -> shape=(7991, 49, 176), dtype=int8
[DATASET] FS/FLG/flagScanPattern -> shape=(7991,), dtype=int16
[DATASET] FS/FLG/flagSensor -> shape=(7991,), dtype=int8
[DATASET] FS/FLG/qualityData -> shape=(7991, 49), dtype=int32
[DATASET] FS/FLG/qualityFlag -> shape=(7991, 49), dtype=int8
[DATASET] FS/Latitude -> shape=(7991, 49), dtype=float32
[DATASET] FS/Longitude -> shape=(7991, 49), dtype=float32
[GROUP] FS/PRE
[DATASET] FS/PRE/adjustFactor -> shape=(7991, 49), dtype=float32
[DATASET] FS/PRE/binClutterFreeBottom -> shape=(7991, 49), dtype=int16
[DATASET] FS/PRE/binMirrorImageL2 -> shape=(7991, 49), dtype=int16
[DATASET] FS/PRE/binRealSurface -> shape=(7991, 49), dtype=int16
[DATASET] FS/PRE/binStormTop -> shape=(7991, 49), dtype=int16
[DATASET] FS/PRE/echoCountRealSurface -> shape=(7991, 49), dtype=uint8
[DATASET] FS/PRE/elevation -> shape=(7991, 49), dtype=float32
[DATASET] FS/PRE/ellipsoidBinOffset -> shape=(7991, 49), dtype=float32
[DATASET] FS/PRE/flagPrecip -> shape=(7991, 49), dtype=int32
[DATASET] FS/PRE/flagSigmaZeroSaturation -> shape=(7991, 49), dtype=uint8
[DATASET] FS/PRE/height -> shape=(7991, 49, 176), dtype=float32
[DATASET] FS/PRE/heightStormTop -> shape=(7991, 49), dtype=float32
[DATASET] FS/PRE/landSurfaceType -> shape=(7991, 49), dtype=int32
[DATASET] FS/PRE/localZenithAngle -> shape=(7991, 49), dtype=float32
[DATASET] FS/PRE/sigmaZeroMeasured -> shape=(7991, 49), dtype=float32
[DATASET] FS/PRE/snRatioAtRealSurface -> shape=(7991, 49), dtype=float32
[DATASET] FS/PRE/snowIceCover -> shape=(7991, 49), dtype=int8
[DATASET] FS/PRE/zFactorMeasured -> shape=(7991, 49, 176), dtype=float32
[GROUP] FS/SLV
[DATASET] FS/SLV/binEchoBottom -> shape=(7991, 49), dtype=int16
[DATASET] FS/SLV/epsilon -> shape=(7991, 49, 176), dtype=float32
[DATASET] FS/SLV/flagSLV -> shape=(7991, 49, 176), dtype=int8
[DATASET] FS/SLV/paramDSD -> shape=(7991, 49, 176, 2), dtype=float32
[DATASET] FS/SLV/paramNUBF -> shape=(7991, 49, 3), dtype=float32
[DATASET] FS/SLV/phaseNearSurface -> shape=(7991, 49), dtype=uint8
[DATASET] FS/SLV/piaFinal -> shape=(7991, 49), dtype=float32
[DATASET] FS/SLV/piaOffset -> shape=(7991, 49), dtype=float32
[DATASET] FS/SLV/precipRate -> shape=(7991, 49, 176), dtype=float32
[DATASET] FS/SLV/precipRateAve24 -> shape=(7991, 49), dtype=float32
[DATASET] FS/SLV/precipRateESurface -> shape=(7991, 49), dtype=float32
[DATASET] FS/SLV/precipRateNearSurface -> shape=(7991, 49), dtype=float32
[DATASET] FS/SLV/precipWater -> shape=(7991, 49, 176), dtype=float32
[DATASET] FS/SLV/precipWaterIntegrated -> shape=(7991, 49, 2), dtype=float32
[DATASET] FS/SLV/qualitySLV -> shape=(7991, 49), dtype=int32
[DATASET] FS/SLV/sigmaZeroCorrected -> shape=(7991, 49), dtype=float32
[DATASET] FS/SLV/zFactorFinal -> shape=(7991, 49, 176), dtype=float32
[DATASET] FS/SLV/zFactorFinalESurface -> shape=(7991, 49), dtype=float32
[DATASET] FS/SLV/zFactorFinalNearSurface -> shape=(7991, 49), dtype=float32
[GROUP] FS/SRT
[DATASET] FS/SRT/PIAalt -> shape=(7991, 49, 6), dtype=float32
[DATASET] FS/SRT/PIAhb -> shape=(7991, 49), dtype=float32
[DATASET] FS/SRT/PIAhybrid -> shape=(7991, 49), dtype=float32
[DATASET] FS/SRT/PIAweight -> shape=(7991, 49, 6), dtype=float32
[DATASET] FS/SRT/PIAweightHY -> shape=(7991, 49, 2), dtype=float32
[DATASET] FS/SRT/RFactorAlt -> shape=(7991, 49, 6), dtype=float32
[DATASET] FS/SRT/pathAtten -> shape=(7991, 49), dtype=float32
[DATASET] FS/SRT/refScanID -> shape=(7991, 49, 2, 2), dtype=int16
[DATASET] FS/SRT/reliabFactor -> shape=(7991, 49), dtype=float32
[DATASET] FS/SRT/reliabFactorHY -> shape=(7991, 49), dtype=float32
[DATASET] FS/SRT/reliabFlag -> shape=(7991, 49), dtype=int16
[DATASET] FS/SRT/reliabFlagHY -> shape=(7991, 49), dtype=int16
[DATASET] FS/SRT/stddevEff -> shape=(7991, 49, 3), dtype=float32
[DATASET] FS/SRT/stddevHY -> shape=(7991, 49), dtype=float32
[DATASET] FS/SRT/zeta -> shape=(7991, 49), dtype=float32
[GROUP] FS/ScanTime
[DATASET] FS/ScanTime/DayOfMonth -> shape=(7991,), dtype=int8
[DATASET] FS/ScanTime/DayOfYear -> shape=(7991,), dtype=int16
[DATASET] FS/ScanTime/Hour -> shape=(7991,), dtype=int8
[DATASET] FS/ScanTime/MilliSecond -> shape=(7991,), dtype=int16
[DATASET] FS/ScanTime/Minute -> shape=(7991,), dtype=int8
[DATASET] FS/ScanTime/Month -> shape=(7991,), dtype=int8
[DATASET] FS/ScanTime/Second -> shape=(7991,), dtype=int8
[DATASET] FS/ScanTime/SecondOfDay -> shape=(7991,), dtype=float64
[DATASET] FS/ScanTime/Year -> shape=(7991,), dtype=int16
[GROUP] FS/VER
[DATASET] FS/VER/airTemperature -> shape=(7991, 49, 176), dtype=float32
[DATASET] FS/VER/attenuationNP -> shape=(7991, 49, 176), dtype=float32
[DATASET] FS/VER/binZeroDeg -> shape=(7991, 49), dtype=int16
[DATASET] FS/VER/binZeroDegSecondary -> shape=(7991, 49), dtype=int16
[DATASET] FS/VER/flagInversion -> shape=(7991, 49), dtype=int16
[DATASET] FS/VER/heightZeroDeg -> shape=(7991, 49), dtype=float32
[DATASET] FS/VER/piaNP -> shape=(7991, 49, 4), dtype=float32
[DATASET] FS/VER/piaNPrainFree -> shape=(7991, 49, 4), dtype=float32
[DATASET] FS/VER/sigmaZeroNPCorrected -> shape=(7991, 49), dtype=float32
[GROUP] FS/navigation
[DATASET] FS/navigation/dprAlt -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/greenHourAng -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scAlt -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scAttPitchGeoc -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scAttPitchGeod -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scAttRollGeoc -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scAttRollGeod -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scAttYawGeoc -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scAttYawGeod -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scHeadingGround -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scHeadingOrbital -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scLat -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scLon -> shape=(7991,), dtype=float32
[DATASET] FS/navigation/scPos -> shape=(7991, 3), dtype=float32
[DATASET] FS/navigation/scVel -> shape=(7991, 3), dtype=float32
[DATASET] FS/navigation/timeMidScan -> shape=(7991,), dtype=float64
[DATASET] FS/navigation/timeMidScanOffset -> shape=(7991,), dtype=float64
[GROUP] FS/scanStatus
[DATASET] FS/scanStatus/FractionalGranuleNumber -> shape=(7991,), dtype=float64
[DATASET] FS/scanStatus/SCorientation -> shape=(7991,), dtype=int16
[DATASET] FS/scanStatus/acsModeMidScan -> shape=(7991,), dtype=int8
[DATASET] FS/scanStatus/dataQuality -> shape=(7991,), dtype=int8
[DATASET] FS/scanStatus/dataWarning -> shape=(7991,), dtype=int8
[DATASET] FS/scanStatus/geoError -> shape=(7991,), dtype=int16
[DATASET] FS/scanStatus/geoWarning -> shape=(7991,), dtype=int16
[DATASET] FS/scanStatus/limitErrorFlag -> shape=(7991,), dtype=int8
[DATASET] FS/scanStatus/missing -> shape=(7991,), dtype=int8
[DATASET] FS/scanStatus/modeStatus -> shape=(7991,), dtype=int8
[DATASET] FS/scanStatus/operationalMode -> shape=(7991,), dtype=int8
[DATASET] FS/scanStatus/pointingStatus -> shape=(7991,), dtype=int16
[DATASET] FS/scanStatus/targetSelectionMidScan -> shape=(7991,), dtype=int8
[DATASET] FS/sunLocalTime -> shape=(7991, 49), dtype=float32
Year: [2025 2025 2025 2025 2025]
Month: [11 11 11 11 11]
Day: [23 23 23 23 23]
Hour: [21 21 21 21 21]
Minute: [41 41 41 41 41]
Second: [26 26 27 28 28]
Millisecond: [ 22 722 422 122 822]
Latitude sample: [-63.857487 -63.912132 -63.96672 -64.020584 -64.07394 ]
Longitude sample: [157.35304 157.35732 157.36151 157.3656 157.36958]
precipRateNearSurface sample: [0. 0. 0. 0. 0.]
There is no static two-dimensional grid. The array shape is based on scan lines:
number of scans Ă— number of rays Ă— vertical levels.
Unlike geostationary data, it is not sufficient to simply wrap all files with VirtualiZarr, because the metadata must first be interpreted to produce a coherent data model.
How can we work with this type of data for building datacubes without actually covnerting the original files?



