Hey, I am working with GPM DPR Ku-band data. When the files are read through xarray, the dimensions come out as phony dims. The structure of the HDF5 files is like this:
Dataset: AlgorithmRuntimeInfo, shape: (1,), dtype: |S1175
Group: FS
Group: FS/CSF
Dataset: FS/CSF/binBBBottom, shape: (7935, 49), dtype: int16
Dataset: FS/CSF/binBBPeak, shape: (7935, 49), dtype: int16
Dataset: FS/CSF/binBBTop, shape: (7935, 49), dtype: int16
Dataset: FS/CSF/binHeavyIcePrecipBottom, shape: (7935, 49), dtype: int16
Dataset: FS/CSF/binHeavyIcePrecipTop, shape: (7935, 49), dtype: int16
Dataset: FS/CSF/flagAnvil, shape: (7935, 49), dtype: int8
Dataset: FS/CSF/flagBB, shape: (7935, 49), dtype: int32
Dataset: FS/CSF/flagHeavyIcePrecip, shape: (7935, 49), dtype: int8
Dataset: FS/CSF/flagShallowRain, shape: (7935, 49), dtype: int32
Dataset: FS/CSF/heightBB, shape: (7935, 49), dtype: float32
Dataset: FS/CSF/nHeavyIcePrecip, shape: (7935, 49), dtype: uint8
Dataset: FS/CSF/qualityBB, shape: (7935, 49), dtype: int32
Dataset: FS/CSF/qualityTypePrecip, shape: (7935, 49), dtype: int32
Dataset: FS/CSF/typePrecip, shape: (7935, 49), dtype: int32
Dataset: FS/CSF/widthBB, shape: (7935, 49), dtype: float32
Group: FS/DSD
Dataset: FS/DSD/binNode, shape: (7935, 49, 5), dtype: int16
Dataset: FS/DSD/paramRDm, shape: (7935, 49, 5), dtype: float32
Dataset: FS/DSD/phase, shape: (7935, 49, 176), dtype: uint8
Group: FS/Experimental
Dataset: FS/Experimental/precipRateESurface2, shape: (7935, 49), dtype: float32
Dataset: FS/Experimental/precipRateESurface2Status, shape: (7935, 49), dtype: uint8
Dataset: FS/Experimental/seaIceConcentration, shape: (7935, 49), dtype: float32
Dataset: FS/Experimental/sigmaZeroProfile, shape: (7935, 49, 7), dtype: float32
Group: FS/FLG
Dataset: FS/FLG/flagEcho, shape: (7935, 49, 176), dtype: int8
Dataset: FS/FLG/flagScanPattern, shape: (7935,), dtype: int16
Dataset: FS/FLG/flagSensor, shape: (7935,), dtype: int8
Dataset: FS/FLG/qualityData, shape: (7935, 49), dtype: int32
Dataset: FS/FLG/qualityFlag, shape: (7935, 49), dtype: int8
Dataset: FS/Latitude, shape: (7935, 49), dtype: float32
Dataset: FS/Longitude, shape: (7935, 49), dtype: float32
Group: FS/PRE
Dataset: FS/PRE/adjustFactor, shape: (7935, 49), dtype: float32
Dataset: FS/PRE/binClutterFreeBottom, shape: (7935, 49), dtype: int16
Dataset: FS/PRE/binMirrorImageL2, shape: (7935, 49), dtype: int16
Dataset: FS/PRE/binRealSurface, shape: (7935, 49), dtype: int16
Dataset: FS/PRE/binStormTop, shape: (7935, 49), dtype: int16
Dataset: FS/PRE/echoCountRealSurface, shape: (7935, 49), dtype: uint8
Dataset: FS/PRE/elevation, shape: (7935, 49), dtype: float32
Dataset: FS/PRE/ellipsoidBinOffset, shape: (7935, 49), dtype: float32
Dataset: FS/PRE/flagPrecip, shape: (7935, 49), dtype: int32
Dataset: FS/PRE/flagSigmaZeroSaturation, shape: (7935, 49), dtype: uint8
Dataset: FS/PRE/height, shape: (7935, 49, 176), dtype: float32
Dataset: FS/PRE/heightStormTop, shape: (7935, 49), dtype: float32
Dataset: FS/PRE/landSurfaceType, shape: (7935, 49), dtype: int32
Dataset: FS/PRE/localZenithAngle, shape: (7935, 49), dtype: float32
Dataset: FS/PRE/sigmaZeroMeasured, shape: (7935, 49), dtype: float32
Dataset: FS/PRE/snRatioAtRealSurface, shape: (7935, 49), dtype: float32
Dataset: FS/PRE/snowIceCover, shape: (7935, 49), dtype: int8
Dataset: FS/PRE/zFactorMeasured, shape: (7935, 49, 176), dtype: float32
Group: FS/SLV
Dataset: FS/SLV/binEchoBottom, shape: (7935, 49), dtype: int16
Dataset: FS/SLV/epsilon, shape: (7935, 49, 176), dtype: float32
Dataset: FS/SLV/flagSLV, shape: (7935, 49, 176), dtype: int8
Dataset: FS/SLV/paramDSD, shape: (7935, 49, 176, 2), dtype: float32
Dataset: FS/SLV/paramNUBF, shape: (7935, 49, 3), dtype: float32
Dataset: FS/SLV/phaseNearSurface, shape: (7935, 49), dtype: uint8
Dataset: FS/SLV/piaFinal, shape: (7935, 49), dtype: float32
Dataset: FS/SLV/piaOffset, shape: (7935, 49), dtype: float32
Dataset: FS/SLV/precipRate, shape: (7935, 49, 176), dtype: float32
Dataset: FS/SLV/precipRateAve24, shape: (7935, 49), dtype: float32
Dataset: FS/SLV/precipRateESurface, shape: (7935, 49), dtype: float32
Dataset: FS/SLV/precipRateNearSurface, shape: (7935, 49), dtype: float32
Dataset: FS/SLV/precipWater, shape: (7935, 49, 176), dtype: float32
Dataset: FS/SLV/precipWaterIntegrated, shape: (7935, 49, 2), dtype: float32
Dataset: FS/SLV/qualitySLV, shape: (7935, 49), dtype: int32
Dataset: FS/SLV/sigmaZeroCorrected, shape: (7935, 49), dtype: float32
Dataset: FS/SLV/zFactorFinal, shape: (7935, 49, 176), dtype: float32
Dataset: FS/SLV/zFactorFinalESurface, shape: (7935, 49), dtype: float32
Dataset: FS/SLV/zFactorFinalNearSurface, shape: (7935, 49), dtype: float32
Group: FS/SRT
Dataset: FS/SRT/PIAalt, shape: (7935, 49, 6), dtype: float32
Dataset: FS/SRT/PIAhb, shape: (7935, 49), dtype: float32
Dataset: FS/SRT/PIAhybrid, shape: (7935, 49), dtype: float32
Dataset: FS/SRT/PIAweight, shape: (7935, 49, 6), dtype: float32
Dataset: FS/SRT/PIAweightHY, shape: (7935, 49, 2), dtype: float32
Dataset: FS/SRT/RFactorAlt, shape: (7935, 49, 6), dtype: float32
Dataset: FS/SRT/pathAtten, shape: (7935, 49), dtype: float32
Dataset: FS/SRT/refScanID, shape: (7935, 49, 2, 2), dtype: int16
Dataset: FS/SRT/reliabFactor, shape: (7935, 49), dtype: float32
Dataset: FS/SRT/reliabFactorHY, shape: (7935, 49), dtype: float32
Dataset: FS/SRT/reliabFlag, shape: (7935, 49), dtype: int16
Dataset: FS/SRT/reliabFlagHY, shape: (7935, 49), dtype: int16
Dataset: FS/SRT/stddevEff, shape: (7935, 49, 3), dtype: float32
Dataset: FS/SRT/stddevHY, shape: (7935, 49), dtype: float32
Dataset: FS/SRT/zeta, shape: (7935, 49), dtype: float32
Group: FS/ScanTime
Dataset: FS/ScanTime/DayOfMonth, shape: (7935,), dtype: int8
Dataset: FS/ScanTime/DayOfYear, shape: (7935,), dtype: int16
Dataset: FS/ScanTime/Hour, shape: (7935,), dtype: int8
Dataset: FS/ScanTime/MilliSecond, shape: (7935,), dtype: int16
Dataset: FS/ScanTime/Minute, shape: (7935,), dtype: int8
Dataset: FS/ScanTime/Month, shape: (7935,), dtype: int8
Dataset: FS/ScanTime/Second, shape: (7935,), dtype: int8
Dataset: FS/ScanTime/SecondOfDay, shape: (7935,), dtype: float64
Dataset: FS/ScanTime/Year, shape: (7935,), dtype: int16
Group: FS/VER
Dataset: FS/VER/airTemperature, shape: (7935, 49, 176), dtype: float32
Dataset: FS/VER/attenuationNP, shape: (7935, 49, 176), dtype: float32
Dataset: FS/VER/binZeroDeg, shape: (7935, 49), dtype: int16
Dataset: FS/VER/binZeroDegSecondary, shape: (7935, 49), dtype: int16
Dataset: FS/VER/flagInversion, shape: (7935, 49), dtype: int16
Dataset: FS/VER/heightZeroDeg, shape: (7935, 49), dtype: float32
Dataset: FS/VER/piaNP, shape: (7935, 49, 4), dtype: float32
Dataset: FS/VER/piaNPrainFree, shape: (7935, 49, 4), dtype: float32
Dataset: FS/VER/sigmaZeroNPCorrected, shape: (7935, 49), dtype: float32
Group: FS/navigation
Dataset: FS/navigation/dprAlt, shape: (7935,), dtype: float32
Dataset: FS/navigation/greenHourAng, shape: (7935,), dtype: float32
Dataset: FS/navigation/scAlt, shape: (7935,), dtype: float32
Dataset: FS/navigation/scAttPitchGeoc, shape: (7935,), dtype: float32
Dataset: FS/navigation/scAttPitchGeod, shape: (7935,), dtype: float32
Dataset: FS/navigation/scAttRollGeoc, shape: (7935,), dtype: float32
Dataset: FS/navigation/scAttRollGeod, shape: (7935,), dtype: float32
Dataset: FS/navigation/scAttYawGeoc, shape: (7935,), dtype: float32
Dataset: FS/navigation/scAttYawGeod, shape: (7935,), dtype: float32
Dataset: FS/navigation/scHeadingGround, shape: (7935,), dtype: float32
Dataset: FS/navigation/scHeadingOrbital, shape: (7935,), dtype: float32
Dataset: FS/navigation/scLat, shape: (7935,), dtype: float32
Dataset: FS/navigation/scLon, shape: (7935,), dtype: float32
Dataset: FS/navigation/scPos, shape: (7935, 3), dtype: float32
Dataset: FS/navigation/scVel, shape: (7935, 3), dtype: float32
Dataset: FS/navigation/timeMidScan, shape: (7935,), dtype: float64
Dataset: FS/navigation/timeMidScanOffset, shape: (7935,), dtype: float64
Group: FS/scanStatus
Dataset: FS/scanStatus/FractionalGranuleNumber, shape: (7935,), dtype: float64
Dataset: FS/scanStatus/SCorientation, shape: (7935,), dtype: int16
Dataset: FS/scanStatus/acsModeMidScan, shape: (7935,), dtype: int8
Dataset: FS/scanStatus/dataQuality, shape: (7935,), dtype: int8
Dataset: FS/scanStatus/dataWarning, shape: (7935,), dtype: int8
Dataset: FS/scanStatus/geoError, shape: (7935,), dtype: int16
Dataset: FS/scanStatus/geoWarning, shape: (7935,), dtype: int16
Dataset: FS/scanStatus/limitErrorFlag, shape: (7935,), dtype: int8
Dataset: FS/scanStatus/missing, shape: (7935,), dtype: int8
Dataset: FS/scanStatus/modeStatus, shape: (7935,), dtype: int8
Dataset: FS/scanStatus/operationalMode, shape: (7935,), dtype: int8
Dataset: FS/scanStatus/pointingStatus, shape: (7935,), dtype: int16
Dataset: FS/scanStatus/targetSelectionMidScan, shape: (7935,), dtype: int8
Dataset: FS/sunLocalTime, shape: (7935, 49), dtype: float32
Because this is swath data, the latitude and longitude values change with each orbit, so there’s no consistent grid. To handle this, I assign dimensions based on the swath structure: time (which differs for each file; for example, one file may have 7,935 scans while another has 7,544), scan_width (which is consistently 49), and an additional third dimension for the various data variables. Since the latitude and longitude arrays change for every swath, I treat them as regular variables rather than coordinates.
My workflow is:
- Read each HDF file (sorted by time, using timestamps extracted from the filenames).
- Extract the variables.
- Assign the appropriate dimensions.
- Store everything in a Zarr dataset.
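For the sorting step, the start time can be read straight from the filename without opening the file. This is a minimal sketch, assuming every granule follows the `.YYYYMMDD-SHHMMSS-EHHMMSS.` naming seen in the logs further down; `granule_start` is a hypothetical helper name, not part of any library:

```python
import re
from datetime import datetime

# Assumed pattern: ".<YYYYMMDD>-S<HHMMSS>-E<HHMMSS>." as in the 2A Ku filenames in this post.
_GRANULE_RE = re.compile(r"\.(\d{8})-S(\d{6})-E(\d{6})\.")


def granule_start(filename):
    """Return the granule start time parsed from the filename (hypothetical helper)."""
    m = _GRANULE_RE.search(filename)
    if m is None:
        raise ValueError(f"unrecognised GPM filename: {filename}")
    date, start, _end = m.groups()
    return datetime.strptime(date + start, "%Y%m%d%H%M%S")


names = [
    "2A.GPM.Ku.V9-20211125.20210601-S061040-E074314.041234.V07A.HDF5",
    "2A.GPM.Ku.V9-20211125.20210601-S043805-E061039.041233.V07A.HDF5",
]
names.sort(key=granule_start)  # chronological order, no HDF5 reads needed
```

Sorting this way only touches the directory listing, so it avoids one file open per granule before the real processing starts.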
The issue is performance: each file currently takes more than 150 seconds to process, which seems unusually slow. I also attempted downloading files directly from GES DISC using Python, but even that takes around 30 minutes per file, which is far from ideal.
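For scale, the uncompressed size of one granule can be estimated from the shapes listed above. This is a rough sketch that counts only the variables with a 176-bin axis in that listing; the names `BIG_VARS` and `nbytes` are just for illustration:

```python
from math import prod

F32 = (7935, 49, 176)  # shape shared by most range-bin variables in the listing

# (shape, bytes per element), copied from the structure dump above.
BIG_VARS = {
    "FS/DSD/phase": (F32, 1),
    "FS/FLG/flagEcho": (F32, 1),
    "FS/SLV/flagSLV": (F32, 1),
    "FS/PRE/height": (F32, 4),
    "FS/PRE/zFactorMeasured": (F32, 4),
    "FS/SLV/epsilon": (F32, 4),
    "FS/SLV/precipRate": (F32, 4),
    "FS/SLV/precipWater": (F32, 4),
    "FS/SLV/zFactorFinal": (F32, 4),
    "FS/VER/airTemperature": (F32, 4),
    "FS/VER/attenuationNP": (F32, 4),
    "FS/SLV/paramDSD": ((7935, 49, 176, 2), 4),
}


def nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape and element size."""
    return prod(shape) * itemsize


total = sum(nbytes(shape, size) for shape, size in BIG_VARS.values())
print(f"{total / 2**30:.2f} GiB in {len(BIG_VARS)} variables")
```

That comes to roughly 2.7 GiB per 7,935-scan granule before compression, so these twelve variables account for most of what is read and written for each file.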
import os

import h5py
import pandas as pd
import xarray as xr

FOLDERS = [
    r"D:\Naveen\GPM_DPR\GPM_2AKu_India_2021_Monsoon",
    r"D:\Naveen\GPM_DPR\GPM_2AKu_India_2022_Monsoon",
]

OUTDIR = r"F:\gpmdatacube"
os.makedirs(OUTDIR, exist_ok=True)

ZARR = os.path.join(OUTDIR, "gpmfullFS.zarr")
BAD_LOG = os.path.join(OUTDIR, "bad_files.log")

# Collect (first scan time, path) for every readable file; log the ones that fail.
files = []
for folder in FOLDERS:
    for f in os.listdir(folder):
        if not f.endswith(".HDF5"):
            continue
        fp = os.path.join(folder, f)
        try:
            with h5py.File(fp, "r") as h:
                t = pd.Timestamp(
                    int(h["FS/ScanTime/Year"][0]),
                    int(h["FS/ScanTime/Month"][0]),
                    int(h["FS/ScanTime/DayOfMonth"][0]),
                    int(h["FS/ScanTime/Hour"][0]),
                    int(h["FS/ScanTime/Minute"][0]),
                    int(h["FS/ScanTime/Second"][0]),
                )
            files.append((t, fp))
        except Exception as e:
            with open(BAD_LOG, "a") as log:
                log.write(f"{fp}\n{e}\n\n")

# Sort chronologically and keep only the paths.
files.sort(key=lambda x: x[0])
files = [f for _, f in files]
print(f"\nTotal valid files: {len(files)}")

# Dimension names guessed from the axis length.
DIM_BY_LEN = {
    49: "nray",
    176: "nbin",
    5: "nNode",
    4: "nNP",
    3: "nsdew",
    2: "two",
    6: "method",
    7: "nbinSZP",
}

# Variables with named extra axes that the length lookup cannot identify.
SPECIAL = {
    "refScanID": ("foreBack", "nearFar"),
    "paramDSD": ("nDSD",),
    "paramNUBF": ("nNUBF",),
    "precipWaterIntegrated": ("LS",),
    "scPos": ("XYZ",),
    "scVel": ("XYZ",),
}
def load_file(fp):
    with h5py.File(fp, "r") as h:
        # One timestamp per scan, built from the FS/ScanTime fields.
        scan_time = pd.to_datetime(dict(
            year=h["FS/ScanTime/Year"][()],
            month=h["FS/ScanTime/Month"][()],
            day=h["FS/ScanTime/DayOfMonth"][()],
            hour=h["FS/ScanTime/Hour"][()],
            minute=h["FS/ScanTime/Minute"][()],
            second=h["FS/ScanTime/Second"][()],
        ))
        data_vars = {}

        def visit(name, obj):
            if not isinstance(obj, h5py.Dataset):
                return
            if not name.startswith("FS/"):
                return
            if "AlgorithmRuntimeInfo" in name:
                return
            # Name the axes after nscan by their length first.
            dims = [DIM_BY_LEN.get(L, f"dim{L}") for L in obj.shape[1:]]
            # The SPECIAL names describe the trailing axes, so overwrite from the
            # end; matching them from the front would rename nray instead.
            repl = SPECIAL.get(name.split("/")[-1])
            if repl and len(repl) <= len(dims):
                dims[len(dims) - len(repl):] = repl
            data_vars[name.replace("/", "_")] = xr.DataArray(
                obj[()], dims=("nscan", *dims)
            )

        h.visititems(visit)
    return xr.Dataset(data_vars).assign_coords(nscan=("nscan", scan_time))

import time

first = True
t_global = time.perf_counter()

for i, fp in enumerate(files, 1):
    t_file = time.perf_counter()
    print(f"\n📥 {i}/{len(files)} → {os.path.basename(fp)}")

    # -------- READ --------
    t_read0 = time.perf_counter()
    try:
        ds = load_file(fp)
    except Exception as e:
        print("❌ skipped:", e)
        continue
    t_read = time.perf_counter() - t_read0

    # -------- WRITE --------
    t_write0 = time.perf_counter()
    if first:
        ds.to_zarr(ZARR, mode="w")
        first = False
        action = "base write"
    else:
        ds.to_zarr(ZARR, mode="a", append_dim="nscan")
        action = "append"
    t_write = time.perf_counter() - t_write0

    # -------- STATS --------
    t_total = time.perf_counter() - t_file
    elapsed = time.perf_counter() - t_global
    avg = elapsed / i
    eta = avg * (len(files) - i) / 3600
    print(
        f"read={t_read:6.2f}s | "
        f"write={t_write:6.2f}s | "
        f"total={t_total:6.2f}s | "
        f"avg={avg:6.2f}s | "
        f"ETA={eta:5.2f} h | {action}"
    )

Here are some initial logs:
1/1035 → 2A.GPM.Ku.V9-20211125.20210601-S043805-E061039.041233.V07A.HDF5
read= 7.66s | write= 74.02s | total= 81.67s | avg= 81.67s | ETA=23.46 h | base write
2/1035 → 2A.GPM.Ku.V9-20211125.20210601-S061040-E074314.041234.V07A.HDF5
read= 8.07s | write= 65.56s | total= 73.63s | avg= 77.65s | ETA=22.28 h | append
3/1035 → 2A.GPM.Ku.V9-20211125.20210601-S135334-E152608.041239.V07A.HDF5
read= 52.33s | write= 65.41s | total=117.74s | avg= 91.01s | ETA=26.09 h | append
4/1035 → 2A.GPM.Ku.V9-20211125.20210601-S152609-E165843.041240.V07A.HDF5
read= 53.60s | write= 85.36s | total=138.96s | avg=103.00s | ETA=29.50 h | append
5/1035 → 2A.GPM.Ku.V9-20211125.20210602-S034649-E051923.041248.V07A.HDF5
read= 56.00s | write= 81.67s | total=137.67s | avg=109.93s | ETA=31.45 h | append
6/1035 → 2A.GPM.Ku.V9-20211125.20210602-S051924-E065158.041249.V07A.HDF5
read= 55.10s | write= 81.70s | total=136.80s | avg=114.41s | ETA=32.70 h | append
...
137/1035 → 2A.GPM.Ku.V9-20211125.20210703-S045549-E062820.041731.V07A.HDF5
read= 50.34s | write=150.47s | total=200.81s | avg=170.65s | ETA=42.57 h | append
I think VirtualiZarr will not work in this case, since we have to process the data first. Correct me if I am wrong.