AquaScope: one Python schema over 18 water-data collectors (12 agencies), plus hydrology analysis. Looking for feedback on the xarray/interop story

Hi all, I’m a postdoc working on water resources, and I’ve been building AquaScope, an MIT-licensed Python toolkit that does two things. First, it pulls from 18 collectors across 12 water-data agencies (USGS, FAO AQUASTAT, FAO WaPOR, GEMStat, EU Water Framework Directive, Copernicus ERA5, Taiwan/Japan/Korea agencies, UN SDG 6, OpenMeteo, US Water Quality Portal, and more) behind a single Python schema. Second, it layers hydrology and ag-water analysis on top (Bulletin 17C flood frequency, baseflow separation, 22 hydrological signatures, FAO-56 crop water).

Repo: Rekin226/aquascope on GitHub

Docs: AquaScope

The problem that started it: every water project re-implements the same data clients. Each service has its own auth, rate limits, units, parameter codes, and pagination, and reconciling them by hand eats the first week of any study. AquaScope maps them all into one schema so downstream analysis doesn’t care where a series came from. The toolkit has 525 tests, and validation against a 10-catchment CAMELS subset runs in CI. As one concrete check, the stationary Log-Pearson III fit on USGS gauge 01646500 (Potomac at Little Falls) puts the 100-year flood at 443,000 cfs vs the FEMA DC value of 475,000 cfs (-6.7%), and all four return periods fall within ±10% (Q-Q plot and full table are in the repo’s demo notebooks).
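To make the “one schema” idea concrete, here is a toy sketch of the kind of normalization involved. It is not AquaScope’s actual models or API: the `Observation` class, both adapter functions, and the two payload layouts are invented for illustration, and a plain dataclass stands in for the Pydantic model so the snippet runs on the standard library alone.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# 1 ft^3 = 0.3048^3 m^3 exactly, so cfs -> m3/s is a fixed factor.
CFS_TO_M3S = 0.3048 ** 3


@dataclass(frozen=True)
class Observation:
    """Hypothetical unified record (stand-in for a Pydantic model)."""
    source: str
    station_id: str
    time: datetime   # always timezone-aware UTC
    variable: str    # one shared vocabulary, e.g. "discharge"
    value: float
    unit: str        # one unit per variable, e.g. "m3/s"


def from_source_a(row: dict) -> Observation:
    """Invented layout: ISO timestamps, string values, discharge in cfs.

    "00060" is the USGS parameter code for discharge in cfs; the row
    layout itself is made up for this example.
    """
    if row["code"] != "00060":
        raise ValueError(f"unmapped parameter code: {row['code']}")
    return Observation(
        source="source_a",
        station_id=row["site"],
        # Rewrite a trailing "Z" so fromisoformat also works before Python 3.11.
        time=datetime.fromisoformat(row["dt"].replace("Z", "+00:00")),
        variable="discharge",
        value=float(row["val"]) * CFS_TO_M3S,
        unit="m3/s",
    )


def from_source_b(row: dict) -> Observation:
    """Invented layout: epoch seconds, discharge already in m3/s."""
    return Observation(
        source="source_b",
        station_id=str(row["station"]),
        time=datetime.fromtimestamp(row["epoch"], tz=timezone.utc),
        variable="discharge",
        value=float(row["q_m3s"]),
        unit="m3/s",
    )


a = from_source_a(
    {"site": "01646500", "code": "00060", "dt": "2024-01-01T00:00:00Z", "val": "1000"}
)
b = from_source_b({"station": 42, "epoch": 1704067200, "q_m3s": 12.5})
```

After the adapters run, `a` and `b` share the same timestamp type, variable name, and unit, which is the property downstream analysis relies on.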

Where I want this community’s read, and where I know I’m not yet aligned with the Pangeo stack:

Right now every collector returns records in a unified Pydantic schema. That works cleanly for the point and station sources (gauges, water-quality samples), but it is not xarray/zarr-native. For the gridded sources (ERA5, WaPOR raster) a Pydantic row model is clearly the wrong container, and even for station data many of you would rather get an xarray.Dataset or a tidy DataFrame straight away. There’s a conversion path in the docs, but it’s a bridge, not the native output.
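For the station case, the kind of bridge I mean looks roughly like the sketch below. It is illustrative only and is not the conversion code from the docs: `Record` and `records_to_dataset` are made-up names, a dataclass stands in for the Pydantic model, and it assumes each variable arrives in a single unit. It goes from row records to a tidy DataFrame and then to an `xarray.Dataset` with `station_id` and `time` dimensions.

```python
from dataclasses import asdict, dataclass
from datetime import datetime

import pandas as pd
import xarray as xr


@dataclass(frozen=True)
class Record:
    """Hypothetical unified row (stand-in for the Pydantic schema)."""
    station_id: str
    time: datetime
    variable: str
    value: float
    unit: str


def records_to_dataset(records: list[Record]) -> xr.Dataset:
    """Pivot long-format records into a (station_id, time) Dataset."""
    df = pd.DataFrame([asdict(r) for r in records])  # tidy / long format

    # Assumption: one unit per variable, so the first one seen is enough.
    units = df.groupby("variable")["unit"].first().to_dict()

    # One column per variable, indexed by station and time.
    wide = df.pivot_table(
        index=["station_id", "time"], columns="variable", values="value"
    )
    wide.columns.name = None

    # The MultiIndex levels become dimensions; missing pairs become NaN.
    ds = xr.Dataset.from_dataframe(wide)
    for name, unit in units.items():
        ds[name].attrs["units"] = unit
    return ds


records = [
    Record("A", datetime(2024, 1, 1), "discharge", 10.0, "m3/s"),
    Record("A", datetime(2024, 1, 2), "discharge", 12.0, "m3/s"),
    Record("B", datetime(2024, 1, 1), "discharge", 3.0, "m3/s"),
    Record("A", datetime(2024, 1, 1), "water_temperature", 4.5, "degC"),
]
ds = records_to_dataset(records)
```

The cost of this shape is visible even in the toy data: stations with different sampling times get padded with NaN on a shared `time` axis, which is part of why I’m unsure a dense Dataset is the right default for sparse station data.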

So, genuine questions:

1. For multi-source water data, would you expect the unified layer to emit xarray.Dataset for gridded sources and a DataFrame for tabular ones, rather than one Pydantic schema for everything? Is “one schema to rule them all” the wrong instinct here?

2. Is there prior art in the Pangeo world for a thin, source-agnostic access layer that I should be building on or contributing to (intake drivers, a STAC-based approach) instead of reinventing?

3. For station/gauge time series specifically, what’s the container you actually want to receive? CF-conventions xarray, a tidy DataFrame, something else?

I’d rather refactor toward what this community already uses than grow a parallel ecosystem. Honest pushback very welcome, especially on the interop design. Thanks for reading.