The National Water Model Reanalysis Zarr dataset on AWS

The National Water Model Reanalysis v2.0 is a 26-year simulation of 2.7 million rivers in the US at hourly intervals. The data were delivered to AWS as part of the NOAA Big Data Program as 227,000+ hourly NetCDF files.

I downloaded (!) and then converted the streamflow files from the reanalysis into a single Zarr dataset with chunks of length 100 along the time dimension to facilitate extraction of time series. I used rechunker, and to deal with potential input data problems, I looped through the data in month-long chunks, writing and then appending to Zarr at the end of every month. This way I could correct issues with the input data (missing data and bad time stamps), then try again, and on success, append the chunk. See the full notebook for details on the conversion.
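Stripped down, the month-by-month append pattern looks roughly like this (a sketch only: the file path pattern, date range, and target store name are placeholders, and the rechunker call and error handling live in the full notebook):

from glob import glob

import pandas as pd
import xarray as xr

zarr_store = 'nwm_reanalysis_streamflow.zarr'   # placeholder target store

# Walk the simulation one month at a time so input problems (missing files,
# bad time stamps) can be fixed and the month retried before it is appended.
for i, month in enumerate(pd.date_range('1993-01', '2018-12', freq='MS')):
    files = sorted(glob(f'nwm_netcdf/{month:%Y%m}*.CHRTOUT_DOMAIN1.comp'))  # placeholder pattern
    ds = xr.open_mfdataset(files, combine='by_coords', parallel=True)[['streamflow']]

    # The real conversion uses rechunker here to get chunks of 100 along time;
    # this sketch only shows the create/append pattern.
    if i == 0:
        ds.to_zarr(zarr_store, mode='w')            # first month: create the store
    else:
        ds.to_zarr(zarr_store, append_dim='time')   # later months: append along time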

The result is a single Zarr dataset on AWS that can be used for time series extraction as well as mapping.

Here’s proof: A sample analysis notebook using this new Zarr dataset.

In this notebook we use 20 workers on a Qhub Dask Gateway cluster to both extract time series and compute the annual mean river discharge for a specific year, in less than 2 minutes of wall clock time.
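For context, the pattern in the notebook looks roughly like this (a sketch: the Gateway options, Zarr store URL, reach IDs, and year are illustrative placeholders, not the exact values used):

import fsspec
import xarray as xr
from dask_gateway import Gateway

# Spin up 20 workers on the Dask Gateway cluster (options depend on the deployment)
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(20)
client = cluster.get_client()

# Open the rechunked reanalysis Zarr store (placeholder URL)
ds = xr.open_zarr(fsspec.get_mapper('s3://noaa-nwm-retro-v2-zarr-pds', anon=True),
                  consolidated=True)

# Time series extraction: a handful of reaches over the full record (IDs illustrative)
series = ds['streamflow'].sel(feature_id=[101, 179, 181]).compute()

# Map-style reduction: annual mean discharge over all reaches for one year
annual_mean = ds['streamflow'].sel(time='2017').mean('time').compute()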


Rich, thanks for leading that effort on using Zarr.
If users want to extract and download specific rivers (e.g. certain variables from rivers within a specified time period and bounding box), what would be the recommended procedure?

@LloydBalfour-NOAA, welcome to the Pangeo community and that’s a great question!

Of course we advocate for people to work on Cloud next to the data (in this case that would be AWS us-west-2), but if folks need to download the data they could run a notebook like this, which extracts just the streamflow variable from rivers in the Gulf of Maine region from 2000-01-01 to the present.
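For reference, the subset in that notebook boils down to something like this (a sketch: the store URL, coordinate names, and bounding-box values are illustrative placeholders; see the notebook for the real ones):

import fsspec
import xarray as xr

# Rechunked NWM reanalysis Zarr store on AWS (placeholder URL)
ds = xr.open_zarr(fsspec.get_mapper('s3://noaa-nwm-retro-v2-zarr-pds', anon=True),
                  consolidated=True)

# Illustrative Gulf of Maine bounding box
lon_min, lon_max, lat_min, lat_max = -71.5, -66.0, 41.5, 46.0

# Boolean mask over the reaches, assuming latitude/longitude coordinates on feature_id
mask = ((ds.longitude >= lon_min) & (ds.longitude <= lon_max) &
        (ds.latitude >= lat_min) & (ds.latitude <= lat_max)).values

# Subset streamflow in space and time, then write locally (this triggers the download)
subset = ds['streamflow'].isel(feature_id=mask).sel(time=slice('2000-01-01', None))
subset.to_netcdf('nwm_gom_streamflow.nc')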

If they want to run a script from the command line, they could do:

python Subset_GOM_local.py 

using this script.

To run the notebook or script using a local python environment:

  1. If conda is not installed, follow these instructions, but don’t create the IOOS environment.
  2. Create a custom conda environment using this environment file: conda env create -f nwm_subset_env.yml

To run the notebook with Docker:

  1. Create your own customized container using the environment above, or just use the pangeo/pangeo-notebook container: docker run --rm -i --tty pangeo/pangeo-notebook bash

Rich,

Fantastic!!!

Appreciate this.

Regards,
Lloyd Balfour Sr.
Data Architect (TESA)
GID (OWP/NWC/NOAA).
205-960-4754.

So cool Rich! Thanks for sharing.

Would be great to get this into a catalog. However, our catalog is kind of broken now as we are transitioning to Pangeo Forge.

cc @cisaacstern

Incredible, @rsignell !

@rabernat, I forget who told me about this API (maybe @TomAugspurger?), but it seems like stac-fastapi (GitHub - stac-utils/stac-fastapi: STAC API implementation with FastAPI) might be one accessible way of rebooting the catalog.


Naive question, @rsignell: do you know a way to get the latitude and longitude for a given feature_id using data published by NODD (e.g. on the Planetary Computer)? I’m trying to plot streamflow, but the NODD data that includes streamflow is indexed only by feature_id, not by latitude and longitude.

import xarray as xr
import fsspec

channel_rt = xr.open_dataset(
    fsspec.open("https://noaanwm.blob.core.windows.net/nwm/nwm.20220104/short_range/nwm.t00z.short_range.channel_rt.f001.conus.nc").open()
)
print(channel_rt)

That outputs:

<xarray.Dataset>
Dimensions:         (time: 1, reference_time: 1, feature_id: 2776738)
Coordinates:
  * time            (time) datetime64[ns] 2022-01-04T01:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-01-04
  * feature_id      (feature_id) int32 101 179 181 ... 1180001803 1180001804
Data variables:
    crs             |S1 ...
    streamflow      (feature_id) float64 ...
    nudge           (feature_id) float64 ...
    velocity        (feature_id) float64 ...
    qSfcLatRunoff   (feature_id) float64 ...
    qBucket         (feature_id) float64 ...
    qBtmVertRunoff  (feature_id) float64 ...

The land file does seem to be a gridded product that does include x and y variables, but it doesn’t include a feature_id.

land = xr.open_dataset(
    fsspec.open("https://noaanwm.blob.core.windows.net/nwm/nwm.20220104/short_range/nwm.t00z.short_range.land.f001.conus.nc").open()
)
print(land)

<xarray.Dataset>
Dimensions:         (time: 1, reference_time: 1, x: 4608, y: 3840)
Coordinates:
  * time            (time) datetime64[ns] 2022-01-04T01:00:00
  * reference_time  (reference_time) datetime64[ns] 2022-01-04
  * x               (x) float64 -2.303e+06 -2.302e+06 ... 2.303e+06 2.304e+06
  * y               (y) float64 -1.92e+06 -1.919e+06 ... 1.918e+06 1.919e+06
Data variables:
    crs             |S1 ...
    SNOWH           (time, y, x) float64 ...
    SNEQV           (time, y, x) float64 ...
    FSNO            (time, y, x) float64 ...
    ACCET           (time, y, x) float64 ...
    SOILSAT_TOP     (time, y, x) float64 ...
    SNOWT_AVG       (time, y, x) float64 ...

Based on channel_rt.feature_id.attrs["comment"], I’m digging through NHDPlusv2 ComIDs, but this seems sufficiently complicated that I figured I’d ask around 🙂

Hmm, strange. The lat/lon locations are in the retrospective datasets.

import fsspec
import xarray as xr

fs = fsspec.filesystem('s3', anon=True)
url = 's3://noaa-nwm-retrospective-2-1-pds/model_output/2020/202001011100.CHRTOUT_DOMAIN1.comp'
ds = xr.open_dataset(fs.open(url), drop_variables='reference_time', chunks={})
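
So for plotting, one possible approach (a sketch, assuming the retrospective CHRTOUT file really does carry latitude and longitude variables indexed by feature_id; check ds.coords and ds.data_vars first) is to grab those coordinates once and attach them to the NODD forecast output (the channel_rt dataset above) by feature_id:

# Assumed variable names; verify they exist in the retrospective file
lat = ds['latitude'].load()
lon = ds['longitude'].load()

# Align the retrospective lat/lon onto the forecast feature_ids (NaN where missing)
# and attach them as coordinates so streamflow can be plotted geographically.
channel_rt = channel_rt.assign_coords(
    latitude=lat.reindex(feature_id=channel_rt['feature_id']),
    longitude=lon.reindex(feature_id=channel_rt['feature_id']),
)

From there the reach points can be plotted directly, e.g. a scatter of longitude/latitude colored by streamflow.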