Best practices for efficient zarr/netcdf queries for a large number (100k+) of lat/lon pairs

Hello everyone,

I’m currently working on a project that involves creating Intensity-Duration-Frequency (IDF) curves for historical flood events. The goal is to understand how ERA5 precipitation data correlates with flood depths from the FEMA National Flood Insurance Program (NFIP) database. To achieve this, I need to query Zarr files for hundreds of thousands of latitude/longitude pairs to extract the relevant data points. Even just pulling the precipitation values for the day of each flood event, using a simple ds.sel(lat=lat, lon=lon), does not finish after multiple hours.

Despite leveraging Dask with around 100 workers to parallelize the querying, I’m finding it challenging to query data for such a large number of points. The sheer volume of queries seems to make the task infeasible with my current approach. I’ve tried going file by file as suggested here: Very slow selection of multiple points in large dataset using xarray, but I’m still having issues querying even a single data file for that many points.

Appreciate any help.


What are lat and lon in your code? A list of lat/lon pairs, apparently, but how do you pass them? If you do not pass them as DataArray objects with a common dimension (e.g. "points") but as lists or arrays, ds.sel(lat=lat, lon=lon) will give you an n×n matrix instead of the 1D data you seem to be looking for.

(just to check the easy things first, for everything else we might need more information, in particular the chunking scheme of the Dataset and maybe also a text repr of ds)
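
For illustration, here is a minimal sketch of the difference, using a toy dataset with assumed variable/coordinate names (tp, lat, lon) rather than the actual ERA5 store:

import numpy as np
import xarray as xr

# Toy stand-in for the ERA5 store; variable and coordinate names are assumptions
ds = xr.Dataset(
    {"tp": (("lat", "lon"), np.random.rand(10, 20))},
    coords={"lat": np.linspace(20, 55, 10), "lon": np.linspace(-125, -65, 20)},
)

lat = np.array([25.0, 30.0, 45.0])
lon = np.array([-100.0, -90.0, -70.0])

# Orthogonal indexing: plain arrays select the full cross-product -> a 3x3 grid
grid = ds.sel(lat=lat, lon=lon, method="nearest")

# Vectorized (pointwise) indexing: DataArrays sharing a "points" dimension
# select one value per pair -> a 1D result of length 3
points = ds.sel(
    lat=xr.DataArray(lat, dims="points"),
    lon=xr.DataArray(lon, dims="points"),
    method="nearest",
)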

The lat and lon values are being pulled from a DataFrame inside a lambda function. I realize now that this may be a slow method if the ds has to be opened each time that function is called?
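
If that is the pattern (a per-row lambda/apply that calls ds.sel for each event), opening the store once and issuing a single vectorized selection for all rows is usually much faster. A hedged sketch, assuming the DataFrame has lat and lon columns and using the same toy dataset names as above:

import numpy as np
import pandas as pd
import xarray as xr

# Same toy stand-in as above; in practice ds is the opened ERA5 store, df the NFIP table
ds = xr.Dataset(
    {"tp": (("lat", "lon"), np.random.rand(10, 20))},
    coords={"lat": np.linspace(20, 55, 10), "lon": np.linspace(-125, -65, 20)},
)
df = pd.DataFrame({"lat": [25.0, 30.0, 45.0], "lon": [-100.0, -90.0, -70.0]})

# Slow pattern: one ds.sel call per row
# df.apply(lambda row: ds["tp"].sel(lat=row.lat, lon=row.lon, method="nearest").item(), axis=1)

# Faster: open the dataset once, build pointwise indexers from the columns, select in one call
lats = xr.DataArray(df["lat"].values, dims="points")
lons = xr.DataArray(df["lon"].values, dims="points")
df["tp"] = ds["tp"].sel(lat=lats, lon=lons, method="nearest").values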

This might help:


Hi @Sam_Cherniak - how is the ERA5 dataset stored? As @keewis mentioned, this will have a big impact. If it’s stored like most public datasets (e.g. a single global slice per time step), you might be fetching the whole slice (i.e. the world) for each lat/lon pair, which would explain why it takes forever even with 100 workers.
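
To check this, inspect how the store is chunked before querying; a minimal sketch (the path and variable name are placeholders):

import xarray as xr

# Open the Zarr store lazily and look at the chunking of the precipitation variable
ds = xr.open_zarr("path/to/era5.zarr")        # placeholder path
print(ds["tp"].encoding.get("chunks"))        # chunk shape written in the store
print(ds["tp"].chunks)                        # chunking of the dask-backed array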

Assuming that the NFIP database mostly covers CONUS, it’s probably easier to download a subset locally and then interpolate. Here’s a sample script that interpolates 100k lat/lon pairs against a year’s worth of data and runs in about a minute on my laptop, with roughly half of that spent downloading (the file is ~1 GB), so it should give you an idea of how fast a query like this could run.

import xarray as xr
import requests
import numpy as np

# Download a year of total precipitation over CONUS as a NetCDF file
r = requests.get('https://api.oikolab.com/weather',
                 params={'param': 'total_precipitation',
                         'start': '2023-01-01',
                         'end': '2023-12-31',
                         'north': 55,
                         'south': 20,
                         'west': -125,
                         'east': -65,
                         'format': 'netcdf',
                         },
                 headers={'api-key': '<api-key-here>',
                          }
                 )

with open('data.nc', 'wb') as f:
    f.write(r.content)

# 100k random lat/lon pairs over CONUS, stacked along a shared "points" dimension
lat = np.random.uniform(20, 55, 100_000)
lon = np.random.uniform(-125, -65, 100_000)
lats = xr.DataArray(lat, dims="points", coords={"points": [str(p) for p in zip(lat, lon)]})
lons = xr.DataArray(lon, dims="points", coords={"points": [str(p) for p in zip(lat, lon)]})

ds = xr.open_dataset('data.nc')
# Pointwise linear interpolation: one value per (lat, lon) pair rather than an n x n grid
subset = ds.interp(latitude=lats, longitude=lons, method='linear')
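
As a follow-up, if each flood event also has a date, the time axis can be selected pointwise in the same way; a hedged sketch continuing from the script above (the event dates below are random placeholders for the NFIP dates, and the time dimension name is an assumption):

# Placeholder event dates, one per point; in practice take them from the NFIP table
event_dates = np.random.choice(ds['time'].values, 100_000)
times = xr.DataArray(event_dates, dims="points")

# Because "points" is shared, this picks one value per (lat, lon, date) triple
per_event = subset.sel(time=times, method="nearest")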

Note: Oikolab is a weather data service that I run for use cases like this. 🙂
