I’m currently working on a project that involves creating Intensity-Duration-Frequency (IDF) curves for historical flood events. The goal is to understand how ERA5 precipitation data correlates with flood depths in the FEMA National Flood Insurance Program (NFIP) database. To achieve this, I need to query Zarr files for hundreds of thousands of latitude/longitude pairs to extract the relevant data points. Even just pulling the precipitation values for the day of each flood event, using a simple ds.sel(lat=lat, lon=lon), does not finish after multiple hours.
Despite leveraging Dask with around 100 workers to parallelize and speed up the querying, I’m finding it challenging to query data for such a large number of points. The sheer volume of queries seems to make the task infeasible with my current approach. I’ve tried going file by file as suggested in “Very slow selection of multiple points in large dataset using xarray”, but I am still having trouble querying even a single data file for that many points.
What are lat and lon in your code? A list of lat/lon pairs, apparently, but how do you pass them? If you pass them as plain lists or arrays rather than as DataArray objects with a common dimension (e.g. "points"), ds.sel(lat=lat, lon=lon) will give you an n×n matrix instead of the 1D point data you seem to be looking for.
(just to check the easy things first, for everything else we might need more information, in particular the chunking scheme of the Dataset and maybe also a text repr of ds)
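A minimal sketch of the difference on a toy dataset (the names here are placeholders, not from your code):

import numpy as np
import xarray as xr

# toy dataset with 1D lat/lon coordinates, standing in for the ERA5 store
ds = xr.Dataset(
    {"tp": (("lat", "lon"), np.random.rand(181, 360))},
    coords={"lat": np.arange(-90, 91), "lon": np.arange(0, 360)},
)

lat = [10, 20, 30]
lon = [100, 110, 120]

# plain lists -> orthogonal indexing: a 3x3 grid, not 3 points
grid = ds.tp.sel(lat=lat, lon=lon)

# DataArrays sharing a "points" dimension -> pointwise (vectorized) selection
points = ds.tp.sel(
    lat=xr.DataArray(lat, dims="points"),
    lon=xr.DataArray(lon, dims="points"),
)

The first result has shape (3, 3); the second has one value per lat/lon pair along the "points" dimension.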
The lat and lon are being pulled from a DataFrame inside a lambda function. I realize now that this may be a slow method if the ds has to be opened each time that function is called?
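Roughly this kind of pattern, in simplified form (the variable and column names here are just placeholders, not my actual code):

import numpy as np
import pandas as pd
import xarray as xr

# toy stand-ins for the ERA5 dataset and the NFIP points table
ds = xr.Dataset(
    {"tp": (("latitude", "longitude"), np.random.rand(181, 360))},
    coords={"latitude": np.arange(-90, 91), "longitude": np.arange(-180, 180)},
)
df = pd.DataFrame({"lat": [29.7, 40.1], "lon": [-95.4, -74.2]})

# one ds.sel call per row -- with hundreds of thousands of rows this means
# hundreds of thousands of tiny queries against the store
df["precip"] = df.apply(
    lambda row: ds["tp"]
    .sel(latitude=row["lat"], longitude=row["lon"], method="nearest")
    .item(),
    axis=1,
)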
Hi @Sam_Cherniak - how is the ERA5 dataset stored? As @keewis mentioned, this will have a big impact. If it’s stored like most public datasets (e.g. a single global slice per time step), you might be fetching the whole slice (i.e. the world) for each lat/lon pair, which would explain why it takes forever even with 100 workers.
Assuming that the NFIP database mostly covers CONUS, it’s probably easier to download a subset locally and then interpolate. Here’s a sample script that runs in about a minute for me on my laptop to interpolate 100k lat/lon pairs for a year’s worth of data, with about half the time spent on downloading (the file is ~1 GB), so this should give you an idea of how fast a query like this could run.
import xarray as xr
import requests
import numpy as np

# Download a year of total precipitation over roughly CONUS (~1 GB)
r = requests.get(
    'https://api.oikolab.com/weather',
    params={
        'param': 'total_precipitation',
        'start': '2023-01-01',
        'end': '2023-12-31',
        'north': 55,
        'south': 20,
        'west': -125,
        'east': -65,
        'format': 'netcdf',
    },
    headers={'api-key': '<api-key-here>'},
)
with open('data.nc', 'wb') as f:
    f.write(r.content)

# 100k random query points within the downloaded domain
lat = np.random.uniform(20, 55, 100_000)
lon = np.random.uniform(-125, -65, 100_000)

# Wrap the points in DataArrays sharing a "points" dimension so the
# interpolation is pointwise rather than an outer product
points = [str(p) for p in zip(lat, lon)]
lats = xr.DataArray(lat, dims="points", coords={"points": points})
lons = xr.DataArray(lon, dims="points", coords={"points": points})

ds = xr.open_dataset('data.nc')
subset = ds.interp(latitude=lats, longitude=lons, method='linear')
Note: Oikolab is a weather data service that I run for use cases like this.