There is a trick to solving the easy part of your problem.
```python
import xarray as xr

# open the dataset lazily, with no dask chunking
ds_all = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True, chunks=None)

# select the day you want; still lazy, but no dask involved
ds_day = ds_all.isel(time=0)

# now do what you want, including chunking, with your small piece of data
```
This is a poorly documented but very useful way to work with data. It’s how my llcbot works.
If you have your own system for parallelization, you could use it here to map the per-day work over many independent tasks.
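For example, here is a minimal sketch of that pattern using Python's standard concurrent.futures. The store path, the process_day helper, the sst variable, and the spatial-mean "analysis" are all placeholders for illustration, not details from your setup:

```python
from concurrent.futures import ProcessPoolExecutor

import xarray as xr

# hypothetical path to the Zarr store in the Google bucket
ZARR_PATH = "gs://my-bucket/my-dataset.zarr"


def process_day(itime):
    """Open the store lazily (no dask), pull out one time step, and reduce it."""
    ds_all = xr.open_zarr(ZARR_PATH, consolidated=True, chunks=None)
    ds_day = ds_all.isel(time=itime)
    # placeholder analysis: a simple spatial mean of one variable
    return float(ds_day["sst"].mean())


if __name__ == "__main__":
    # map the per-day task over many time steps with ordinary worker processes
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_day, range(365)))
    print(results[:5])
```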
As for the harder part: each variable is ~1.3 TB of data, stored as ~10,000 chunks of ~100 MB each along the time axis, and there are 17 variables.
The Dask graph that comes out of `groupby("time.dayofyear").mean().compute()` creates communication patterns that are extremely memory intensive, since it needs to combine data from every single chunk at every single point in space. There has been a lot of recent work in Dask on memory management and task scheduling that has gradually improved this use case, which you can read about here:
However, the bottom line is that it is just a very hard computational problem. There are two approaches you can take:
Throw a ton of memory at it
Your cluster probably needs roughly 10x more memory than the data you are trying to process. So if you have 1 TB of data, you would need about 10 TB of aggregate memory, e.g. a cluster of 100 nodes with 100 GB of RAM each. That might work.
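For instance, here is a sketch of what that brute-force route could look like on a Dask Gateway deployment (e.g. a Pangeo-style hub). The Gateway setup, the worker count, and the store path are assumptions for illustration; per-worker memory is actually set through your deployment's cluster options:

```python
import xarray as xr
from dask_gateway import Gateway  # assumes a Dask Gateway deployment is available

# hypothetical path to the Zarr store
ZARR_PATH = "gs://my-bucket/my-dataset.zarr"

# ask for a big cluster; if each worker has ~100 GB of RAM (set via the
# deployment's cluster options), ~100 workers gives ~10 TB of aggregate memory
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(100)
client = cluster.get_client()

# open with dask this time, keeping the on-disk chunking
ds = xr.open_zarr(ZARR_PATH, consolidated=True)

# the memory-hungry reduction discussed above
climatology = ds.groupby("time.dayofyear").mean()
result = climatology.compute()
```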
Rechunk your data
A similar issue came up in what became one of the most active threads on this forum:
What we ended up doing was creating a new package called rechunker, whose whole job is to scalably alter the chunk structure of big Zarr arrays.
If you rechunk your data to have a contiguous time dimension (no chunks in time) and instead chunk along the spatial dimensions, your problem becomes embarrassingly parallel: each spatial chunk can compute its climatology independently, with no communication between chunks. Then things should move very quickly.
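Here is a minimal sketch of what that could look like with rechunker. The store paths, chunk sizes, memory budget, and the sst variable name are illustrative assumptions, not values from your dataset:

```python
import xarray as xr
from rechunker import rechunk

# hypothetical store paths; in practice these would live in your cloud bucket
SOURCE = "gs://my-bucket/my-dataset.zarr"
TARGET = "gs://my-bucket/my-dataset-spacechunked.zarr"
TEMP = "gs://my-bucket/rechunker-tmp.zarr"

ds = xr.open_zarr(SOURCE, consolidated=True)

# contiguous in time, chunked in space (chunk sizes are illustrative)
target_chunks = {
    "sst": {"time": len(ds.time), "lat": 90, "lon": 90},
    # ...one entry per data variable (and coordinate) you want rechunked
}

plan = rechunk(
    ds,
    target_chunks=target_chunks,
    max_mem="4GB",          # memory budget per task
    target_store=TARGET,
    temp_store=TEMP,
)
plan.execute()              # runs as a (much friendlier) dask computation

# now the climatology is embarrassingly parallel across spatial chunks
ds_rc = xr.open_zarr(TARGET)
climatology = ds_rc.groupby("time.dayofyear").mean().compute()
```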
Hope that’s helpful. Please report back because this sort of problem is very interesting to us.