Zarr ERA5 reading causes huge number of tasks

I’m attempting to calculate some climatologies over an ERA5 dataset stored as a Zarr file in Google Cloud Storage, using xarray/dask. I’m attempting to do this over individual chunks at a time for memory reasons (doing the entire operation at once crashes the machine). Whenever I open the file and look at the task graph, however, dask creates a reading task for every chunk in the file, not just those accessed. Normally this is not a large issue, since when the task gets submitted the graph gets optimized and the extra reads are dropped. This is, however, a large number of tasks (42k in our case) per chunk I process.

However, when I try to iterate over chunks and submit each of the processing steps, this bogs down the scheduler, and the reads also seem to get stuck and capped at one. Additionally, instead of going depth first and finishing up each chunk, it seems to want to read all of the chunks at once before any processing starts, which tends to crash the node. I can generate a task graph if I restrict it to a one-year file, but it has to be plotted small enough that it is fairly hard to use.

This raises the question of the proper way to do what I am doing. Essentially I want to do the following:

import xarray as xr
import gcsfs
import numpy as np
import dask as da
import logging
import warnings
from datetime import datetime

ds_all = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True)
climatology_mean = ds_all.groupby("time.dayofyear").mean().compute()

But this tends to crash. I’ve also tried splitting it up by chunks, which works but can be incredibly slow, as it essentially only uses a single CPU for everything. An example is at https://gist.github.com/josephhardinee/bb1d5cf91b55caa151e0b9096294bd4c

I also tried this with futures, which I think may be the correct way, but either the scheduler gets overwhelmed with the number of tasks (most of which aren’t needed as dependencies for each path down the tree) or one node somehow ends up grabbing all of the tasks. An example is at https://gist.github.com/josephhardinee/5e1b8da4764239a029c16cf4ceaaca8e

Is there something obvious I am doing wrong with these? The Zarr dataset being loaded is 400000 x 721 x 1440, chunked as (100000, 10, 10), if that is helpful.
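
(For reference, that chunk grid is where the ~42k figure comes from:)

import math

# (400000, 721, 1440) array stored in (100000, 10, 10) chunks
n_chunks = math.ceil(400000 / 100000) * math.ceil(721 / 10) * math.ceil(1440 / 10)
print(n_chunks)  # 4 * 73 * 144 = 42048, i.e. one read task per chunk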

The good news

There is a trick to solving the easy part of your problem.

# open the dataset with no dask chunking
ds_all = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True, chunks=False)
# select the day you want, lazy, but no dask involved
ds_day = ds_all.isel(time=0)
# now do what you want, including chunking, with your small piece of data

This is a poorly documented but very useful way to work with data. It’s how my llcbot works.
If you have your own system for parallelization, you could use it here to map over many tasks.
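
For example, here is a minimal sketch of mapping over time steps with a plain process pool; the output filenames, worker count, and one-time-step-per-task granularity are stand-ins for whatever your own system uses:

from concurrent.futures import ProcessPoolExecutor
import xarray as xr

def process_one_step(i):
    # open lazily with no dask chunking, as above
    ds = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True, chunks=False)
    piece = ds.isel(time=i)                # lazy selection, no dask involved
    result = piece.mean()                  # whatever per-piece computation you need
    result.to_netcdf(f"piece_{i:06d}.nc")  # hypothetical output location

with ProcessPoolExecutor(max_workers=8) as pool:
    list(pool.map(process_one_step, range(365)))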

The bad news

I have never seen code like this…

climatology_mean = ds_all.groupby("time.dayofyear").mean().compute()

…work with data that is chunked in time at the scale of ERA5.

I don’t know if your dataset is public, but here is what Pangeo’s ERA5 data looks like:

from intake import open_catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/atmosphere.yaml")
ds = cat['era5_hourly_reanalysis_single_levels_sa'].to_dask()
display(ds)

Each variable is 1.3 TB of data, in ~10,000 x 100 MB chunks along the time axis. There are 17 variables.

The Dask graph that comes out of groupby("time.dayofyear").mean().compute() creates communication patterns that are extremely memory intensive, since it needs to combine data from every single chunk at every single point in space. There has been lots of work in Dask recently on improving memory management and task scheduling that has slowly improved this use case.

However, the bottom line is that it is just a very hard computational problem. There are two approaches you can take:

Throw a ton of memory at it

Your cluster probably needs something like 10x more memory than the data you are trying to process. So if you have 1 TB of data, you would need 10 TB of aggregate memory. You could use a cluster of 100 nodes with 100 GB RAM each. That might work.
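
For example, on a Pangeo-style deployment you might request a cluster like that with dask_gateway; this is only a sketch, and the available options and their names depend on how your deployment is configured:

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 100       # GB per worker; option name varies by deployment
cluster = gateway.new_cluster(options)
cluster.scale(100)                # ~100 workers x 100 GB = ~10 TB aggregate memory
client = cluster.get_client()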

Rechunk your data

A similar issue became the most common thread on this forum.

What we ended up doing is creating a new package called rechunker whose job is just to scalably alter the chunk structure of big Zarr arrays.

If you rechunk your data to have a contiguous time dimension (no chunks in time) but instead use chunks in the spatial dimension, your problem becomes embarrassingly parallel. Then things should move very quickly.
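
As a sketch of what that looks like with rechunker (the variable name "t2m", target chunk sizes, and store paths below are placeholders, not your actual dataset):

import xarray as xr
from rechunker import rechunk

ds = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True)

# contiguous in time, chunked in space (sizes are illustrative only)
target_chunks = {
    "t2m": {"time": len(ds.time), "latitude": 10, "longitude": 10},
    "time": None,        # leave the coordinate arrays alone
    "latitude": None,
    "longitude": None,
}

plan = rechunk(
    ds,
    target_chunks,
    max_mem="2GB",                                      # per-worker memory budget
    target_store="gs://my-bucket/era5_rechunked.zarr",  # hypothetical paths
    temp_store="gs://my-bucket/era5_rechunk_temp.zarr",
)
plan.execute()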


Hope that’s helpful. Please report back because this sort of problem is very interesting to us.

Thanks for the help @rabernat. I’m adding a few more details and maybe getting a hint of what is going wrong in my specific case. I opened a zarr file with just a single variable (and, conveniently, without time chunked). I open the file, grab a single chunk, then do the groupby/mean on that. A notebook that shows this can be found at https://nbviewer.jupyter.org/urls/dl.dropbox.com/s/lts06m3qchri50y/zarr_speed_tests.ipynb
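
Roughly what the notebook does, as a sketch (the store path and dimension names are placeholders):

import xarray as xr

ds = xr.open_zarr(single_variable_zarr_path, consolidated=True)  # hypothetical path

# grab a single 10 x 10 spatial chunk (all of time)
chunk = ds.isel(latitude=slice(0, 10), longitude=slice(0, 10))

# climatology for just that chunk
chunk_climatology = chunk.groupby("time.dayofyear").mean().compute()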

I think the two task graphs generated in that notebook may be key to what is going on. First is the graph for just opening the file and performing the select, shown in step1.svg (Dropbox), though you may have to download the file to be able to zoom in enough. The general structure is fine, but we do see a lot of spare output nodes that are not actually being used. If we add the groupby and mean on top of this we get step2.svg (Dropbox). This has the right topology (i.e. the groupby is isolated to its individual chunk), but we do have a lot of “vestigial” outputs. When handing a lot of these tasks to a scheduler (i.e. wrapping the operation in a function and submitting it as a delayed object), it seems like all of the reads get done before any of the computation. As an example, if I pass 2 chunks into the operation at once I get step3.svg (Dropbox). So it seems like the structure is getting set up appropriately, just with a lot of tasks that could perhaps be pruned. If I submit many of these to a distributed cluster at once, it seems to want to do all of the read operations before it gets to the computation nodes that would let it offload results and intermediate data.

So I guess a question would be how I can prune this graph to automatically get rid of some of the earlier nodes. I know dask has some documentation on this, but it seems mostly tailored to hand-generated graphs rather than automatically generated ones.
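
To make the question concrete, here is the kind of pruning I mean, sketched with dask.optimize on the underlying dask array (result stands in for the per-chunk groupby/mean DataArray from the notebook):

import dask

# `result` is the per-chunk groupby/mean DataArray (placeholder name)
(optimized,) = dask.optimize(result.data)      # applies culling/fusion to the dask array
print(len(dict(optimized.__dask_graph__())))   # far fewer tasks than the unoptimized graph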

As far as your first solution, that may solve the other problem I’ve been running into but had put on the back burner. Most people here are using time series over a few pixels, so our chunking right now is 10 x 10 in lat/lon and cuts ERA5 up into 4 slices in time (~100k points each, except the last, which has a few less). So if I open it with chunking off, then select the first time globally with isel(time=0), my understanding is that it has to parse all lat/lon and a single time chunk for each, which in our case would be roughly 1/4 of the dataset? This part isn’t a huge problem, as we have yearly zarr files as well, which is what I was planning on using, but if there is a shorter way to make that work I would be interested.
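
(A quick back-of-the-envelope on that, using the chunk sizes quoted above:)

import math

# (400000, 721, 1440) array in (100000, 10, 10) chunks: selecting time=0 touches
# the first of the 4 time chunks in every spatial block
spatial_blocks = math.ceil(721 / 10) * math.ceil(1440 / 10)   # 73 * 144 = 10512
values_read = spatial_blocks * (100000 * 10 * 10)             # one time chunk per block
total_values = 400000 * 721 * 1440
print(values_read / total_values)                             # ~0.25, i.e. about a quarter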

@rabernat @josephhardinee, complexities of climatologies and chunking aside, glancing at the notebook, if I’m not mistaken, I don’t think the dask scheduler actually receives all of the tasks reported in Xarray’s HTML repr; it depends on the calculation.

I typically use .visualize(optimize_graph=True), which I thought is what the scheduler actually receives (maybe @TomAugspurger could confirm?). I’ll try to illustrate below with a simple example and public data:

import os
import rioxarray
os.environ['AWS_NO_SIGN_REQUEST']='YES'
cog = 's3://sentinel-s1-rtc-indigo/tiles/RTC/1/IW/12/S/YJ/2016/S1B_20161121_12SYJ_ASC/Gamma0_VV.tif'
da = rioxarray.open_rasterio(cog, chunks=dict(x=1024,y=1024))
da

(Note: 37 tasks are needed to read the entire array.)

But if you just want values from a single dask chunk:

da.isel(x=0,y=0).data.visualize()

We see all 37 tasks even though only one chunk needs to be accessed.

Instead, use optimize_graph=True for a more accurate representation of this computation:

da.isel(x=0,y=0).data.visualize(optimize_graph=True, rankdir='LR')

Yeah, I think that’s correct. (My only hesitation is around the recent changes to avoid materializing low-level graphs on the client and instead ship high-level graphs, but I think that only applies to dataframes right now.) Either way, visualize(optimize_graph=True) is a better representation of what’s sent to the scheduler.

I haven’t read through the posts closely, but I see references to groupby, so it might be worth trying out @dcherian’s dask_groupby: https://github.com/dcherian/dask_groupby.

(100000, 10, 10)

How many years are in a single chunk? If it’s > 1 (larger is better), I bet dask_groupby.xarray.xarray_reduce will work extremely well.

If it is less than one, it won’t work well. For example, it will fail extremely quickly for the Pangeo-style chunking ;). That said, if the Pangeo hourly data had chunksize=24 along time, then the current xarray strategy should be optimal for dayofyear grouping.
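
A hedged sketch of what such a call might look like (the exact signature may differ between dask_groupby versions):

import xarray as xr
import dask_groupby.xarray

ds = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True)

# single-pass grouped reduction over the time axis
climatology_mean = dask_groupby.xarray.xarray_reduce(
    ds, ds.time.dt.dayofyear, func="mean"
).compute()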

Thanks for the poke here @TomAugspurger .

I think I’ve found a better strategy for this kind of time grouping, where the groups repeat at a large distance relative to the chunk size (e.g. dayofyear repeats every 365-ish days, but the data may have only 3 days in a chunk). That kind of thing fails with both dask_groupby and xarray’s current strategy.

Help with generalizing this idea is very welcome!

@josephhardinee I know it doesn’t get to the core of your issue, but you could drop in a for loop over years and iterate over those, creating temp files etc. I imagine dealing with a year at a time will be pretty fast, and hopefully you get the climatology while avoiding compute crashes. I haven’t taken a close look at it, but I feel like xarray-beam takes this approach (xarray-beam/era5_climatology.py at main · google/xarray-beam · GitHub), albeit in parallel. That said, this forum is for how to do big data analysis, so thanks for the question.
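
A minimal sketch of that loop (the yearly store names era5_{year}.zarr and the year range are hypothetical; accumulating running sums and counts gives the same result as a dayofyear mean over the whole record without holding everything in memory):

import xarray as xr

total, n = None, None
for year in range(1979, 2021):                       # whatever years you have
    ds = xr.open_zarr(f"era5_{year}.zarr", consolidated=True)
    grouped = ds.groupby("time.dayofyear")
    s = grouped.sum().compute()                      # one year at a time stays small
    c = grouped.count().compute()
    if total is None:
        total, n = s, c
    else:
        # outer-align so day 366 from leap years is kept, then accumulate
        total, s = xr.align(total, s, join="outer", fill_value=0)
        n, c = xr.align(n, c, join="outer", fill_value=0)
        total = total + s
        n = n + c

climatology_mean = total / n                         # sum of values / count of samples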

Xarray-Beam creates a high-level task graph in Apache Beam, rather than a bunch of individual tasks, so I wouldn’t say it uses the looping approach. That said, it was designed to solve exactly this sort of problem, so I guess it would handle this pretty well once you figure out how to get started.
