I’m trying to calculate some climatologies with xarray/dask over an ERA5 dataset stored as a zarr file in Google Cloud Storage. For memory reasons I’m processing individual chunks one at a time (doing the entire operation at once crashes the machine). However, whenever I open the file and look at the task graph, dask creates a read task for every chunk in the file, not just the chunks actually accessed. Normally this isn’t a large issue, since the graph gets optimized when the task is submitted and the extra reads are dropped, but it is still a large number of tasks (42k in our case) per chunk I process.
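To make that concrete, here is a minimal sketch of what I’m seeing; the variable name `t2m` and the latitude/longitude dimension names are placeholders for whatever is actually in the store:

```python
import dask
import xarray as xr

ds_all = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True)

# Select a single 10x10 spatial chunk ("t2m" and the dim names are placeholders).
one_chunk = ds_all["t2m"].isel(latitude=slice(0, 10), longitude=slice(0, 10))

# The raw graph still carries a read task for every chunk in the store...
print(len(dict(one_chunk.data.__dask_graph__())))

# ...while the optimized graph keeps only the reads this selection needs.
(optimized,) = dask.optimize(one_chunk.data)
print(len(dict(optimized.__dask_graph__())))
```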
However, when I iterate over chunks and submit each of the processing steps, this bogs down the scheduler, and the reads seem to get stuck, capped at one at a time. Additionally, instead of working depth-first and finishing up each chunk, the scheduler seems to want to read all of the chunks at once before any processing starts, which tends to crash the node. I can generate a task graph if I restrict the computation to a one-year file, but it has to be plotted small enough that it’s fairly hard to use.
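For reference, plotting the optimized graph rather than the raw one at least drops the unused reads from the one-year plot; a sketch (the year, filename, and layout direction are arbitrary choices):

```python
import dask

# Restrict to roughly one year before building the graph (year is arbitrary).
subset = ds_all.sel(time=slice("1979-01-01", "1979-12-31"))
clim = subset.groupby("time.dayofyear").mean()

# optimize_graph=True culls the unused read tasks before rendering;
# rankdir is passed through to graphviz and is just a readability tweak.
dask.visualize(clim, filename="climatology_graph.svg",
               optimize_graph=True, rankdir="LR")
```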
This raises the question of what the proper way to do this is. Essentially I want to do the following:
```python
import xarray as xr
import gcsfs
import numpy as np
import dask as da
import logging
import warnings
from datetime import datetime

ds_all = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True)
climatology_mean = ds_all.groupby("time.dayofyear").mean().compute()
```
But this tends to crash. I’ve also tried splitting the work up by chunks, which works but can be incredibly slow, since it essentially only uses a single CPU for everything. An example is at https://gist.github.com/josephhardinee/bb1d5cf91b55caa151e0b9096294bd4c
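In case that gist is unavailable, the pattern is roughly the following; the latitude/longitude dimension names are assumptions on my part, and the step of 10 matches the chunking described at the end:

```python
import xarray as xr

ds_all = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True)

results = []
# Walk the 10x10 spatial chunks one at a time (dimension names assumed).
for lat0 in range(0, ds_all.sizes["latitude"], 10):
    for lon0 in range(0, ds_all.sizes["longitude"], 10):
        block = ds_all.isel(
            latitude=slice(lat0, lat0 + 10),
            longitude=slice(lon0, lon0 + 10),
        )
        # Each compute() call runs to completion before the next starts,
        # so this stays within memory but is essentially serial.
        results.append(block.groupby("time.dayofyear").mean().compute())

climatology_mean = xr.combine_by_coords(results)
```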
I also tried this with futures, which I suspect may be the correct approach, but either the scheduler gets overwhelmed with the number of tasks (most of which aren’t needed as dependencies for each path down the tree) or one node somehow ends up grabbing all of the tasks. An example is at https://gist.github.com/josephhardinee/5e1b8da4764239a029c16cf4ceaaca8e
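Again in case the gist is unavailable, the futures version looks roughly like this; the cluster setup, the dimension names, and the choice of the threaded scheduler inside each task are all assumptions:

```python
import xarray as xr
from dask.distributed import Client

client = Client()  # or Client(scheduler_address) for an existing cluster

def chunk_climatology(lat0, lon0):
    # Re-open the store inside the task so each worker builds only the
    # small graph for its own 10x10 spatial chunk.
    ds = xr.open_zarr(zarr_filename_in_google_bucket, consolidated=True)
    block = ds.isel(latitude=slice(lat0, lat0 + 10),
                    longitude=slice(lon0, lon0 + 10))
    return block.groupby("time.dayofyear").mean().compute(scheduler="threads")

futures = [
    client.submit(chunk_climatology, lat0, lon0)
    for lat0 in range(0, 721, 10)
    for lon0 in range(0, 1440, 10)
]
results = client.gather(futures)
```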
Is there something obvious I’m doing wrong with these? If it’s helpful, the zarr dataset being loaded is 400000 x 721 x 1440 (time x latitude x longitude), chunked as (100000, 10, 10), i.e. 4 x 73 x 144 = 42,048 chunks, which is where the ~42k task count above comes from.
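A quick way to confirm that chunk count from the opened dataset (again, the variable name is a placeholder):

```python
arr = ds_all["t2m"].data  # "t2m" is a placeholder variable name

print(arr.numblocks)    # (4, 73, 144) across (time, latitude, longitude)
print(arr.npartitions)  # 42048 chunks in total
```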