Xarray unable to allocate memory, How to "size up" problem

I’m glad we were able to help you make progress!

I also really take your point that the user experience here is far from easy. An ideal tool would simply “just work” for any data size and any complexity, using the available compute resources to their full potential. That is unfortunately not where Xarray + Dask are today. There is still a lot of work to be done!

I guess the key question is, how easy do we expect it to be? What other tools out there can take 100 GB of data spread over 73 netCDF files (with inconsistent coordinates) and compute the quantity of interest (the maximum daily air temperature anomaly at each location)? The options I see are:

  • Roll your own lower-level data processing code, e.g. manually load each file in a for loop, code up the aggregations yourself, etc. (see the sketch after this list). This seems to be the default fallback you had in mind if Xarray / Dask did not work. If you spend 5 days tweaking an Xarray / Dask calculation to get it to work, this option no longer seems so bad.
  • NCO - did you try it?
  • CDO - did you try it?
  • Xarray Beam (ERA5 climatology example)
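
To make the first option concrete, here is a rough sketch of what the hand-rolled version might look like: a two-pass loop that builds the day-of-year climatology and then the maximum anomaly one file at a time, so only one file's worth of data is ever in memory. The file pattern (`era5_*.nc`), variable name (`t2m`), and the assumption that day boundaries do not straddle files are illustrative, not taken from your actual setup.

```python
import glob

import numpy as np
import xarray as xr

files = sorted(glob.glob("era5_*.nc"))  # hypothetical file names

# Pass 1: accumulate per-day-of-year sums and counts to build the climatology.
clim_sum = clim_count = None
for path in files:
    with xr.open_dataset(path) as ds:
        daily = ds["t2m"].resample(time="1D").max()  # daily max temperature
        g = daily.groupby("time.dayofyear")
        s, n = g.sum("time"), g.count("time")
        if clim_sum is None:
            clim_sum, clim_count = s, n
        else:
            # Outer join so files covering different parts of the year still add up.
            clim_sum, s = xr.align(clim_sum, s, join="outer", fill_value=0)
            clim_count, n = xr.align(clim_count, n, join="outer", fill_value=0)
            clim_sum, clim_count = clim_sum + s, clim_count + n
clim = clim_sum / clim_count  # mean daily max per day of year

# Pass 2: maximum daily anomaly at each location, updated file by file.
max_anom = None
for path in files:
    with xr.open_dataset(path) as ds:
        daily = ds["t2m"].resample(time="1D").max()
        anom = daily.groupby("time.dayofyear") - clim
        file_max = anom.max("time")
        max_anom = file_max if max_anom is None else np.maximum(max_anom, file_max)
```

It works, but you end up re-implementing the alignment, climatology bookkeeping, and reductions that Xarray's groupby machinery gives you for free, which is exactly the tradeoff described in the list above.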

I think this particular problem involving climatologies is especially nasty and has no easy solutions. In the meantime, it’s an excellent use case for benchmarking.
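
For comparison, the idiomatic Xarray + Dask version of the same calculation is only a handful of lines, which is exactly why people reach for it first. The file pattern, variable name, and chunk sizes below are again illustrative assumptions, not your actual configuration:

```python
import xarray as xr

# Open all 73 files as one lazy, Dask-backed dataset. With inconsistent
# coordinates, this step itself may need join/compat tweaks or preprocessing.
ds = xr.open_mfdataset("era5_*.nc", combine="by_coords", chunks={"time": 240})

daily_max = ds["t2m"].resample(time="1D").max()
clim = daily_max.groupby("time.dayofyear").mean("time")  # day-of-year climatology
anom = daily_max.groupby("time.dayofyear") - clim        # anomalies
result = anom.max("time")                                # max anomaly per location

# The groupby steps are where memory pressure tends to appear: each day-of-year
# group touches chunks spread across every file, so many chunks must be held at once.
result.compute()
```

Simple to write, hard to execute at 100 GB: that gap is what makes it such a good benchmarking target.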

My takeaway from many similar interactions over the past few years is that people love the Xarray API (the code you write is really simple and intuitive), but they hate debugging crashing Dask clusters, picking chunks, tweaking worker memory, etc. There have been some big recent improvements to the Dask scheduler, and I’m curious whether you’re running the latest Dask version. But these are not a silver bullet. Exploring alternative execution frameworks is a main focus of the new Working Group for Distributed Array Computing. Hopefully some insights and new approaches will come out of that.
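
If you want to confirm which versions you are actually running (the scheduler improvements only help if the installed Dask is recent enough), something like this prints the whole environment:

```python
import xarray as xr

# Prints the versions of xarray, dask, distributed, netCDF4, etc. for the
# environment the computation actually runs in.
xr.show_versions()
```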
