I think we’re actually very close to this already using Cubed. You can run Cubed today on a single machine utilizing all cores (through threads or processes), and it internally creates a query plan that it optimizes.
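To make that concrete, here’s a minimal sketch of driving Cubed locally (untested - the creation-function and executor details vary a little between Cubed versions, so check the current docs rather than treating this as definitive):

```python
import cubed
import cubed.array_api as xp

# Tell Cubed how much RAM each task may use and where intermediates may go;
# it plans (and optimizes) the computation graph around this budget.
spec = cubed.Spec(work_dir="/tmp/cubed-intermediates", allowed_mem="2GB")

a = xp.ones((20_000, 20_000), chunks=(5_000, 5_000), spec=spec)
b = (a + 1) * 2  # builds the plan lazily; nothing has been computed yet

# Pass an executor here to use all cores - Cubed ships thread- and
# process-based local executors (see its docs for the current names).
result = b.compute()
```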
> By default, it would keep intermediate arrays in RAM; unless this would result in running out of RAM. In which case it’d borrow from Cubed’s approach, and write intermediate data to disk.
Doing the rechunk in-memory is the only part that’s missing. A while ago I raised an issue suggesting how Cubed could be generalized to do exactly that. One way to implement that idea might be to have some kind of special in-memory zarr store, or an implementation of the rechunk primitive that does the rechunking in memory in rust as fast as possible. You could potentially steal whatever algorithm dask uses for rechunks now. This could be a very impactful but still tightly-scoped thing for you to write.
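Just to show the shape of that idea, here’s a hypothetical sketch of the kind of interface such a primitive might expose to Python. None of these names exist anywhere today, and the NumPy body is only a naive stand-in for what a rust implementation (e.g. via PyO3) would do without materializing the whole array, and in parallel.

```python
from typing import Dict, Tuple

import numpy as np

Key = Tuple[int, ...]           # position of a chunk in the chunk grid
Chunks = Dict[Key, np.ndarray]  # a chunked array held entirely in RAM


def rechunk_in_memory(chunks: Chunks, shape, old_chunks, new_chunks) -> Chunks:
    """Naive pure-NumPy stand-in for a rust-backed in-memory rechunk.

    Assembles the source chunks into one array, then re-slices it to the
    target chunking. A real implementation would move data chunk-to-chunk
    without this full materialization, which is where rust would help.
    """
    dtype = next(iter(chunks.values())).dtype
    full = np.empty(shape, dtype=dtype)
    for key, block in chunks.items():
        src = tuple(
            slice(k * c, k * c + b)
            for k, c, b in zip(key, old_chunks, block.shape)
        )
        full[src] = block

    out: Chunks = {}
    grid = tuple(-(-s // c) for s, c in zip(shape, new_chunks))  # ceil(s / c)
    for key in np.ndindex(*grid):
        dst = tuple(
            slice(k * c, min((k + 1) * c, s))
            for k, c, s in zip(key, new_chunks, shape)
        )
        out[key] = full[dst].copy()
    return out
```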
> In which case it’d borrow from Cubed’s approach, and write intermediate data to disk.
This is the same idea as the “memory hierarchy” analogy @tomwhite made in that issue I just linked. Alternatively, it could be implemented by spilling to disk during big rechunks, like dask does now.
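As a rough illustration of that hierarchy only (the store classes here are zarr v2 names; newer zarr versions spell them differently, so treat this as a sketch rather than a recipe): keep an intermediate in an in-memory store while it fits a RAM budget, and fall back to an on-disk store when it doesn’t.

```python
import zarr


def intermediate_store(nbytes_estimate: int, ram_budget: int, work_dir: str):
    """Choose where the intermediate array for a rechunk should live.

    Illustrative only: stay in RAM while the intermediate fits the budget,
    otherwise spill to local disk - the "memory hierarchy" idea.
    """
    if nbytes_estimate <= ram_budget:
        return zarr.MemoryStore()          # fast path: stays in RAM
    return zarr.DirectoryStore(work_dir)   # spill path: backed by local disk
```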
> But I’m guessing that calling Python functions from Rust might kill performance?
If you use the kwarg `dask='allowed'` (this is now badly-named; IIRC the same kwarg works the same way for Cubed too) then the responsibility for mapping the python function over the chunks is handed off to that function itself. In other words, you only call the python function once per `apply_ufunc` call, not once per chunk. This should mean any cost of calling rust from python would amortize with respect to scale.
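A sketch of what that looks like from the xarray side - here `scale` is a stand-in for whatever rust-backed routine you’d expose (the `my_rust_lib` module is made up), and the tutorial dataset is just a convenient chunked input:

```python
import xarray as xr
# import my_rust_lib  # hypothetical PyO3 extension module

def scale(arr, factor):
    # With dask='allowed', `arr` is the whole underlying chunked array
    # (dask or cubed), not a single chunk - mapping over chunks is our job.
    # A rust-backed routine would be called here instead, e.g.
    #   return my_rust_lib.scale(arr, factor)   # hypothetical
    return arr * factor

ds = xr.tutorial.open_dataset("air_temperature").chunk({"time": 500})
result = xr.apply_ufunc(
    scale,
    ds["air"],
    kwargs={"factor": 2.0},
    dask="allowed",  # `scale` is called once, with the whole chunked array
)
```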
> “hiding” the chunking from xarray, by only exposing the Python Array API to xarray?
You can also do this - that’s what Arkouda does - but I suspect it would mean rewriting much of Dask/Cubed’s logic for handling chunked arrays inside your rust library. I suggest starting with the rechunk approach I described above.
Also note that you don’t actually have to understand xarray’s API for alternative chunked array types or other internals in order to try out any of this. The MVP can be written entirely in terms of Cubed’s Array API, `cubed.from_zarr`, and `cubed.rechunk` - the wrapping by xarray isn’t the performance-critical part.
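i.e. the whole MVP could be roughly this small (a sketch, untested - check the exact signatures of `from_zarr`/`rechunk`/`to_zarr` against Cubed’s docs, and adjust paths, chunks, and memory budget to your data):

```python
import cubed

spec = cubed.Spec(work_dir="/tmp/cubed-intermediates", allowed_mem="2GB")

# Open an existing zarr store lazily as a Cubed array...
arr = cubed.from_zarr("/path/to/input.zarr", spec=spec)

# ...ask for a different chunking...
rechunked = cubed.rechunk(arr, (1000, 1000))

# ...and write it back out, which is what actually triggers the computation.
cubed.to_zarr(rechunked, "/path/to/output.zarr")
```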