Thanks for the clarification! Yes that does seem like a crucial difference.
The cubed issue I liked is still relevant though - plugging your algorithm into cubed could be very powerful, especially if it can be efficiently parallelized.
You also might be interested in dask’s algorithm for rechunking in constant memory. (Note I said constant memory, not bounded memory like yours is. IIUC dask completes the whole rechunk using a constant amount of memory irrespective of the number of chunks, but you don’t know what that constant is until you’re running it.)