Ok, this is a very important detail. Data pipelines that don’t change the chunk structure and are suitably local to each chunk are “embarrassingly parallel”. There are many task execution frameworks that would probably handle this well, and it would be great if Dask got better at these very large but relatively simple transformations.
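For concreteness, here is a minimal sketch of what such a chunk-local pipeline looks like with dask.array. The paths and the per-block transform are placeholders I'm assuming for illustration, not part of any existing pipeline:

```python
import dask.array as da

def transform(block):
    # any purely local computation, e.g. a unit conversion
    return block * 2.0

# one task per chunk; no communication between tasks is needed because the
# chunk structure of the output matches the input exactly
arr = da.from_zarr("gs://bucket/group/source_array")
result = arr.map_blocks(transform, dtype=arr.dtype)
result.to_zarr("gs://bucket/group/dest_array", overwrite=True)
```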
This is a choice you can make in your algorithm. In the "push" approach you describe, you would loop over input files, reading each one only once but potentially writing to multiple target chunks; you would then have to worry about write locks or risk data corruption from concurrent writes. In a "pull" approach, you would write each target chunk exactly once, but might have to read the same input files many times. As you said, especially on cloud storage, the repeated concurrent reads should perform just fine. A minimal sketch of the pull strategy is below.
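This single-process sketch assumes the source is already a zarr array (the same logic applies if the source is a set of files); in a real implementation, each loop iteration would become an independent task:

```python
import itertools
import math
import zarr

def pull_rechunk(source, target):
    """Write each target chunk exactly once, reading whatever source
    region overlaps it (possibly re-reading source chunks many times)."""
    assert source.shape == target.shape
    # iterate over the target chunk grid
    grid = [range(math.ceil(s / c)) for s, c in zip(target.shape, target.chunks)]
    for idx in itertools.product(*grid):
        sel = tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(idx, target.chunks, target.shape)
        )
        # zarr maps this region onto the underlying source chunks for us;
        # no write locks are needed because each target chunk has one writer
        target[sel] = source[sel]
```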
For large, “full-shuffle” style rechunking, you’re likely to hit memory limits with an embarrassingly parallel approach. Some additional complexity, either task dependencies or temporary storage, is likely required.
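To make the memory pressure concrete, here is some back-of-the-envelope arithmetic. All of the numbers are assumed for illustration, not figures from any real dataset:

```python
from math import prod

# assumed example: a year of hourly data, float64, ~140 GB total
shape = (8760, 1000, 2000)
itemsize = 8
source_chunks = (1, 1000, 2000)    # one time step per chunk, ~16 MB each
target_chunks = (8760, 100, 200)   # full time series per chunk, ~1.4 GB each

total_bytes = prod(shape) * itemsize                             # ~1.4e11
n_targets = prod(s // c for s, c in zip(shape, target_chunks))   # 100

# pull: every target chunk overlaps every source chunk, so total reads are
# ~100x the array size (~14 TB of I/O)
pull_read_bytes = n_targets * total_bytes

# to read each source chunk only once instead, you would have to keep all
# partially-filled target chunks (essentially the whole 140 GB array) in
# memory at once; hence the need for task dependencies or a temporary store
# with an intermediate chunk scheme
```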
A Proposal
This problem is driving me crazy. For the past year or longer, I have thought about it most nights as I try to fall asleep. The worst part is that it’s not even an interesting problem. It’s a straightforward problem in blocked linear algebra for which an optimal algorithm, given I/O and memory limitations, is likely already known to computer science. Or it’s a problem that a student in the right sub-field would be able to solve very quickly, at least in terms of algorithm design. However, it can perhaps be made interesting by combining it with a modern, asynchronous, serverless parallel execution framework.
We should collaborate with a computer scientist to solve this problem well. The end result should be a simple utility that can rechunk any dataset and, given a scheduler, execute the operation. We will be able to write a paper about how well our algorithm scales, and we will get a very useful tool out of the process.
As a CLI, it might look like:

```
zarr_rechunk --from gs://bucket/group/array1 --to gs://bucket/group/array2 --target-chunks "(-1, 1000, 2000)"
```
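In Python, the same utility might look something like the sketch below. The function name, signature, and the -1 convention for "use the full axis" are all assumptions for illustration, not a real API:

```python
import zarr

def zarr_rechunk(source_path, target_path, target_chunks, scheduler=None):
    """Plan a rechunk from one zarr array to another, then hand the
    resulting tasks to whatever scheduler is available (dask, pywren, ...)."""
    source = zarr.open(source_path, mode="r")
    # interpret -1 as "use the full extent of that axis"
    chunks = tuple(
        s if c == -1 else c for s, c in zip(source.shape, target_chunks)
    )
    target = zarr.open(
        target_path, mode="w",
        shape=source.shape, dtype=source.dtype, chunks=chunks,
    )
    # build one task per target chunk (the pull strategy above), or a
    # two-stage plan through a temporary store when memory limits demand it
    ...
```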
I would love to investigate pywren / numpywren as an execution framework, as it would enable us to do this in a truly serverless way, using AWS Lambda / Cloud Functions.
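As a rough sketch of what that could look like with the basic pywren executor API (assuming a zarr version that can open fsspec-style URLs directly, and with all paths illustrative):

```python
import itertools
import math
import pywren
import zarr

SOURCE = "gs://bucket/group/array1"
TARGET = "gs://bucket/group/array2"   # pre-created with the target chunking

def rechunk_one_chunk(chunk_index):
    # pull strategy: each invocation writes exactly one target chunk,
    # so no coordination between the serverless workers is needed
    source = zarr.open(SOURCE, mode="r")
    target = zarr.open(TARGET, mode="r+")
    sel = tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(chunk_index, target.chunks, target.shape)
    )
    target[sel] = source[sel]

target = zarr.open(TARGET, mode="r+")
grid = [range(math.ceil(s / c)) for s, c in zip(target.shape, target.chunks)]
indices = list(itertools.product(*grid))

pwex = pywren.default_executor()               # each call runs on AWS Lambda
futures = pwex.map(rechunk_one_chunk, indices)
pywren.get_all_results(futures)
```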
If there are any computer scientists out there who find this problem compelling, please help us!