Hi all,
I decided to tackle the array rechunking puzzle in Python that was brought up years ago in this thread. After reading the thread, I discovered the rechunker Python package and figured I’d use it, assuming it was as efficient as reasonably possible. It’s a great package, but after looking through the code and ruminating on my use cases, I decided to try my hand at a different implementation.
I wanted an implementation that would produce a Python generator that could be used on the fly instead of having to save data to disk (and, in the rechunker case, produce intermediate files). Each iteration should yield a tuple of slices (for the target chunk) and the data array associated with that chunk. As with the rechunker package, it lets the user specify an amount of memory to use to make the rechunking process more efficient (i.e. require fewer reads and writes).
I’ve made an implementation with the code located here. My tests indicate that it’s pretty efficient, but I’d appreciate other people’s feedback.
Simple tests for number of reads and writes
source_shape = (30, 30)
source_chunk_shape = (5, 2)
target_chunk_shape = (2, 5)
itemsize = 4
max_mem = 40 * itemsize
To calculate the total number of chunks in an array:
n_chunks = calc_n_chunks(source_shape, source_chunk_shape) # 90
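As a rough check on that number, the count can be sketched in a few lines. This is only an illustration and not the actual calc_n_chunks code; the helper name count_chunks is made up here, and it assumes partial chunks at the array edge are counted like full chunks.

```python
import math

def count_chunks(shape, chunk_shape):
    """Hypothetical helper: number of chunks needed to tile `shape`."""
    # Integer ceiling division per axis, so partial edge chunks are counted.
    return math.prod(-(-s // c) for s, c in zip(shape, chunk_shape))

print(count_chunks((30, 30), (5, 2)))  # 6 * 15 = 90
```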
To determine the ideal read chunk shape (and the max_mem it would need) for going from the source to the target:
ideal_read_chunk_shape = calc_ideal_read_chunk_shape(source_chunk_shape, target_chunk_shape) # (10, 10)
ideal_read_chunk_mem = calc_ideal_read_chunk_mem(ideal_read_chunk_shape, itemsize) # 400
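One way to arrive at those two values is to take the least common multiple of the source and target chunk sizes along each axis, since a block of that shape holds whole source chunks and whole target chunks. The sketch below assumes that is what "ideal" means here; the helper names lcm_chunk_shape and chunk_mem are hypothetical and not taken from the actual code.

```python
import math

def lcm_chunk_shape(source_chunk_shape, target_chunk_shape):
    """Hypothetical helper: per-axis least common multiple of two chunk shapes."""
    return tuple(
        s * t // math.gcd(s, t)
        for s, t in zip(source_chunk_shape, target_chunk_shape)
    )

def chunk_mem(chunk_shape, itemsize):
    """Hypothetical helper: bytes needed to hold one chunk of this shape."""
    return math.prod(chunk_shape) * itemsize

read_shape = lcm_chunk_shape((5, 2), (2, 5))
print(read_shape)                # (10, 10)
print(chunk_mem(read_shape, 4))  # 400
```

With max_mem set to 40 * itemsize = 160 bytes in the test above, the available memory is below this 400-byte figure, so the example exercises the constrained case.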
To calculate the number of reads (which is also the number of writes) using the simple brute-force method, which must iterate over every source chunk that overlaps each target chunk (for reference):
n_reads_simple = calc_n_reads_simple(source_shape, source_chunk_shape, target_chunk_shape) # 324
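The 324 can be reproduced by counting, for every target chunk, how many source chunks it overlaps. Because the chunks form a regular grid, the total is the product over axes of the per-axis overlap counts. This is a sketch for checking the number, not the actual calc_n_reads_simple code; both function names are hypothetical.

```python
def overlaps_along_axis(length, source_chunk, target_chunk):
    """Sum, over target chunks on one axis, of the source chunks each overlaps."""
    total = 0
    for start in range(0, length, target_chunk):
        stop = min(start + target_chunk, length)
        first = start // source_chunk       # index of first overlapping source chunk
        last = (stop - 1) // source_chunk   # index of last overlapping source chunk
        total += last - first + 1
    return total

def count_reads_brute_force(shape, source_chunk_shape, target_chunk_shape):
    """Hypothetical helper: source-chunk reads when each target chunk is built alone."""
    n_reads = 1
    for length, sc, tc in zip(shape, source_chunk_shape, target_chunk_shape):
        n_reads *= overlaps_along_axis(length, sc, tc)
    return n_reads

print(count_reads_brute_force((30, 30), (5, 2), (2, 5)))  # 18 * 18 = 324
```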
To calculate the number of reads and writes using my new algorithm:
n_reads, n_writes = calc_n_reads_rechunker(source_shape, source_chunk_shape, target_chunk_shape, itemsize, max_mem) # 216, 90
The rechunker
The function called “rechunker” takes all of the above parameters in addition to a function/method to extract the array chunks. That function needs to accept a tuple of slices that represents one specific chunk (e.g. (slice(0, 5), slice(0, 2))). As mentioned earlier, rechunker produces a generator that yields the target tuple of slices together with the corresponding array chunk.
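To make the "tuple of slices" convention concrete, here is a small sketch that enumerates the slice tuples for every chunk of an array. It only illustrates the shape of the keys involved and is not the rechunker function itself; the helper name iter_chunk_slices is hypothetical.

```python
from itertools import product

def iter_chunk_slices(shape, chunk_shape):
    """Hypothetical helper: yield one tuple of slices per chunk, last axis fastest."""
    per_axis = [
        # Clip the final slice on each axis so it never runs past the array edge.
        [slice(start, min(start + c, n)) for start in range(0, n, c)]
        for n, c in zip(shape, chunk_shape)
    ]
    yield from product(*per_axis)

first = next(iter_chunk_slices((30, 30), (5, 2)))
print(first)  # (slice(0, 5, None), slice(0, 2, None))
```

If the source were an in-memory NumPy array arr, indexing it as arr[first] would return that chunk, so something like lambda slices: arr[slices] would be one simple example of a chunk-extraction function.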
I was creating these functions as part of another project, but I’m thinking about making this a separate Python package.
Any constructive feedback is welcome.