Another rechunking algorithm

Hi all,

I decided to tackle the array rechunking puzzle in Python that was brought up years ago in this thread. After reading the thread, I discovered the rechunker Python package and figured I’d use it, assuming it was about as efficient as reasonably possible. It’s a great package, but after looking through the code and ruminating on my use cases, I decided to try my hand at a different implementation.

I wanted an implementation that would produce a Python generator that could be used on the fly instead of having to save data to disk (in the rechunker case, intermediate files are written). Each iteration yields a tuple of slices (identifying the target chunk) and the data array associated with that chunk. As with the rechunker package, the user can specify an amount of memory for the process to use, which makes the rechunking more efficient (i.e. requires fewer reads and writes).
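
In other words, the intended usage pattern is roughly the following (a sketch with placeholder names, not the final API):

# Sketch: each iteration yields the target slices plus the block of data for them,
# so the result can be written to any target that supports numpy-style assignment.
for target_slices, data in rechunked_generator:
    target_array[target_slices] = data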

I’ve made an implementation with the code located here. My tests indicate that it’s pretty efficient, but I’d appreciate other people’s feedback.

Simple tests for number of reads and writes

shape = (30, 30)
source_chunk_shape = (5, 2)
target_chunk_shape = (2, 5)
itemsize = 4
max_mem = 40 * itemsize

To calculate the total number of chunks in an array:

n_chunks = calc_n_chunks(shape, source_chunk_shape) # 90
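
For reference, that 90 is just the product of per-axis ceiling divisions:

from math import ceil

# Count the source chunks along each axis and multiply the counts together.
n_chunks_check = 1
for dim, chunk in zip(shape, source_chunk_shape):
    n_chunks_check *= ceil(dim / chunk)
n_chunks_check # 6 * 15 = 90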

To determine the ideal read chunk shape (and the associated max_mem) for going from the source to the target:

ideal_read_chunk_shape = calc_ideal_read_chunk_shape(source_chunk_shape, target_chunk_shape) # (10, 10)
ideal_read_chunk_mem = calc_ideal_read_chunk_mem(ideal_read_chunk_shape, itemsize) # 400
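
For intuition, the per-axis least common multiple of the two chunk shapes reproduces these numbers (a rough sketch, not necessarily exactly what calc_ideal_read_chunk_shape does):

import numpy as np

# lcm(5, 2) = 10 and lcm(2, 5) = 10, so a (10, 10) read aligns with both chunk grids.
ideal_shape = tuple(int(np.lcm(s, t)) for s, t in zip(source_chunk_shape, target_chunk_shape)) # (10, 10)
ideal_mem = int(np.prod(ideal_shape)) * itemsize # 10 * 10 * 4 = 400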

To calculate the number of reads (which is also the number of writes) using the simple brute-force method that iterates over every source chunk overlapping each target chunk (for reference):

n_reads_simple = calc_n_reads_simple(shape, source_chunk_shape, target_chunk_shape) # 324
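
A rough sketch of that brute-force count (simplified, not the exact package code): for every target chunk, count the source chunks it overlaps along each axis and multiply the per-axis counts:

from itertools import product

def count_reads_simple(shape, source_chunks, target_chunks):
    # Sum, over all target chunks, the number of source chunks each one touches.
    total = 0
    starts = [range(0, dim, t) for dim, t in zip(shape, target_chunks)]
    for corner in product(*starts):
        n = 1
        for axis, start in enumerate(corner):
            stop = min(start + target_chunks[axis], shape[axis])
            first = start // source_chunks[axis]
            last = (stop - 1) // source_chunks[axis]
            n *= last - first + 1
        total += n
    return total

count_reads_simple((30, 30), (5, 2), (2, 5)) # 324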

To calculate the number of reads and writes using my new algorithm:

n_reads, n_writes = calc_n_reads_rechunker(shape, source_chunk_shape, target_chunk_shape, itemsize, max_mem) # 216, 90
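
For intuition, the 90 writes match one write per target chunk, which can be checked with the same chunk-counting helper:

calc_n_chunks(shape, target_chunk_shape) # 90 target chunks, matching n_writes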

The rechunker
The function called “rechunker” takes all of the above parameters in addition to a function/method for extracting array chunks. That function must accept a tuple of slices representing one specific chunk (e.g. (slice(0, 5), slice(0, 2))). As mentioned earlier, rechunker returns a generator that yields the target tuple of slices along with the array chunk for that slice.
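
For a plain numpy array, the bound __getitem__ method already satisfies this interface (a minimal sketch; any callable that takes a tuple of slices and returns an array will do):

import numpy as np

arr = np.arange(900).reshape(30, 30)
get_chunk = arr.__getitem__
get_chunk((slice(0, 5), slice(0, 2))) # returns the (5, 2) source chunk as an array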

I was creating these functions as part of another project, but I’m thinking about making this a separate python package.

Any constructive feedback is welcome.


Thanks for sharing this! This seems incredibly similar to my suggestion here: In-memory rechunk · Issue #502 · cubed-dev/cubed · GitHub.

cc @tomwhite


Thanks for sharing.

I wanted an implementation that would produce a python generator that could be used on-the-fly instead of having to save data to disk

I’m not sure if it’s the same, and rechunker isn’t the most maintained these days, but it does have an in-memory Python executor: rechunker/rechunker/executors/python.py at master · pangeo-data/rechunker · GitHub. That could be used with an in-memory Zarr store to do everything in memory.

Hi Tom,

I don’t think that’s the same as what I’m describing. From what I understand, rechunker puts the entire dataset into an in-memory Zarr store, while I only want to use a small fixed amount of memory. That fixed budget is used at each rechunking step while iterating through the chunks; it’s equivalent to the max_mem in rechunker.

I’ll update my original post to (hopefully) be more clear.

Thanks for the clarification! Yes that does seem like a crucial difference.

The cubed issue I linked is still relevant though - plugging your algorithm into cubed could be very powerful, especially if it can be efficiently parallelized.

You also might be interested in dask’s algorithm for rechunking in constant memory. (Note I said constant memory, not bounded memory like yours is. IIUC dask completes the whole rechunk using a constant amount of memory irrespective of the number of chunks, but you don’t know what that constant is until you’re running it.)

@mullenkamp also see this discussion for further context

Thanks Tom! I did see that post a couple of days ago. Based on my own work, if you calculate the ideal chunk shape for going from the source to the target, the rechunking algorithm is actually pretty straightforward. Depending on the chunk shapes, the memory required for that ideal shape might not be too large…but it could be… I have this in my code as well, but I also handle the case where the user’s memory limit can’t fit the ideal chunk shape.

My current implementation requires that the source shape and the target shape are the same (for example, both must be (30, 30)). I’m currently working on a modification that allows the user to select a subset of the source to produce a target shape that is smaller than the source.

Oh, and to reduce the confusion between my algorithm and the rechunker python package, I changed the name of the module to rechunkit :wink: .

I also fixed a small issue:

The function is still called rechunker.

Here’s a simple example using numpy arrays (and the parameters from my earlier messages):

import numpy as np
from math import prod
from rechunkit import rechunker # assuming the function can be imported from the top-level rechunkit module

shape = (30, 30)
source_chunk_shape = (5, 2)
target_chunk_shape = (2, 5)
dtype = np.dtype('int32') # any 4-byte dtype, so itemsize is 4
itemsize = dtype.itemsize # 4
max_mem = 40 * itemsize # This is smaller than what's needed for the ideal chunk size

source = np.arange(1, prod(shape) + 1, dtype=dtype).reshape(shape)
source_get = source.__getitem__

target = np.zeros(shape, dtype=dtype)

for write_chunk, data in rechunker(source_get, shape, dtype, source_chunk_shape, target_chunk_shape, max_mem):
    target[write_chunk] = data

np.all(target == source) # Should be True!




Somehow I managed to update the function to allow the target to be a subset of the source:

Here are a couple simple examples using numpy arrays:

import numpy as np
from math import prod
from rechunkit import rechunker, calc_n_reads_simple, calc_n_reads_rechunker # assuming top-level imports from the rechunkit module

shape = (31, 31)
source_chunk_shape = (5, 2)
target_chunk_shape = (2, 5)
dtype = np.dtype('int32') # any 4-byte dtype, so itemsize is 4
itemsize = dtype.itemsize # 4
max_mem = 40 * itemsize # This is smaller than what's needed for the ideal chunk size

sel = (slice(3, 21), slice(11, 25))

source = np.arange(1, prod(shape) + 1, dtype=dtype).reshape(shape)
source_get = source.__getitem__

## When the source and target are the same shape
target = np.zeros(shape, dtype=dtype)
for write_chunk, data in rechunker(source_get, shape, dtype, source_chunk_shape, target_chunk_shape, max_mem):
    target[write_chunk] = data

np.all(source == target) # Should be True!

n_reads_simple = calc_n_reads_simple(shape, source_chunk_shape, target_chunk_shape) # 361 (with the same number of writes)

n_reads, n_writes = calc_n_reads_rechunker(shape, source_chunk_shape, target_chunk_shape, itemsize, max_mem) # 247, 112

## When the target is a subset of the source
target = np.zeros(shape, dtype=dtype)[sel]
for write_chunk, data in rechunker(source_get, shape, dtype, source_chunk_shape, target_chunk_shape, max_mem, sel):
    target[write_chunk] = data

np.all(source[sel] == target) # Should be True!

n_reads, n_writes = calc_n_reads_rechunker(shape, source_chunk_shape, target_chunk_shape, itemsize, max_mem, sel) # 72, 27 (assuming sel is also passed here so only the subset is counted)

I decided to publish the package on PyPI and make a dedicated repo:

Let me know if anyone has feedback.