Another rechunking algorithm

Hi all,

I decided to tackle the array rechunking puzzle in Python that was brought up years ago in this thread. After reading the thread, I discovered the rechunker Python package and figured I’d use it, assuming it was about as efficient as reasonably possible. It’s a great package, but after looking through the code and ruminating on my use cases, I decided to try my hand at a different implementation.

I wanted an implementation that would produce a Python generator that could be used on the fly instead of having to save data to disk (in the rechunker case, intermediate files are written). Each iteration yields a tuple of slices (identifying the target chunk) and the data array associated with that chunk. As with the rechunker package, the user can specify an amount of memory for the process to use, which makes the rechunking more efficient (i.e. requires fewer reads and writes).
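
In other words, the intended usage pattern is roughly the following (a sketch with placeholder names, not the final API):

# Sketch: each iteration yields the target slices plus the block of data for them,
# so the result can be written to any target that supports numpy-style assignment.
for target_slices, data in rechunked_generator:
    target_array[target_slices] = data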

I’ve made an implementation with the code located here. My tests indicate that it’s pretty efficient, but I’d appreciate other people’s feedback.

Simple tests for number of reads and writes

shape = (30, 30)
source_chunk_shape = (5, 2)
target_chunk_shape = (2, 5)
itemsize = 4
max_mem = 40 * itemsize

To calculate the total number of chunks in an array:

n_chunks = calc_n_chunks(shape, source_chunk_shape) # 90
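
For reference, that 90 is just the product of per-axis ceiling divisions:

from math import ceil

# Count the source chunks along each axis and multiply the counts together.
n_chunks_check = 1
for dim, chunk in zip(shape, source_chunk_shape):
    n_chunks_check *= ceil(dim / chunk)
n_chunks_check # 6 * 15 = 90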

To determine the ideal read chunk shape (and the associated max_mem) for going from the source to the target:

ideal_read_chunk_shape = calc_ideal_read_chunk_shape(source_chunk_shape, target_chunk_shape) # (10, 10)
ideal_read_chunk_mem = calc_ideal_read_chunk_mem(ideal_read_chunk_shape, itemsize) # 400
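
For intuition, the per-axis least common multiple of the two chunk shapes reproduces these numbers (a rough sketch, not necessarily exactly what calc_ideal_read_chunk_shape does):

import numpy as np

# lcm(5, 2) = 10 and lcm(2, 5) = 10, so a (10, 10) read aligns with both chunk grids.
ideal_shape = tuple(int(np.lcm(s, t)) for s, t in zip(source_chunk_shape, target_chunk_shape)) # (10, 10)
ideal_mem = int(np.prod(ideal_shape)) * itemsize # 10 * 10 * 4 = 400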

To calculate the number of reads (which is also the number of writes) using the simple brute-force method that iterates over every source chunk overlapping each target chunk (for reference):

n_reads_simple = calc_n_reads_simple(shape, source_chunk_shape, target_chunk_shape) # 324
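
A rough sketch of that brute-force count (simplified, not the exact package code): for every target chunk, count the source chunks it overlaps along each axis and multiply the per-axis counts:

from itertools import product

def count_reads_simple(shape, source_chunks, target_chunks):
    # Sum, over all target chunks, the number of source chunks each one touches.
    total = 0
    starts = [range(0, dim, t) for dim, t in zip(shape, target_chunks)]
    for corner in product(*starts):
        n = 1
        for axis, start in enumerate(corner):
            stop = min(start + target_chunks[axis], shape[axis])
            first = start // source_chunks[axis]
            last = (stop - 1) // source_chunks[axis]
            n *= last - first + 1
        total += n
    return total

count_reads_simple((30, 30), (5, 2), (2, 5)) # 324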

To calculate the number of reads and writes using my new algorithm:

n_reads, n_writes = calc_n_reads_rechunker(shape, source_chunk_shape, target_chunk_shape, itemsize, max_mem) # 216, 90
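
For intuition, the 90 writes match one write per target chunk, which can be checked with the same chunk-counting helper:

calc_n_chunks(shape, target_chunk_shape) # 90 target chunks, matching n_writes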

The rechunker
The function called “rechunker” takes all of the above parameters in addition to a function/method for extracting array chunks. That function must accept a tuple of slices representing one specific chunk (e.g. (slice(0, 5), slice(0, 2))). As mentioned earlier, rechunker returns a generator that yields the target tuple of slices along with the array chunk for that slice.
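
For a plain numpy array, the bound __getitem__ method already satisfies this interface (a minimal sketch; any callable that takes a tuple of slices and returns an array will do):

import numpy as np

arr = np.arange(900).reshape(30, 30)
get_chunk = arr.__getitem__
get_chunk((slice(0, 5), slice(0, 2))) # returns the (5, 2) source chunk as an array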

I was creating these functions as part of another project, but I’m thinking about making this a separate python package.

Any constructive feedback is welcome.


Thanks for sharing this! This seems incredibly similar to my suggestion here: In-memory rechunk · Issue #502 · cubed-dev/cubed · GitHub.

cc @tomwhite


Thanks for sharing.

I wanted an implementation that would produce a python generator that could be used on-the-fly instead of having to save data to disk

I’m not sure if it’s the same, and rechunker isn’t the most maintained these days, but it does have an in-memory Python executor: rechunker/rechunker/executors/python.py at master · pangeo-data/rechunker · GitHub. That could be used with an in-memory Zarr store to do everything in memory.

Hi Tom,

I don’t think that’s the same as what I’m describing. From what I understand, rechunker puts the entire dataset into an in-memory Zarr store, while I only want to use a small fixed amount of memory. That fixed budget is used at each rechunking step while iterating through the chunks; it’s equivalent to the max_mem in rechunker.

I’ll update my original post to (hopefully) be more clear.

Thanks for the clarification! Yes that does seem like a crucial difference.

The cubed issue I linked is still relevant though - plugging your algorithm into cubed could be very powerful, especially if it can be efficiently parallelized.

You also might be interested in dask’s algorithm for rechunking in constant memory. (Note I said constant memory, not bounded memory like yours is. IIUC dask completes the whole rechunk using a constant amount of memory irrespective of the number of chunks, but you don’t know what that constant is until you’re running it.)

@mullenkamp also see this discussion for further context

Thanks Tom! I did see that post a couple of days ago. Based on my own work, if you calculate the ideal chunk shape for going from the source to the target, the rechunking algorithm is actually pretty straightforward. Depending on the chunk shapes, the memory required for that ideal shape might not be too large…but it could be… I have this in my code as well, but I also handle the case where the user’s memory limit can’t fit the ideal chunk shape.

My current implementation requires that the source shape and the target shape are the same (for example, both must be (30, 30)). I’m currently working on a modification that allows the user to select a subset of the source to produce a target shape that is smaller than the source.

Oh, and to reduce the confusion between my algorithm and the rechunker python package, I changed the name of the module to rechunkit :wink: .

I also fixed a small issue:

The function is still called rechunker.

Here’s a simple example using numpy arrays (and the parameters from my earlier messages):

import numpy as np
from math import prod
from rechunkit import rechunker # assuming the function can be imported from the top-level rechunkit module

shape = (30, 30)
source_chunk_shape = (5, 2)
target_chunk_shape = (2, 5)
dtype = np.dtype('int32') # any 4-byte dtype, so itemsize is 4
itemsize = dtype.itemsize # 4
max_mem = 40 * itemsize # This is smaller than what's needed for the ideal chunk size

source = np.arange(1, prod(shape) + 1, dtype=dtype).reshape(shape)
source_get = source.__getitem__

target = np.zeros(shape, dtype=dtype)

for write_chunk, data in rechunker(source_get, shape, dtype, source_chunk_shape, target_chunk_shape, max_mem):
    target[write_chunk] = data

np.all(target == source) # Should be True!




Somehow I managed to update the function to allow the target to be a subset of the source:

Here are a couple simple examples using numpy arrays:

import numpy as np
from math import prod
from rechunkit import rechunker, calc_n_reads_simple, calc_n_reads_rechunker # assuming top-level imports from the rechunkit module

shape = (31, 31)
source_chunk_shape = (5, 2)
target_chunk_shape = (2, 5)
dtype = np.dtype('int32') # any 4-byte dtype, so itemsize is 4
itemsize = dtype.itemsize # 4
max_mem = 40 * itemsize # This is smaller than what's needed for the ideal chunk size

sel = (slice(3, 21), slice(11, 25))

source = np.arange(1, prod(shape) + 1, dtype=dtype).reshape(shape)
source_get = source.__getitem__

## When the source and target are the same shape
target = np.zeros(shape, dtype=dtype)
for write_chunk, data in rechunker(source_get, shape, dtype, source_chunk_shape, target_chunk_shape, max_mem):
    target[write_chunk] = data

np.all(source == target) # Should be True!

n_reads_simple = calc_n_reads_simple(shape, source_chunk_shape, target_chunk_shape) # 361 (with the same number of writes)

n_reads, n_writes = calc_n_reads_rechunker(shape, source_chunk_shape, target_chunk_shape, itemsize, max_mem) # 247, 112

## When the target is a subset of the source
target = np.zeros(shape, dtype=dtype)[sel]
for write_chunk, data in rechunker(source_get, shape, dtype, source_chunk_shape, target_chunk_shape, max_mem, sel):
    target[write_chunk] = data

np.all(source[sel] == target) # Should be True!

n_reads, n_writes = calc_n_reads_rechunker(shape, source_chunk_shape, target_chunk_shape, itemsize, max_mem, sel) # 72, 27 (assuming sel is also passed here so only the subset is counted)

I decided to publish the package on PyPI and make a dedicated repo:

Let me know if anyone has feedback.