I’m sharing a new project that @tomwhite recently brought to my attention. This should be of great interest to many in our community. It builds on the approach we developed with Rechunker…
Cubed is a distributed N-dimensional array library, implemented in Python, that uses fixed-memory serverless processing and Zarr for storage. Key features:
- Implements the Python Array API standard
- Guaranteed maximum memory usage for standard array functions
- Zarr for storage
- Multiple serverless runtimes: Python, Apache Beam, Lithops
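To give a flavour of the programming model, here is a minimal usage sketch based on my reading of the project README; the exact names of the `Spec` parameters (`work_dir`, `max_mem`) are my recollection and may differ between versions:

```python
import cubed
import cubed.array_api as xp  # Cubed's implementation of the Python Array API standard

# A Spec ties arrays to a working directory for intermediate Zarr storage and
# to a per-task memory budget (parameter names assumed from the README).
spec = cubed.Spec(work_dir="tmp", max_mem=100_000)

a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2), spec=spec)
b = xp.asarray([[1, 1, 1], [1, 1, 1], [1, 1, 1]], chunks=(2, 2), spec=spec)
c = xp.add(a, b)      # lazy: builds a plan of chunked operations

result = c.compute()  # executes the plan, one bounded-memory task per chunk
print(result)
```

As I understand it, because the memory required by each operation can be derived from the chunk sizes up front, a plan that would exceed the budget can be rejected before any work is dispatched.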
Motivation
Managing memory is one of the major challenges in designing and running distributed systems. Computation frameworks like Apache Hadoop’s MapReduce, Apache Spark, Beam, and Dask all provide a general-purpose processing model, which has led to their widespread adoption. Successful use at scale, however, requires the user to carefully configure memory for worker nodes and to understand how work is allocated to workers, which breaks the high-level programming abstraction. A disproportionate amount of time is often spent tuning the memory configuration of a large computation.
A common theme here is that most interesting computations are not embarrassingly parallel, but involve shuffling data between nodes. A lot of engineering effort has been put into optimizing the shuffle in Hadoop, Spark, Beam (Google Dataflow), and to a lesser extent Dask. This has undoubtedly improved performance, but has not made the memory problems go away.
Another approach has started gaining traction in the last few years. Lithops (formerly Pywren) and Rechunker eschew centralized systems like the shuffle, and do everything via serverless cloud services and cloud storage.
Rechunker is interesting because it implements a very targeted use case (rechunking persistent N-dimensional arrays) using only stateless (serverless) operations, with guaranteed memory usage. Even though it can run on systems like Beam and Dask, it deliberately avoids passing array chunks between worker nodes via the shuffle. Instead, all bulk data operations are reads from, or writes to, cloud storage (Zarr in this case). Since chunks are always of a known size, it is possible to tightly control memory usage, thereby avoiding unpredictable memory use at runtime.
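To make the pattern concrete, here is an illustrative sketch (not Rechunker's actual code) of a chunk-at-a-time computation over Zarr; the file names, the 2-D layout, and the squaring operation are placeholders:

```python
import math

import numpy as np
import zarr

# Each task reads one chunk from persistent storage, transforms it, and writes
# one chunk back. No data is shuffled between workers, so each task's peak
# memory is roughly the size of a single chunk, which is known in advance.
# "source.zarr", "dest.zarr", the 2-D shape, and np.square are placeholders.

src = zarr.open("source.zarr", mode="r")
dst = zarr.open("dest.zarr", mode="a",
                shape=src.shape, chunks=src.chunks, dtype=src.dtype)

def process_chunk(i, j):
    """One (potentially serverless) task: read a chunk, square it, write it out."""
    ci, cj = src.chunks
    sel = (slice(i * ci, (i + 1) * ci), slice(j * cj, (j + 1) * cj))
    dst[sel] = np.square(src[sel])  # memory use bounded by one chunk

# In a real deployment each (i, j) pair would be dispatched to its own
# serverless worker (e.g. via Lithops); a local loop keeps the sketch simple.
for i in range(math.ceil(src.shape[0] / src.chunks[0])):
    for j in range(math.ceil(src.shape[1] / src.chunks[1])):
        process_chunk(i, j)
```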
This project is an attempt to go further: implement all distributed array operations using a fixed-memory serverless model.
Design
Cubed is composed of five layers, from the storage layer at the bottom to the Array API layer at the top.
For more, see the full project README.