Why learn Rust? A Pangeo perspective

As someone who primarily works with Python and SQL to manipulate climate data using xarray, I wanted to know how an entry-level climate data engineer might leverage Rust, either immediately or in a couple of years.

I’ve seen Rust mentioned a couple of times in this space as “plug-ins” to speed up operations. I’d like to develop my skills towards contributing to the Pangeo packages and was wondering what sort of appetite exists for the application of Rust.

To boil it down: does Pangeo need Rust? If so, how? What does it offer? Any articles and opinions are welcome.


You can ‘use’ Rust by using Polars.

Building a Rust backend for xarray would be an interesting use case. There is a Rust reader for GRIB.


I wrote up some thoughts on Rust and geospatial here: gadom.ski • Rust and geospatial

Tl;dr: it’s great for improving “solved” problems (e.g. implementing specs) and making tools/apps. It might be less useful for data analysis, where you can continue to use things like Polars and stay in Python. developmentseed/obstore (simple, fast integration with Amazon S3, Google Cloud Storage, Azure Storage, and S3-compliant APIs like Cloudflare R2) is another (newer) example of Python tooling that uses Rust to enhance performance.


Besides gribberish, there are several experimental Rust-based backend engines for xarray now for different file formats, e.g.:

These libraries add an xarray BackendEntrypoint so that you can call xarray.open_dataset(..., engine="name-of-rust-backend") to open those datasets using a Rust-based engine from Python.
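
For example (a minimal sketch; the file path is made up, and “name-of-rust-backend” stands in for whichever engine you have installed):

```python
import xarray as xr

# "name-of-rust-backend" is a placeholder for an installed Rust-backed
# engine (e.g. a GRIB reader); the file path is illustrative
ds = xr.open_dataset("data.grib2", engine="name-of-rust-backend")
print(ds)
```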

There are also some Rust-based I/O libraries/crates for Zarr stores that may be more relevant for climate science:

People are usually excited about Rust because it can (sometimes, not always) speed up performance via more efficient multi-threading/async operations, see e.g. Announcing Icechunk! | Earthmover, where Icechunk is faster at reading Zarr v3 than going through Dask. This isn’t true for everything, though: there are some core libraries written in low-level languages like C/Fortran that have Python/R bindings (e.g. GDAL, PROJ, etc.) that wouldn’t get much of a speedup if rewritten in Rust (though there are other benefits, like memory safety).

That said, for newer libraries (e.g. the Zarr v3 readers above), I’d like to think that it can be good to use Rust instead of C/C++, because it’s relatively easy to create bindings in Python/R/Julia/etc to a core Rust library. The GeoRust organization is doing a good job at implementing many core geospatial operations in Rust, mostly for vector points/lines/polygons, but I’m hoping that raster/n-dimensional datasets get their fair share of efficient Rust tooling once more people become familiar with the language.


Do folks believe there would be any benefits to implementing any parts of xarray’s computational machinery in Rust?

The particular use-case I have in mind is processing large out-of-core chunked datasets (like numerical weather predictions) on a single machine, where we want to make full use of all CPU cores, but also be as CPU-efficient and memory-efficient as possible, have minimal start-up latencies, and interleave IO with compute.

Perhaps this would already be possible by using xarray’s “flexible arrays” (based on NumPy’s “duck arrays”)? i.e. we wouldn’t need to “re-write” any of the core xarray code, instead we’d have new Rust code (with a Python API) that slots neatly into xarray’s backend APIs for flexible arrays and indexing?

Is there any point to this?! :slight_smile:

I’ve heard it said that xarray can be quite slow for some use cases, but I haven’t done any rigorous benchmarking myself. I’ve also heard people say that the last thing the community needs is yet another framework for computation!

So I’m very keen to hear other peoples’ views!

FWIW, I asked Gemini “Are there any specific problems with xarray that could be solved by writing xarray backends in Rust?”. Gemini certainly “thinks” there’s useful work to be done, especially around indexing, alignment, chunked array operations, apply_ufunc, memory management, parallelizing internal loops, and interleaving IO with compute. (But, as we all know, it’s dangerous to fully believe what LLMs tell us!) Here’s Gemini’s 3-page response.

implementing any parts of xarray’s computational machinery in Rust?

i.e. we wouldn’t need to “re-write” any of the core xarray code, instead we’d have new Rust code (with a Python API)

You could do that, but xarray’s flexible arrays mean that you’re basically suggesting rewriting numpy in Rust.

I’ve heard it said that xarray can be quite slow for some use cases

Statements like this aren’t super informative because “xarray” is really xarray + a backend for understanding a format (e.g. h5netcdf/zarr-python) + a way to get at remote chunks (e.g. fsspec) + a DAG (e.g. dask/cubed) + an algorithm represented in the DAG (e.g. flox) + a DAG scheduler (e.g. dask.distributed) + the actual low-level computation (e.g. numpy/cupy).
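
As a rough sketch of where each layer sits in one typical workflow (the package choices, URL, and variable name here are just examples):

```python
import xarray as xr

ds = xr.open_dataset(
    "s3://bucket/data.zarr",  # fsspec/s3fs: getting at the remote chunks
    engine="zarr",            # backend: zarr-python understands the format
    chunks={},                # dask: wraps each variable in a lazy DAG
)
monthly = ds["t2m"].groupby("time.month").mean()  # flox: the groupby algorithm (if installed)
result = monthly.compute()    # dask's scheduler runs the graph; numpy does the low-level math
```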

I’ve also heard people say that the last thing the community needs is yet another framework for computation!

Interesting - IMO there should be several options for computation to promote competition, rather than relying on dask all the way down.

In most cases (certainly in cloud workflows) the actual bottleneck in xarray workflows is IO, not compute, which is why the Rust core of Icechunk can deliver big speedups.

writing xarray backends in Rust?

If you rewrote an xarray backend in Rust, the best you could likely do is approach the performance of using VirtualiZarr with Icechunk, as Icechunk with virtual chunks is effectively a general Rust-powered async backend for reading any file type that can be virtualized. This is focused on object storage, though.

The other place that stuff tends to go wrong is the dask graph. But again that’s not solved by rewriting the numpy functions that dask is calling.

IIUC there is little reason to think that Rust is faster than numpy, because both use low-level languages under the hood.

But having said all that, a Rust array library that was lazy, and hence internally had a query plan that could be optimized, could be much faster. The use case would be for when you don’t want the overhead of dask, but you also don’t want every step in the array computation to be eagerly computed the way it likely would be with numpy. There’s some possibility of crossover here with Cubed’s ability to run on a single machine, adding a model of memory management to allow out-of-core computation. That is similar to the use case you’re describing.

Something you didn’t mention that could be powerful would be implementing any new xarray “Flexible Indexes” (e.g. RangeIndex) in Rust.

As always any effort like this should start with profiling a simple case.


A couple of us have worked on numbagg (GitHub - numbagg/numbagg: Fast N-dimensional aggregation functions with Numba) to do exactly this. It’s wired up to most Xarray reductions and will be used if installed.

What computations are you actually running?

IIUC Numba is not good when you need to write a custom data structure for the computation you want (e.g. t-digest, sliding window median). This is something I’ve wanted to explore with Rust but haven’t had a chance to yet.

EDITS: oops, I missed the “out-of-core” piece; for in-memory data, numbagg is great.

Thank you Tom & Deepak for your quick replies!

Yeah, you said it better than I did! This is exactly what I have in mind (sorry I didn’t articulate myself very well!)

And this is all very much inspired by Dask, Cubed, xarray, and the vision outlined by Tom N & Tom W in their Cubed blog post, where xarray could be thought of like SQL, and the heavy lifting can be delegated to specialised code.

(Although I should say that, right now, I don’t have the capacity to actually implement any of this! I’m just curious. Although, if folks think it could be useful, then I’ll definitely try to free up capacity to work on this…)

I think that what I have in mind is something that exposes a Python Array API, and so can be used in xarray as a duck array. The implementation would be 100% focused on processing out-of-core multi-dimensional arrays efficiently and quickly on a single machine by utilising all CPU cores and submitting concurrent read requests to IO. As you say, Tom, it would internally create, optimise, and execute a query plan. It would interleave IO with compute. By default, it would keep intermediate arrays in RAM, unless this would result in running out of RAM, in which case it’d borrow from Cubed’s approach and write intermediate data to disk.

Please may I ask a potentially very naive question!? I note that the API for alternative xarray chunked array types includes several methods which take Python functions as arguments (e.g. apply_gufunc(func: Callable, ...)). PyO3 allows Rust to call Python functions. But I’m guessing that calling Python functions from Rust might kill performance? So would we achieve better performance by “hiding” the chunking from xarray, by only exposing the Python Array API to xarray? (This “feels” wrong; but I don’t know enough about xarray to know for sure!) I’m guessing this would mean that the new Rust compute module would have to talk directly to IO (e.g. zarrs).

And, having said all this, it might be best to wait for Python’s free-threading to stabilise, so we can easily exploit all CPU cores from a single process, implemented entirely in Python code!

I think we’re actually very close to this already using Cubed. You can already run Cubed today on a single machine utilizing all cores (through threads or processes), and it internally creates a query plan that it optimizes.

By default, it would keep intermediate arrays in RAM, unless this would result in running out of RAM, in which case it’d borrow from Cubed’s approach and write intermediate data to disk.

Doing the rechunk in-memory is the only part that’s missing. A while ago I raised an issue suggesting how Cubed could be generalized to do exactly that. One way to implement that idea might be to have some kind of special in-memory zarr store, or an implementation of the rechunk primitive that did the rechunking in memory in Rust as fast as possible. You could potentially steal whatever algorithm dask uses to do rechunks now. This could be a very impactful but still tightly-scoped thing for you to write.

…in which case it’d borrow from Cubed’s approach and write intermediate data to disk.

This is the same idea as the “memory hierarchy” analogy @tomwhite made in that issue I just linked. It could also alternatively be implemented by spilling to disk during big rechunks like dask now does.

But I’m guessing that calling Python functions from Rust might kill performance?

If you use the kwarg dask='allowed' (this is now badly-named; IIRC the same kwarg works the same way for Cubed too) then the responsibility of mapping the Python function over the chunks is handed off to the function. In other words, you only call the Python function once per apply_ufunc call, not once per chunk. This should mean any cost of calling Rust from Python would amortize with respect to scale.
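
A minimal sketch of the difference, with a trivial doubling function standing in for a Rust-backed one:

```python
import numpy as np
import xarray as xr

def double(arr):
    # with dask="allowed", this receives the whole chunked array in one
    # call, instead of being mapped over each chunk separately
    return arr * 2

da = xr.DataArray(np.arange(6.0), dims="x").chunk({"x": 2})
out = xr.apply_ufunc(double, da, dask="allowed")  # one Python-level call total
```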

“hiding” the chunking from xarray, by only exposing the Python Array API to xarray?

You can also do this - that’s what Arkouda does - but I suspect here it will mean you end up needing to rewrite much of Dask/Cubed’s logic for handling chunked arrays in your Rust library. I suggest starting with the rechunk approach I described above.

Also note that you don’t actually have to understand xarray’s API for alternative chunked array types or other internals in order to try out any of this. The MVP can be written entirely in terms of Cubed’s Array API, cubed.from_zarr, and cubed.rechunk - the wrapping by xarray isn’t the performance-critical part.
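
A hedged sketch of that MVP (paths, chunk sizes, and the memory budget are placeholders; check Cubed’s docs for the exact signatures):

```python
import cubed

spec = cubed.Spec(work_dir="/tmp", allowed_mem="2GB")  # bounded-memory budget
arr = cubed.from_zarr("source.zarr", spec=spec)        # lazy array from a Zarr store
rechunked = cubed.rechunk(arr, (1000, 100, 100))       # the performance-critical step
cubed.to_zarr(rechunked, "target.zarr")                # executes the plan
```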

Ooh, that’s very cool! When running on a single machine, do the different processes share data by writing intermediate arrays to disk? If so, does that suggest that it might be possible to go faster than Cubed?

If Cubed already performs as quickly and as efficiently as the hardware will allow then there’s probably no point in me even thinking about building a multi-dimensional array compute engine in Rust :grinning_face:!

Oooh, I like this idea! I’ve been thinking that rechunking is a good-sized-problem to start with (although I appreciate it’s surprisingly tricky!). Is there a public “backend API” in Cubed that I could hook into? Either way, I might start with a standalone re-chunker, so I can get my head around the problem.

Ooh, that’s nice!

I’ve been wondering if it might be possible to use Apache DataFusion for processing labelled multi-dimensional arrays (including rechunking). Such a project would require quite a lot of custom Rust code to be written. But probably a lot less custom code than starting from scratch.

What is DataFusion? To quote their website:

DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format.

DataFusion features a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and partitioned data sources. You can customize DataFusion at almost all points including additional data sources, query languages, functions, custom operators and more. See the Architecture section for more details.

DataFusion runs on a single machine.

DataFusion is primarily used for tabular data (rather than multi-dimensional arrays). But Arrow defines a Tensor type for multi-dimensional arrays (which is implemented in arrow-rs and PyArrow).
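
For instance, in PyArrow (a quick illustration):

```python
import numpy as np
import pyarrow as pa

# Arrow's Tensor type wrapping an n-dimensional numpy array
tensor = pa.Tensor.from_numpy(np.arange(24.0).reshape(2, 3, 4))
print(tensor.shape, tensor.strides)
```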

I’m by no means certain but I think it should be possible to implement something like an xarray backend in DataFusion. The advantage being that DataFusion already includes a lot of the mechanisms required for optimising queries, and executing queries across all CPU cores, whilst respecting RAM limits.

maximedion2 has started work on a Zarr reader for DataFusion (that will soon use zarrs).

Does this sound even vaguely sane?!? Has anyone else got experience with DataFusion?!

Yes and yes. That’s exactly what I’m suggesting above - do the rechunk in-memory instead, as fast as possible (e.g. using Rust).

Kind of. Currently that API is effectively the zarr store API. If the zarr store is on object storage then the rechunk would be done via S3, and if it’s a LocalStore it would be done via the local filesystem. It would be interesting to see what happens if we simply swap it out for zarr’s MemoryStore, before even invoking Rust. But this is the place to start - I explain all this on the issue I linked above.
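
Roughly this kind of swap, as a sketch (shape, chunks, and dtype are placeholders; the exact import path for MemoryStore differs between zarr-python versions):

```python
import zarr

# keep the rechunk's intermediate array in RAM instead of on disk
store = zarr.storage.MemoryStore()
intermediate = zarr.create(shape=(10_000, 10_000), chunks=(100, 100),
                           dtype="f8", store=store)
intermediate[:100, :100] = 1.0  # chunk writes now hit memory, not the filesystem
```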

Oh interesting, I didn’t know that.

We can dream about this, and in fact you’re not the first person to suggest it, but…

  1. This is a massive project that touches all layers of the stack,
  2. You’re still going to want a fast in-memory rechunking algorithm that knows what to do if you ask it to rechunk more data than it has RAM available,
  3. We could possibly achieve many of the benefits of this with a fraction of the work by plugging that speedy rechunker into Cubed.

You missed our distributed arrays working group meeting yesterday but you should come along on the 31st! :wink:

Rechunking an ERA5-sized dataset (1.5TB) is an interesting problem, which Stephan Hoyer wrote a multi-stage algorithm for. I’ve recently run this using Cubed on AWS Lambda, but it would be interesting to see how it performs on a large multi-core machine.

Another area where there are definite improvements to be made is in reducing the number of buffer copies in reading and writing to Zarr. Tom Augspurger is looking at this in Codec pipeline memory usage · Issue #2904 · zarr-developers/zarr-python · GitHub, and I have been measuring the current state in GitHub - tomwhite/memray-array: Measuring memory usage of array storage operations using memray. Any improvements that could be made here would be applicable to many libraries that use Zarr.


Minor nit: Isn’t that the size of just one ERA5 variable? And the logic is that if we can successfully rechunk that with <<1.5TB of RAM we could rechunk the entirety of ERA5?

Yes, that’s absolutely right. Sorry I wasn’t clearer.


Building an in-memory rechunker in Rust

I really like this idea!

If I can free up a bunch of time (which is looking more likely than not) then this will be the first thing I work on. (Nerd snipe accepted!)

To check I understand correctly: Is the reason that in-memory rechunking is slow in Python (with numpy) because the Python code will have a bunch of nested for loops, where each iteration copies one contiguous sequence of elements from the source chunk(s) to the destination chunk(s)? Looping is slow in Python, and the iterations run sequentially on a single CPU core. So we can get a speedup in Rust because Rust can loop quickly, and by saturating the memory bus using multiple CPU cores at once (Rust can share a single buffer across threads if we promise that each thread will write into non-overlapping regions of that buffer). Is that right?
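
To make that concrete (for my own understanding), here’s a toy 1-D version of the loop I’m imagining, rechunking three chunks of 4 elements into two chunks of 6, where each inner iteration copies one contiguous overlap:

```python
import numpy as np

src_chunks = [np.arange(i, i + 4, dtype="f8") for i in (0, 4, 8)]
dst_chunks = [np.empty(6), np.empty(6)]
for s, chunk in enumerate(src_chunks):
    start = s * 4                                    # global start of this source chunk
    while start < (s + 1) * 4:
        d = start // 6                               # destination chunk holding this position
        run = min((s + 1) * 4, (d + 1) * 6) - start  # length of the contiguous overlap
        dst_chunks[d][start - d * 6 : start - d * 6 + run] = \
            chunk[start - s * 4 : start - s * 4 + run]
        start += run                                 # copies run one after another, on one core
```

In N dimensions this becomes nested loops over every axis, which is presumably where Rust’s fast loops and multi-threaded copies would pay off.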

I could start with a rechunker that is entirely focused on in-memory rechunking on a single machine. I’m guessing this wouldn’t do any IO? Instead, you’d pass it a list of chunks that are already in memory. It’d return a new list of chunks in your requested chunk specification. It’d have a Python API that Cubed could use. Have I got the right idea?
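
i.e. something with a hypothetical Python signature along these lines (all names here are illustrative, not an existing API):

```python
import numpy as np

def rechunk(
    chunks: list[np.ndarray],            # source chunks, already in memory
    source_chunkshape: tuple[int, ...],  # chunk shape of the input
    target_chunkshape: tuple[int, ...],  # requested chunk shape
) -> list[np.ndarray]:
    """Return the same data re-sliced into target-shaped chunks
    (the real work would happen in Rust, behind PyO3)."""
    raise NotImplementedError  # sketch only
```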

If that works then I could expand the project to basically be a full Rust implementation of rechunker, focused on high performance on a single machine.

And, thank you @tomwhite for the link to Stephan’s multi-stage algorithm!

Distributed Arrays Working Group

I’ll attend on Monday 31st March!

Using DataFusion to build an xarray-compute-backend

Ooh… Do you have links to previous discussions on using DataFusion for multi-dimensional arrays? (I had a quick search through the xarray GitHub issues but failed to find anything… I assume I’m looking in the wrong place!).

That said, I might be cooling off on the idea of using DataFusion any time soon. The more I learn about DataFusion, the more complicated it sounds!

Reducing the number of buffer copies in reading and writing to Zarr

This definitely sounds interesting! But I may start with the rechunking work (assuming I get time to work on this stuff)…


Excellent :grinning_face:

To check I understand correctly: Is the reason that in-memory rechunking is slow in Python (with numpy)

It’s actually way worse than that - Cubed isn’t even doing this in-memory right now, it’s doing a shitton of potentially unnecessary IO when running on a single machine. The context is here, but basically for array computations rechunking is analogous to the “shuffle” in other distributed systems, and is the primary place where distributed array computations become slow or fall over. Cubed aims for stability at scale by starting with a bounded-memory scalable implementation of rechunk (i.e. the original Pangeo rechunker algorithm), and building everything around that. But the cost of this approach is that that algorithm writes out intermediate copies to disk, so it’s not an in-memory rechunk. That makes more sense in a massively-horizontal serverless cloud context (what Cubed was originally designed for) than it does running on a single machine, which is why we’ve ended up with this optimization opportunity on a single machine.

I could start with a rechunker that is entirely focused on in-memory rechunking on a single machine. I’m guessing this wouldn’t do any IO? Instead, you’d pass it a list of chunks that are already in memory. It’d return a new list of chunks in your requested chunk specification. It’d have a Python API that Cubed could use. Have I got the right idea?

If that works then I could expand the project to basically be a full Rust implementation of rechunker, focused on high performance on a single machine.

Exactly: the opportunity here is to teach Cubed when it can actually do the entire rechunk in-memory, saving a massive amount of IO whilst retaining the bounded-memory property. As you say, you could do that first with numpy, then once that works make it even faster with Rust.

Ooh… Do you have links to previous discussions on using DataFusion for multi-dimensional arrays?

No, this was only in-person - I don’t think anyone has actually tried this.

Cool, this all makes sense, and thank you for taking the time to explain all this to me!

I definitely see your point that it’d be great to improve Cubed’s rechunking performance on a single machine. And, ultimately, I’d definitely like to help with that.

But, if it’s OK, I might start by building a Rust tool that rechunks Zarrs quickly on a single machine. With a Python API. Then we can figure out how to hook this into Cubed.

I must admit that I’m a little nervous about the idea of my first rechunking project being doing major surgery on Cubed, especially given that I don’t have any experience with the Cubed codebase! So, partly for selfish reasons (it’s more fun!) I’d like to start by building a stand-alone Rust project, so I can get to grips with rechunking (which should allow me to more quickly understand Cubed’s code), whilst hopefully also building a useful stand-alone tool.

And then we can explore ways to hook this into Cubed! I really like your idea that a tool that’s efficient at rechunking on a single machine could be a useful component in a multi-level rechunking system running across multiple machines (like Cubed).

Does that sound OK?

(I’ll be at the “Distributed Arrays Working Group” on Monday if you’d prefer to discuss then!)

That sounds fine too! Let’s discuss the details synchronously.


Cool, sounds good!

BTW, a quick update on using DataFusion to build an xarray-compute-backend… In this thread on GitHub, maximedion2 concludes that it probably isn’t possible to teach DataFusion to use arrow::Tensor, which implies that it’s not possible to process multi-dimensional arrays in DataFusion.

But DataFusion’s architecture is very well documented, so it should be possible to borrow ideas from it.

But, for now at least, I’m gonna focus on rechunking, and try to resist falling further down the rabbit hole 🐰. (Although the way that I’m currently thinking of limiting memory use in the rechunker is loosely inspired by the way that DataFusion limits memory. Because it’s simple!)