Async and sync interfaces

Hi everyone,
This is a fairly general question and I am not sure this forum is the right place to ask it; maybe a Python forum would be better.
I am asking here because I have noticed a trend in how some libraries in the Pangeo stack handle this.

A typical issue for I/O libraries is that they would like to provide both a sync and an async interface.
There are a few ways to do that. For example, SQLAlchemy uses the greenlet library (though I am not very familiar with their architecture), while azure-sdk simply duplicates all the code and provides both sync and async clients (sync in azure.storage.blob, async in azure.storage.blob.aio); the same goes for httpx and gql (a GraphQL client).
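
To make the "duplicate everything" option concrete, this is roughly what it looks like from the caller's side with httpx (just a sketch; the URL is a placeholder):

```python
import asyncio
import httpx

# Sync client: blocking calls, no event loop needed.
def fetch_sync(url: str) -> str:
    with httpx.Client() as client:
        return client.get(url).text

# Async client: the same operations, maintained a second time
# behind an awaitable API.
async def fetch_async(url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

print(fetch_sync("https://example.com"))
print(asyncio.run(fetch_async("https://example.com")))
```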

But (some parts of) fsspec, zarr and dask-distributed seem to have only an async implementation, plus some mechanism that provides a sync interface on top of it.
Namely, there is a separate dedicated I/O thread running an event loop, and all sync method calls simply submit work to this event loop. Usually there is also a convenient "sync wrapper" that converts the async methods into sync ones, roughly as sketched below.
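
Here is a minimal toy version of that pattern, as I understand it (this is my own sketch, not the actual fsspec/zarr code; names like `IOThread` and `getitem` are made up):

```python
import asyncio
import threading

class IOThread:
    """One background thread running an asyncio event loop.

    Sync callers submit coroutines to it and block until the result is
    ready; async callers can use the coroutines directly.
    """

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def sync(self, coro):
        # Schedule the coroutine on the background loop and wait for it.
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result()

_io = IOThread()

async def _getitem_async(key):
    await asyncio.sleep(0.01)          # stand-in for real network I/O
    return f"value for {key}"

def getitem(key):
    # "Sync wrapper": the same async implementation, driven through
    # the dedicated I/O thread.
    return _io.sync(_getitem_async(key))

print(getitem("a"))                    # usable from plain sync code
```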

Part of my day job is building "mini-SDKs" for external APIs and whatnot, and I have myself experimented with many ways of doing this generally and easily.

It seems that the dask/fsspec/zarr developers have settled on this pattern. I am curious: has there been any discussion of why it is preferred (I couldn't find any)? What are the advantages and disadvantages of this approach?
Have there been any attempts to polish this approach and publish it as a library, so that people can use it to build their own I/O libraries? That way the dedicated I/O thread could perhaps be shared by multiple libraries, and it could become a standard way of solving this problem.

Thanks!

This is a great question and something that library maintainers are constantly wrestling with.

The core issue lies in how Python chose to implement async functionality: via coroutines. Any function that wants to await something has to itself be an async function, and this requirement propagates up the call stack, effectively forcing your entire program to become async.
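
A toy example of that "coloring" effect (nothing library-specific here):

```python
import asyncio

async def read_chunk(key):
    await asyncio.sleep(0.01)      # pretend this is network I/O
    return b"..."

# Any caller that wants to await read_chunk must itself become async...
async def load_array(keys):
    return [await read_chunk(k) for k in keys]

# ...and so must *its* callers, all the way up to an event loop.
async def main():
    return await load_array(["a", "b"])

asyncio.run(main())
```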

The problem in the Pangeo-verse is that most data science users (our main user persona) don’t want to learn async programming…they just want their I/O to be fast when loading data over the network.

Our first foray into async once Python 3 came out was led by @martindurant, developer of fsspec, who figured out the clever I/O thread trick. This allowed fsspec libraries such as s3fs to fetch data using async libraries like aiobotocore while still presenting a standard sync interface to Zarr. This produced tangible performance benefits, e.g. when fetching many small objects concurrently.
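
From the user's point of view it looks something like this (a sketch; the bucket and keys are made up, and `anon=True` just avoids needing credentials for the example):

```python
import s3fs

# Plain sync code: no async/await anywhere in user land.
fs = s3fs.S3FileSystem(anon=True)

# Hypothetical object names, purely for illustration.
keys = [f"my-bucket/chunks/{i}" for i in range(100)]

# Behind this single blocking call, the requests are handed to the
# dedicated I/O thread and fetched concurrently via aiobotocore.
data = fs.cat(keys)   # dict mapping key -> bytes
```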

Now this approach is propagating down the stack. Last year we implemented async in Zarr, allowing you to asynchronously get data from arrays, e.g.

data = await array.getitem(...)
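
The payoff is that callers can then drive many such requests concurrently; for example (a sketch that just assumes the awaitable `getitem` above):

```python
import asyncio

async def load_all(arrays, selection):
    # Fetch the same selection from several arrays concurrently
    # instead of one after another.
    return await asyncio.gather(*(a.getitem(selection) for a in arrays))
```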

However, we still needed a sync interface, so we opted to copy the fsspec I/O thread approach.

Even more recently, Xarray is implementing some async methods for loading data (work by @TomNicholas), and we are facing exactly the same question: keep a bunch of mostly duplicated code for sync vs. async load methods, or introduce yet another I/O thread. In this case I personally argued that we should just duplicate things, because only a small number of methods are affected.


You’ve correctly identified a problem, which is that it doesn’t make sense for there to be many of these I/O threads floating around. Furthermore, this approach is fragile, to say the least. There are all kinds of edge cases around how it interacts with other event loops, threading, processes, shutting down the threads cleanly, etc.
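
Interaction with already-running event loops (e.g. in Jupyter) is a good example of why naive sync wrappers fail, and why the dedicated-thread trick, which avoids this particular failure, still needs careful handling of loop detection, forking, and clean shutdown:

```python
import asyncio

async def fetch():
    await asyncio.sleep(0.01)   # stand-in for real network I/O
    return b"..."

def load():
    # A tempting "sync wrapper" that works in a plain script but fails
    # inside Jupyter/IPython, where the main thread already runs a loop:
    #   RuntimeError: asyncio.run() cannot be called from a running event loop
    return asyncio.run(fetch())
```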

I think there would potentially be value in having a single, centralized I/O thread that the whole stack could use, as you proposed. In principle, Zarr, fsspec, and maybe Xarray could all be refactored to use a shared library. However, this is also risky: as mentioned above, the approach is pretty fragile and has been tuned delicately to work in all kinds of different situations.

Finally, it’s worth noting that a library does exist to address this problem. However, it doesn’t work reliably for all of our use cases, so we didn’t end up using it.
