Async and sync interfaces

Hi everyone,
This is a fairly general question and I am not sure this forum is the right place to ask it; maybe a Python forum would be better.
I am asking here because I have noticed a trend in how some libraries in the Pangeo stack handle this.

A typical issue for I/O libraries is that they would like to provide both a sync and an async interface.
There are a few ways to do that. For example, SQLAlchemy uses the greenlet library (though I am not very familiar with their architecture), while azure-sdk simply duplicates all the code and provides both sync and async clients (sync in azure.storage.blob, async in azure.storage.blob.aio); the same goes for httpx and gql (a GraphQL client).
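
To make the "duplicate everything" option concrete, this is roughly what it looks like from the caller's side with httpx (just a sketch; the URL is a placeholder):

```python
import asyncio
import httpx

# Sync client: blocking calls, no event loop needed.
def fetch_sync(url: str) -> str:
    with httpx.Client() as client:
        return client.get(url).text

# Async client: the same operations, maintained a second time
# behind an awaitable API.
async def fetch_async(url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

print(fetch_sync("https://example.com"))
print(asyncio.run(fetch_async("https://example.com")))
```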

But (some parts of) fsspec, zarr and dask-distributed seem to have only an async implementation, plus some mechanism that provides a sync interface on top of it.
Namely, there is a separate dedicated I/O thread running an event loop, and all sync method calls simply submit work to this event loop. Usually there is also a convenient "sync wrapper" that converts the async methods into sync ones, roughly as sketched below.
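
Here is a minimal toy version of that pattern, as I understand it (this is my own sketch, not the actual fsspec/zarr code; names like `IOThread` and `getitem` are made up):

```python
import asyncio
import threading

class IOThread:
    """One background thread running an asyncio event loop.

    Sync callers submit coroutines to it and block until the result is
    ready; async callers can use the coroutines directly.
    """

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def sync(self, coro):
        # Schedule the coroutine on the background loop and wait for it.
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result()

_io = IOThread()

async def _getitem_async(key):
    await asyncio.sleep(0.01)          # stand-in for real network I/O
    return f"value for {key}"

def getitem(key):
    # "Sync wrapper": the same async implementation, driven through
    # the dedicated I/O thread.
    return _io.sync(_getitem_async(key))

print(getitem("a"))                    # usable from plain sync code
```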

Part of my day job is building "mini-SDKs" for external APIs and whatnot, and I have myself experimented with many ways of doing this generally and easily.

It seems that the dask/fsspec/zarr developers have settled on this pattern. I am curious: has there been any discussion of why it is preferred (I couldn't find any)? What are the advantages and disadvantages of this approach?
Have there been any attempts to polish this approach and publish it as a library, so that people can use it to build their own I/O libraries? That way the dedicated I/O thread could perhaps be shared by multiple libraries, and it could become a standard way of solving this problem.

Thanks!

This is a great question and something that library maintainers are constantly wrestling with.

The core issue lies in how Python chose to implement async functionality: via coroutines. Any function that wants to await something has to itself be an async function, and this requirement propagates up the call stack, effectively forcing your entire program to become async.
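
A toy example of that "coloring" effect (nothing library-specific here):

```python
import asyncio

async def read_chunk(key):
    await asyncio.sleep(0.01)      # pretend this is network I/O
    return b"..."

# Any caller that wants to await read_chunk must itself become async...
async def load_array(keys):
    return [await read_chunk(k) for k in keys]

# ...and so must *its* callers, all the way up to an event loop.
async def main():
    return await load_array(["a", "b"])

asyncio.run(main())
```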

The problem in the Pangeo-verse is that most data science users (our main user persona) don’t want to learn async programming…they just want their I/O to be fast when loading data over the network.

Our first foray into async once Python 3 came out was led by @martindurant, developer of fsspec, who figured out the clever I/O thread trick. This allowed fsspec libraries such as s3fs to fetch data using async libraries like aiobotocore while still presenting a standard sync interface to Zarr. This produced tangible performance benefits, e.g. when fetching many small objects concurrently.
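
From the user's point of view it looks something like this (a sketch; the bucket and keys are made up, and `anon=True` just avoids needing credentials for the example):

```python
import s3fs

# Plain sync code: no async/await anywhere in user land.
fs = s3fs.S3FileSystem(anon=True)

# Hypothetical object names, purely for illustration.
keys = [f"my-bucket/chunks/{i}" for i in range(100)]

# Behind this single blocking call, the requests are handed to the
# dedicated I/O thread and fetched concurrently via aiobotocore.
data = fs.cat(keys)   # dict mapping key -> bytes
```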

Now this approach is propagating down the stack. Last year we implemented async in Zarr, allowing you to asynchronously get data from arrays, e.g.

data = await array.getitem(...)
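
The payoff is that callers can then drive many such requests concurrently; for example (a sketch that just assumes the awaitable `getitem` above):

```python
import asyncio

async def load_all(arrays, selection):
    # Fetch the same selection from several arrays concurrently
    # instead of one after another.
    return await asyncio.gather(*(a.getitem(selection) for a in arrays))
```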

However, we still needed a sync interface, so we opted to copy the fsspec I/O thread approach.

Even more recently, Xarray is implementing some async methods for loading data (work by @TomNicholas), and we are facing exactly the same question: keep a bunch of mostly duplicated code for sync vs. async load methods, or introduce yet another I/O thread. In this case I personally argued that we should just duplicate things, because only a small number of methods are affected.


You’ve correctly identified a problem, which is that it doesn’t make sense for there to be many of these I/O threads floating around. Furthermore, this approach is fragile, to say the least. There are all kinds of edge cases around how it interacts with other event loops, threading, processes, shutting down the threads cleanly, etc.
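
Interaction with already-running event loops (e.g. in Jupyter) is a good example of why naive sync wrappers fail, and why the dedicated-thread trick, which avoids this particular failure, still needs careful handling of loop detection, forking, and clean shutdown:

```python
import asyncio

async def fetch():
    await asyncio.sleep(0.01)   # stand-in for real network I/O
    return b"..."

def load():
    # A tempting "sync wrapper" that works in a plain script but fails
    # inside Jupyter/IPython, where the main thread already runs a loop:
    #   RuntimeError: asyncio.run() cannot be called from a running event loop
    return asyncio.run(fetch())
```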

I think there would potentially be value in having a single, centralized I/O thread that the whole stack could use, as you proposed. In principle, Zarr, fsspec, and maybe Xarray could all be refactored to use a shared library. However, this is also risky: as mentioned above, the approach is pretty fragile and has been tuned delicately to work in all kinds of different situations.

Finally, it’s worth noting that a library does exist to address this problem. However, it doesn’t work reliably for all of our use cases, so we didn’t end up using it.
