Hi everyone,
This is a very general question, and I am not sure this forum is the right place to ask; maybe the Python forum would be better.
I am asking here because I noticed a trend in how some of the Pangeo stack libraries handle this.
A typical challenge for IO libraries is that they want to provide both a sync and an async interface.
There are a few ways to do that. For example, SQLAlchemy uses the greenlet library (though I am not very familiar with their architecture), while azure-sdk simply duplicates all the code and provides both sync and async clients (sync in azure.storage.blob, async in azure.storage.blob.aio); httpx and gql (a GraphQL client) do the same.
But (some parts of) fsspec, zarr and dask-distributed seem to have only an async implementation, plus some mechanism that provides a sync interface on top of it. Namely, there is a separate dedicated IO thread running an event loop, and all sync method calls just submit work to this event loop. Usually there is also a convenient "sync wrapper" that converts the async methods into sync ones.
- filesystem_spec/fsspec/asyn.py at master · fsspec/filesystem_spec · GitHub
- zarr-python/src/zarr/core/sync.py at main · zarr-developers/zarr-python · GitHub
Part of my day job is building "mini-SDKs" for external APIs and the like, and I have experimented with several ways of doing this generically and conveniently.
It seems that the dask/fsspec/zarr people have settled on this pattern. I am curious: has there been any discussion of why this approach is preferred (I couldn't find any)? What are its advantages and disadvantages?
Have there been any attempts to polish this approach and publish it as a standalone library, so that people can use it to build their own IO libraries? That way the dedicated IO thread could perhaps be shared by multiple libraries, and it could become a standard solution to this problem.
Thanks!