Extremly slow write to S3 bucket with xarray.Dataset.to_zarr

I just got off a call with @sotosoul, and I’m documenting what we learned here for the community.

The case he is testing turns out to exacerbate a well-known performance problem with Xarray / Zarr / S3. Specifically, he is writing a dataset with many data variables over a very high latency internet connection (writing an S3 bucket in us-west-2 from a laptop in Copenhagen).

The performance problem is that Xarray initializes the variables in a serial loop, each of which involves a round trip to S3. So the time to initialize the dataset scales roughly like N * number_of_variables * latency. If the latency is 50ms and you have 20 variables, you get about 10s (as in my example, run from the same cloud region). If your latency is 500ms (not unreasonably for a transatlantic ping), you get 200s.

But there is good news! There is a PR in the works in Xarray (by @dcherian) which will do all of these initialization operations concurrently, meaning that we should eliminate the number_of_variables scaling and instead just get initialization that scales with latency alone!

You can already see some massive speedups for this scenario reported there.:tada: I just tried that PR on this dataset, and my initialization time went from 12s to 368ms.

My hope is that we can have this feature released in Xarray very soon.

6 Likes