New Cloud Tensor I/O Benchmarks - Zarr is fast now!

Given all of the recent development on Zarr, we decided to revisit the question of Zarr performance in the cloud. We recently published a blog post with our results.

Here’s the key figure:

Bottom line: Zarr Python (using either Icechunk or Obstore as the storage engine) can now fully saturate the network between EC2 and S3, achieving the maximum physically possible throughput when reading and writing data. This holds true in the presence of compression and across a wide range of chunk sizes.
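For context, here’s roughly what that read path looks like from Python. This is a minimal sketch, not the benchmark code itself: it assumes zarr-python 3 with its `ObjectStore` adapter around obstore’s `S3Store`, and the bucket, prefix, and array names are hypothetical. Credentials and region handling will differ in your environment.

```python
# Minimal sketch: read a Zarr array from S3 via obstore (hypothetical bucket/paths).
import zarr
from obstore.store import S3Store
from zarr.storage import ObjectStore

# obstore issues the concurrent byte-range requests against S3.
s3 = S3Store("my-bucket", prefix="benchmarks/data.zarr", region="us-east-1")

# Wrap the obstore store so zarr-python can use it as a storage backend.
store = ObjectStore(s3, read_only=True)

group = zarr.open_group(store, mode="r")
data = group["temperature"][:]  # chunks are fetched concurrently
print(data.shape, data.dtype)
```

The Icechunk path is analogous: you open an Icechunk repository on S3 and hand its session store to `zarr.open_group` in the same way.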

In our benchmarks, Zarr (with Icechunk and Obstore) is now faster than TensorStore or TileDB. It is also faster than Polars, a dataframe library (that comparison is necessarily limited to 1D arrays).

The benchmarks are fully open source and available here; you can run them yourself to verify the results.

The post also goes into some detail about the long journey we’ve been on to improve Zarr performance. Many people from this community have contributed to Zarr in these areas. This progress is a huge win for the scientific data community. :heart:


That is so cool to see!


The optimal chunk size for Icechunk is 3-15 MB

Interesting that this size range basically agrees with Performance guidelines for Amazon S3 - Amazon Simple Storage Service, which says:

Typical sizes for byte-range requests are 8 MB or 16 MB
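As a back-of-envelope illustration (my numbers, not from the post or the S3 docs): for uncompressed float32 data, a chunk shape like (1, 1024, 2048) lands at exactly 8 MiB, right in that range. Compression will shrink what actually hits S3, so compressed chunks can start somewhat larger. A hypothetical sketch using zarr-python 3:

```python
# Back-of-envelope chunk sizing: target roughly 8-16 MB per chunk (illustrative shapes).
import numpy as np
import zarr

itemsize = np.dtype("float32").itemsize          # 4 bytes per element
chunk_shape = (1, 1024, 2048)                    # one 2D slice per chunk
chunk_bytes = itemsize * int(np.prod(chunk_shape))
print(f"chunk size: {chunk_bytes / 2**20:.1f} MiB")  # 8.0 MiB

# Create an array with those chunks (local store here, just for illustration).
z = zarr.create_array(
    store="example.zarr",
    shape=(365, 1024, 2048),
    chunks=chunk_shape,
    dtype="float32",
)
print(z.chunks)
```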

In the blog post you only use a single machine and saturate the single-machine I/O, right?

But previously in the Pangeo community we used to promote distributed Dask clusters on multiple machines, so that single-machine I/O wouldn’t be a limitation, right?

I’m a bit confused.

There is nothing wrong with scaling out to multiple machines, but it is much more expensive and complex than using a single machine. Scaling out should only be done once you’ve hit the limits of a single machine.

In the past, I believe we were guilty of scaling out to improve net throughput before we had reached the physical limits of a single machine. Ultimately this is a waste of money: it’s never going to be cheaper or faster to use a distributed cluster where a single-node, multithreaded architecture will suffice. For example, in our 2020 Earthcube Poster we got throughputs of 3 GB/s (24 Gbps) on a cluster of 40 nodes. That implies each individual node was only getting ~600 Mbps, well below the network limit. Now we are getting 12-20 Gbps on a single modest-sized VM!

Put the other way around: with a cluster of 40 such nodes, you should be able to get a throughput of 800 Gbps (100 GB/s), which means processing 1 TB of data in about 10 seconds. We have not done that experiment, but there is no reason to think it would not work.
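For the curious, here is the arithmetic behind those numbers spelled out (nothing beyond what’s stated above; the per-node figures are the post’s own):

```python
# Back-of-envelope throughput math (illustrative only).

# 2020 Earthcube result: ~3 GB/s aggregate on a 40-node cluster.
cluster_gbps_2020 = 3 * 8                      # 3 GB/s ~= 24 Gbps aggregate
per_node_mbps_2020 = cluster_gbps_2020 / 40 * 1000
print(f"2020: ~{per_node_mbps_2020:.0f} Mbps per node")          # ~600 Mbps

# Today: a single VM reaches 12-20 Gbps, so 40 such nodes could in principle give:
single_node_gbps = 20
cluster_gbps_now = single_node_gbps * 40       # 800 Gbps ~= 100 GB/s
seconds_per_tb = 1e12 / (cluster_gbps_now / 8 * 1e9)
print(f"now: {cluster_gbps_now} Gbps aggregate, ~{seconds_per_tb:.0f} s per TB")
```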
