I spent some time playing around with this. My conclusion is that the parquet format and pyarrow parquet reader are just more optimized for sub-selecting specific columns than the combination of zarr + xarray + dask. In your parquet dataset, each gauge_id is its own column. Parquet is specifically billed as a “columnar file format”. Supporting efficient selection of columns is a core design goal of the entire parquet + arrow ecosystem. The code path for this subsetting is likely implemented in C++ in the arrow core.
In contrast, with the Zarr dataset, gauge_id is a dimension. You have to first open the whole dataset, then do a selection to ask for the columns you need, and then do a compute, which triggers dask to start requesting chunks. Finally, these chunks have to be put together into a single contiguous numpy array. That’s just more work, and moreover, the code has not been specifically optimized for this task.
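For reference, that access pattern looks roughly like this (just a sketch: I'm guessing at the store URL and the variable name pred, assuming gauge_id is a coordinate you can label-select on, and reusing the same sta list of gauge_ids as in the parquet read):

import xarray as xr

# Open the whole store lazily; chunks={} gives Dask arrays with the on-disk chunking.
ds = xr.open_dataset("s3://rsignellbucket1/testing/zarr/pred.zarr", engine="zarr", chunks={})

# Pick out the gauges of interest along the gauge_id dimension...
subset = ds["pred"].sel(gauge_id=sta)

# ...and only now trigger Dask to fetch chunks and assemble one contiguous numpy array.
arr = subset.compute()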
Turning on logging
It can be interesting to see how many http requests are being made. This is probably the biggest determinant of performance. Here is how I did it.
import logging
import sys

# Route fsspec's DEBUG messages to stdout so each read it issues shows up in the notebook.
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

level = "DEBUG"
logger = logging.getLogger("fsspec")
if logger.hasHandlers():
    # Clear any existing handlers so re-running this cell doesn't duplicate output.
    logger.handlers.clear()
logger.setLevel(getattr(logging, level))
logger.addHandler(handler)
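As an aside, if you just want a count of requests rather than the full log, a small counting handler can sit alongside the stream handler (this is a sketch using only the standard-library logging API; matching on "read" in the message text is a rough heuristic for range requests):

class CountingHandler(logging.Handler):
    """Count matching log records instead of printing them."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def emit(self, record):
        # Each fsspec "read: start - end" debug record corresponds to one range request.
        if "read" in record.getMessage():
            self.count += 1

counter = CountingHandler()
logger.addHandler(counter)
# ... do the pd.read_parquet() call here, then:
print("range requests seen:", counter.count)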
Then when I run
%time df_parquet = pd.read_parquet(fs.open(parquet_dataset), columns=sta)
I see the following
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 381199183 - 381264719
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 380768177 - 381264711
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 381199183 - 381264719
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 380768177 - 381264711
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 362407460 - 380768058
CPU times: user 411 ms, sys: 216 ms, total: 627 ms
Wall time: 1.22 s
It’s interesting to see that, despite having requested 60 different columns, arrow decides to make only 5 http calls.
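To dig a bit deeper into why so few calls are enough, you can inspect the parquet footer and see where the requested column chunks actually live in the file; if they sit close together, the byte ranges can be coalesced into a handful of reads. Here is a sketch that reuses fs, parquet_dataset, and sta from above:

import pyarrow.parquet as pq

# Reading .metadata only touches the footer, not the data pages.
with fs.open(parquet_dataset) as f:
    md = pq.ParquetFile(f).metadata

print(md.num_row_groups, "row groups,", md.num_columns, "columns")

# Byte offsets and compressed sizes of the requested columns in the first row group.
rg = md.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.path_in_schema in sta:
        print(col.path_in_schema, col.file_offset, col.total_compressed_size)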
I couldn’t get logging to work with Zarr, either with or without Dask. (@martindurant - do you know why not?). However, I am quite certain that Zarr will require at least 60 different http calls (at least one per chunk). These Zarr chunks are quite small (1.4 MB), so this is likely a place where we are paying a price for having many small files. The upcoming support for sharding in Zarr (chunks within chunks) might mitigate this.
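One guess about the missing log messages: the parquet read goes through an fsspec file object (logged under the "fsspec" logger), while the Zarr reads go through the mapper's S3 calls, which I believe are logged under the "s3fs" logger instead. If that's right, pointing the same handler at that logger should surface them:

# Guess: Zarr's S3 traffic may be logged by s3fs itself rather than by fsspec.
s3_logger = logging.getLogger("s3fs")
if s3_logger.hasHandlers():
    s3_logger.handlers.clear()
s3_logger.setLevel(logging.DEBUG)
s3_logger.addHandler(handler)  # reuse the stdout handler defined earlier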
So bottom line, having a compact, cloud-optimized file format and a C++-optimized reader for that format is clearly the better fit for this access pattern. It’s not surprising to me that Parquet + Arrow win this benchmark. This is precisely the use case they were optimized for, and they are just much more mature technologies than Zarr. The fact that it is only a 2x difference is, to me, actually quite encouraging!
That said, it’s nice to have some use cases to drive future optimization of Zarr. I’m optimistic that sharding could help. Having more diagnostics from fsspec (e.g. logs) would also help diagnose the difference.
Hope that is helpful.