Why is Parquet/Pandas faster than Zarr/Xarray here?

I spent some time playing around with this. My conclusion is that the parquet format and pyarrow parquet reader are just more optimized for sub-selecting specific columns than the combination of zarr + xarray + dask. In your parquet dataset, each gauge_id is its own column. Parquet is specifically billed as a “columnar file format”. Supporting efficient selection of columns is a core design goal of the entire parquet + arrow ecosystem. The code path for this subsetting is likely implemented in C++ in the arrow core.
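To make that concrete, here is roughly the code path pandas takes under the hood, written as a direct pyarrow call. This is just a sketch, reusing the `fs`, `parquet_dataset`, and `sta` names from your benchmark:

import pyarrow.parquet as pq

with fs.open(parquet_dataset) as f:
    # pyarrow first reads the footer metadata, then fetches only the byte
    # ranges belonging to the requested columns (the "columnar" part).
    table = pq.read_table(f, columns=sta)
df_parquet = table.to_pandas()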

In contrast, with the Zarr dataset, gauge_id is a dimension. You have to first open the whole dataset lazily, then do a selection along that dimension to ask for the gauges you need, and then call compute, which triggers Dask to start requesting chunks. Finally, these chunks have to be assembled into a single contiguous numpy array. That’s just more work, and moreover, that code path has not been specifically optimized for this task.
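Spelled out, that path looks something like the following sketch. Here `zarr_store` and the variable name "pred" are placeholders (adjust them to the real dataset); `sta` is the list of gauges from your benchmark:

import xarray as xr

ds = xr.open_zarr(zarr_store)            # lazily open the whole dataset (dask-backed)
subset = ds["pred"].sel(gauge_id=sta)    # label-based selection along the gauge_id dimension
values = subset.compute()                # dask now fetches every chunk the selection touches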

Turning on logging

It can be interesting to see how many http requests are being made. This is probably the biggest determinant of performance. Here is how I did it.

import logging
import sys

# Send fsspec's debug output to stdout so it shows up in the notebook.
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

# Reset any existing handlers on the fsspec logger and turn on DEBUG logging,
# which emits one line per ranged read against the object store.
level = "DEBUG"
logger = logging.getLogger("fsspec")
if logger.hasHandlers():
    logger.handlers.clear()
logger.setLevel(getattr(logging, level))
logger.addHandler(handler)

Then when I run

time df_parquet = pd.read_parquet(fs.open(parquet_dataset), columns=sta)

I see the following

fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 381199183 - 381264719
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 380768177 - 381264711
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 381199183 - 381264719
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 380768177 - 381264711
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 362407460 - 380768058
CPU times: user 411 ms, sys: 216 ms, total: 627 ms
Wall time: 1.22 s

It’s interesting to see that, despite having requested 60 different columns, arrow decides to make only 5 http calls.
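If you want that count programmatically rather than by eyeballing the log, one option (a sketch, assuming the fsspec logger is configured as above) is to attach a second handler that just counts the "read:" records:

import logging

class CountingHandler(logging.Handler):
    """Count fsspec DEBUG records that correspond to ranged reads."""
    def __init__(self):
        super().__init__()
        self.reads = 0

    def emit(self, record):
        if "read:" in record.getMessage():
            self.reads += 1

counter = CountingHandler()
logging.getLogger("fsspec").addHandler(counter)

df_parquet = pd.read_parquet(fs.open(parquet_dataset), columns=sta)
print(f"{counter.reads} ranged read requests")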

I couldn’t get logging to work with Zarr, either with or without Dask. (@martindurant - do you know why not?). However, I am quite certain that Zarr will require at least 60 different http calls (at least one per chunk). These Zarr chunks are quite small (1.4 MB), so this is likely a place where we are paying a price for having many small files. The upcoming support for sharding in Zarr (chunks within chunks) might mitigate this.
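Even without logs, you can get a lower bound on the request count from the chunk layout, since each chunk touched by the selection is at least one GET. A back-of-the-envelope sketch, again assuming a `zarr_store` and a variable called "pred":

import numpy as np
import zarr

z = zarr.open(zarr_store)["pred"]
print(z.shape, z.chunks, z.nchunks)    # chunk layout and total number of chunks
chunk_mb = z.dtype.itemsize * np.prod(z.chunks) / 1e6
print(f"~{chunk_mb:.1f} MB per chunk (uncompressed)")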

So, bottom line: having a compact, cloud-optimized file format and a C++-optimized reader for that format clearly pays off. It’s not surprising to me that Parquet + Arrow win this benchmark. This is precisely the use case they were optimized for, and they are simply much more mature technologies than Zarr. The fact that it is only a 2x difference is, to me, actually quite encouraging!

That said, it’s nice to have some use cases to drive future optimization of Zarr. I’m optimistic that sharding could help. Having more diagnostics from fsspec (e.g. logs) would also help diagnose the difference.

Hope that is helpful.
