Why is Parquet/Pandas faster than Zarr/Xarray here?

I’ve been telling people that for extracting streamflow time series data at selected stream gage locations, Parquet is nice, but Zarr should be just as good if it’s chunked the same way and uses the same compression scheme.

So I created an example, and they are both fast, but reading Parquet in Pandas is faster than reading Zarr in Xarray:

https://nbviewer.org/gist/15190e6d41c8bc8482056fec9e873bf7

Arguably we still might want to use Zarr because of the richer data model, but I am curious why Parquet/Pandas is faster here. Is it the handling of metadata?


There are several dimensions to your comparison that go beyond just zarr vs parquet. You are loading the parquet data to pandas, while you’re loading the zarr data to xarray. So you could just as easily title your post “Why is Pandas faster than Xarray here?” And you are using Dask for the Xarray example, while your Pandas example does not use Dask.

I would try to remove these confounding factors if your goal is simply to benchmark one format vs. another.
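
For example, one way to take Dask out of the Zarr read entirely is to pass chunks=None to open_zarr. Just a sketch: fs and sta are the names from the notebook, while zarr_dataset and the "streamflow" variable name are placeholders.

import xarray as xr

# chunks=None keeps the arrays out of dask, so the read is plain zarr + numpy,
# closer in spirit to the single-threaded pandas read
ds = xr.open_zarr(fs.get_mapper(zarr_dataset), chunks=None)
df_zarr = ds["streamflow"].sel(gage_id=sta).to_pandas()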

I was curious what would be faster for a Pangeo user who wants to extract some streamflow data.

I didn’t think we could read Zarr in Pandas and I didn’t think we could read Parquet in Xarray.

I think that assessment is correct. But it just means that there are at least two confounding factors involved that make the direct comparison of “file format speed” more complicated.

If you want to find out why one is faster than the other, %prun is a great place to start. That will tell you where the code is spending its time.
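
For example (just a sketch; fs, parquet_dataset, and sta are the names from the notebook, and zarr_dataset is a placeholder for the Zarr store path):

import pandas as pd
import xarray as xr

# profile the parquet read; -l 20 limits the report to the top 20 entries
%prun -l 20 pd.read_parquet(fs.open(parquet_dataset), columns=sta)

# profile the zarr/xarray read for comparison
%prun -l 20 xr.open_zarr(fs.get_mapper(zarr_dataset)).sel(gage_id=sta).compute()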

I spent some time playing around with this. My conclusion is that the parquet format and pyarrow parquet reader are just more optimized for sub-selecting specific columns than the combination of zarr + xarray + dask. In your parquet dataset, each gauge_id is its own column. Parquet is specifically billed as a “columnar file format”. Supporting efficient selection of columns is a core design goal of the entire parquet + arrow ecosystem. The code path for this subsetting is likely implemented in C++ in the arrow core.

In contrast, with the Zarr dataset, gauge_id is a dimension. You have to first open the whole dataset, then do a selection to ask for the columns you need, and then do a compute, which triggers dask to start requesting chunks. Finally, these chunks have to be put together into a single contiguous numpy array. That’s just more work, and moreover, the code has not been specifically optimized for this task.
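
Concretely, that path looks something like this (a sketch only; fs and sta are from the notebook, while zarr_dataset and the "streamflow" variable name are placeholders):

import xarray as xr

# 1. open the whole dataset lazily (dask-backed arrays)
ds = xr.open_zarr(fs.get_mapper(zarr_dataset))
# 2. select just the gauges of interest
subset = ds["streamflow"].sel(gage_id=sta)
# 3. compute: dask requests the chunks and assembles them into one numpy array
result = subset.compute()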

Turning on logging

It can be interesting to see how many http requests are being made. This is probably the biggest determinant of performance. Here is how I did it.

import logging
import sys

# send fsspec's DEBUG messages to stdout so every byte-range request is visible
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

level = "DEBUG"
logger = logging.getLogger("fsspec")
if logger.hasHandlers():
    logger.handlers.clear()  # avoid duplicate output when re-running this cell
logger.setLevel(getattr(logging, level))
logger.addHandler(handler)

Then when I run

%time df_parquet = pd.read_parquet(fs.open(parquet_dataset), columns=sta)

I see the following

fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 381199183 - 381264719
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 380768177 - 381264711
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 381199183 - 381264719
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 380768177 - 381264711
fsspec - DEBUG - <File-like object S3FileSystem, rsignellbucket1/testing/parquet/pred.parquet> read: 362407460 - 380768058
CPU times: user 411 ms, sys: 216 ms, total: 627 ms
Wall time: 1.22 s

It’s interesting to see that, despite having requested 60 different columns, arrow decides to make only 5 http calls.

I couldn’t get logging to work with Zarr, either with or without Dask (@martindurant - do you know why not?). However, I am quite certain that Zarr will require at least 60 different http calls (at least one per chunk). These Zarr chunks are quite small (1.4 MB), so this is likely a place where we are paying a price for having many small files. The upcoming support for sharding in Zarr (chunks within chunks) might mitigate this.

So, bottom line: having a compact, cloud-optimized file format and a C++-optimized reader for that format is clearly better. It’s not surprising to me that Parquet + Arrow win this benchmark. This is precisely the use case they were optimized for, and they are just much more mature technologies than Zarr. The fact that it is only a 2x difference is, to me, actually quite encouraging!

That said, it’s nice to have some use cases to drive future optimization of Zarr. I’m optimistic that sharding could help. Having more diagnostics from fsspec (e.g. logs) would also help diagnose the difference.

Hope that is helpful.


Yes, some points of interest in this:

  • the parquet data is a single file, so the most efficient way to read it is with a single HTTP call, and that’s what you would get if you used fsspec.parquet; arrow does this sort of optimization for the special case of s3 only (see the sketch after this list)
  • xarray is loading and converting the times eagerly in the main thread on open, even though we don’t use them. I also see variables like order being read by xarray (because of load?).
  • if you use dask, you are making many small tasks and adding overhead as well as concat costs; you can mitigate with chunks={"gage_id": 100} (also sketched after this list)
  • if you don’t use dask, xarray does not appear to be triggering massively concurrent fetches of all of the chunks, and I don’t know why.
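
Two hedged sketches of the mitigations above (fs, parquet_dataset, and sta are from the earlier examples; zarr_dataset is a placeholder, and the fsspec.parquet behaviour is my understanding rather than something verified against this dataset):

import fsspec.parquet
import pandas as pd
import xarray as xr

# pre-plan the byte ranges for the requested columns so the read is served by
# a small number of merged HTTP calls, on any fsspec-supported store
with fsspec.parquet.open_parquet_file(parquet_dataset, fs=fs, columns=sta) as f:
    df_parquet = pd.read_parquet(f, columns=sta)

# coarser dask chunking along gage_id: fewer, larger tasks and less concat overhead
ds = xr.open_zarr(fs.get_mapper(zarr_dataset), chunks={"gage_id": 100})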

Martin how are you debugging this? What tools do you use to see what fsspec is doing?

I would pdb in the zarr store first, to see what fsspec thing is getting called. I don’t think I told you I was not working this week (it came about as a last minute decision), but I can look into it when I find some minutes.


Although Parquet/Pandas is faster, both are still so fast that I think I’d stick with Zarr/Xarray because of the richer data model.

I was just curious about the performance differences, and thanks to @rabernat and @martindurant, I now have a better understanding!
