I don’t understand the results I’m getting when reading a Zarr dataset with s3fs=0.4.2 (non-async) vs. s3fs=0.5.1 (async).
TL;DR: The dataset opens much faster with async, but when actually reading the data (with 20 workers) there is no significant speedup.
Here are the two notebooks:
s3fs=0.4.2: https://nbviewer.jupyter.org/gist/rsignell-usgs/edec88157437523155cc27ca68f421c3
s3fs=0.5.1: https://nbviewer.jupyter.org/gist/rsignell-usgs/cd7abe76904099fdda66f0553ba5e8fc
Opening this National Water Model Zarr dataset with consolidated metadata in Xarray takes 1 min 7 s with the old s3fs and only 6.7 s with the new async s3fs (10x faster!). I was thinking this must be because it loads the coordinate chunks faster, but the variables containing the coordinate data don’t have multiple chunks, so there shouldn’t be much to fetch concurrently. So that’s the first thing I don’t understand.
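For anyone who wants to reproduce the open-time comparison outside the notebooks, here is a minimal sketch of how the timing could be done. The store URL is a hypothetical placeholder (the actual path is in the notebooks, not in this post), `timed` and `open_store` are names made up for illustration, and `anon=True` assumes the bucket is publicly readable.

```python
import time

# Hypothetical placeholder: substitute the real store path from the notebooks.
ZARR_URL = "s3://example-bucket/path/to/store.zarr"


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def open_store(url=ZARR_URL):
    """Open a consolidated Zarr store lazily with Xarray."""
    # Imported here so the timing helper works without these packages installed.
    import fsspec
    import xarray as xr

    # anon=True is an assumption: it only works for public buckets.
    mapper = fsspec.get_mapper(url, anon=True)
    return xr.open_zarr(mapper, consolidated=True)


# Usage (needs network access plus s3fs, xarray and zarr installed):
# ds, seconds = timed(open_store)
# print(f"opened in {seconds:.1f} s")
```

Running the same script in two environments, one with s3fs pinned to 0.4.2 and one with 0.5.1, should give numbers comparable to the ones above.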
When we get to actually reading the data with a cluster of 20 workers (an older Dask Gateway 0.6.1 cluster on qhub), there is virtually no difference between the read times. For the big computation of the mean river transport over a year (13,000 tasks or so), the times are essentially the same (90 s vs 92 s). I would have expected some speedup, so that’s the second thing I don’t understand.
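One way to reason about where async could matter is a toy model, sketched below with only the standard library. It is not a measurement of s3fs: `fake_get` is a made-up stand-in for a single S3 request, and the 20 ms latency is an arbitrary assumption. It shows that issuing many requests concurrently from one process is much faster than issuing them one by one, while a single request gains nothing. If each Dask task reads only one chunk (an assumption I have not verified here), the second case would be the relevant one.

```python
import asyncio
import time

LATENCY = 0.02  # assumed per-request latency in seconds (arbitrary number)


async def fake_get(key):
    # Stand-in for one object-store GET; sleeps instead of using the network.
    await asyncio.sleep(LATENCY)
    return key


async def fetch_sequential(keys):
    # One request at a time: total time is roughly len(keys) * LATENCY.
    return [await fake_get(k) for k in keys]


async def fetch_concurrent(keys):
    # All requests in flight at once: total time is roughly LATENCY.
    return list(await asyncio.gather(*(fake_get(k) for k in keys)))


def timed(coro_factory):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_factory())
    return result, time.perf_counter() - start


keys = [f"chunk-{i}" for i in range(20)]
seq_result, seq_time = timed(lambda: fetch_sequential(keys))
con_result, con_time = timed(lambda: fetch_concurrent(keys))
one_result, one_time = timed(lambda: fetch_concurrent(keys[:1]))

print(f"20 keys, sequential: {seq_time:.3f} s")
print(f"20 keys, concurrent: {con_time:.3f} s")
print(f" 1 key,  concurrent: {one_time:.3f} s")
```

In this model the concurrency benefit only shows up when a single caller asks for several keys in one go, which may be closer to what happens at open time than during the per-chunk computation.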
@martindurant, I imagine you have some ideas here…