I’ve been struggling with high latency when opening a particular zarr store on S3 that has consolidated metadata. After much R&D (with much help from [1]), I’ve found that the latest s3fs release (available on conda-forge), 0.5.1, largely fixes my problem, cutting the delay from 50 sec to 4 sec.
The problem was that s3fs 0.4.2 gets installed when the version isn’t pinned. I had assumed that the most recent version that doesn’t cause conflicts would be installed. The bare minimum conda env I need looks like this:
conda create -n myenv -c conda-forge python ipykernel matplotlib xarray zarr botocore boto3 s3fs fsspec
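To see which s3fs version the solver actually picked, a quick check from inside the env helps. This is a minimal sketch, not anything from s3fs itself: the helper names (`parse_version`, `installed_version`, `is_at_least`, `report`) are my own, and the 0.5.1 threshold is just the release mentioned above.

```python
from importlib import metadata


def parse_version(text):
    """Turn a dotted version such as '0.5.1' into a tuple of ints."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes like 'rc1' or '+local'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(installed, minimum):
    """True when the installed version is the minimum or newer."""
    return parse_version(installed) >= parse_version(minimum)


def report(package="s3fs", minimum="0.5.1"):
    """Build a one-line summary for the given package."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    status = "ok" if is_at_least(found, minimum) else "older than wanted"
    return f"{package} {found}: {status} (wanted >= {minimum})"


if __name__ == "__main__":
    print(report())
```

Running this in a fresh env right after `conda create` shows immediately whether the older release slipped in.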
The breakthrough was adding version pinning: s3fs=0.5.1. Digging further, it looks like boto3 was the culprit, pulling in the older s3fs version; if I create the env without boto3, s3fs 0.5.1 is installed. However, I’ve been using boto3.Session to start a session with a .aws/credentials file. This still worked fine with s3fs=0.5.1, but for now I’ve switched to botocore.session.Session and skipped the boto3 installation.
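For anyone making the same switch, here is a rough sketch of reading credentials through botocore and handing them to s3fs without boto3 in the env. It is not necessarily how my own code is wired up: the function names are made up for illustration, the "default" profile name is an assumption, and it passes the credentials explicitly via the `key`, `secret` and `token` arguments of `s3fs.S3FileSystem`.

```python
from types import SimpleNamespace


def s3fs_kwargs_from_credentials(creds):
    """Map a botocore-style credentials object to S3FileSystem keyword arguments."""
    kwargs = {"key": creds.access_key, "secret": creds.secret_key}
    # The session token is only present for temporary credentials.
    if getattr(creds, "token", None):
        kwargs["token"] = creds.token
    return kwargs


def open_filesystem(profile="default"):
    """Create an S3FileSystem from a profile in the .aws/credentials file.

    The profile name is an assumption; use whichever profile your file defines.
    """
    # Imported here so the helper above can be used without these packages.
    import botocore.session
    import s3fs

    session = botocore.session.Session(profile=profile)
    creds = session.get_credentials()
    if creds is None:
        raise RuntimeError(f"no credentials found for profile {profile!r}")
    return s3fs.S3FileSystem(**s3fs_kwargs_from_credentials(creds))


if __name__ == "__main__":
    # Stand-in object with placeholder values, just to show the mapping.
    fake = SimpleNamespace(access_key="EXAMPLE_KEY", secret_key="EXAMPLE_SECRET", token=None)
    print(sorted(s3fs_kwargs_from_credentials(fake)))
```

The mapping is kept as a separate function so it can be checked without touching AWS; only `open_filesystem` needs botocore and s3fs to be installed.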
Hopefully this information will spare someone some pain. But it’d be great if the stack defaults didn’t lead to the older, less performant s3fs. Also, while this testing was largely done on my Ubuntu laptop, 0.4.2 is also what’s installed on the AWS Pangeo base notebook image, so I’m still stuck with the high latency there. I’ll follow up on that odd latency in a separate post.
[1] https://github.com/dask/s3fs/issues/285, https://github.com/dask/s3fs/issues/279